CN117725220A - Method, server and storage medium for document characterization and document retrieval


Info

Publication number
CN117725220A
CN117725220A (Application No. CN202311378060.2A)
Authority
CN
China
Prior art keywords
document
representation
documents
graph
semantic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311378060.2A
Other languages
Chinese (zh)
Inventor
张凯
宋凯嵩
康杨杨
刘晓钟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Original Assignee
Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Alibaba Cloud Feitian Information Technology Co ltd
Priority to CN202311378060.2A
Publication of CN117725220A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

The application provides a method for document characterization and document retrieval, a server and a storage medium. According to the method, a plurality of documents to be processed and a document association graph of the plurality of documents are obtained, wherein the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents. The content information of the plurality of documents and the document association graph are input into a document representation model; graph representation learning is performed through the document representation model based on the document association graph initialized with the semantic representations of the documents, the feature representations of the nodes are updated, and the updated feature representation of each node is taken as the document representation of the corresponding document. In this way, the semantic information of single documents and the correlation information among the documents are both learned based on the document association graph, improving the quality of the document representations and, further, the accuracy of document retrieval, recommendation and classification based on the document representations.

Description

Method, server and storage medium for document characterization and document retrieval
Technical Field
The present disclosure relates to computer technology, and in particular, to a method, a server, and a storage medium for document characterization and document retrieval.
Background
Many documents do not exist independently in practical application scenarios, for example documents containing hyperlinks, scientific papers with citation relationships, bidding documents with common tenderers/bidders, and so forth. Most practical applications involve multi-document scenarios: the retrieval, recommendation, classification and summarization of documents such as scientific papers and bidding documents all involve multiple documents with association relations.
In traditional document characterization learning, technicians usually focus on acquiring the semantic information within a single document, mainly using a language model to generate document representations based on the word- and sentence-level semantic information of the single document. Correlation information among documents cannot be learned in this way, so the quality of the document representations is low, and consequently the accuracy of retrieval, recommendation and classification based on the document representations is low.
Disclosure of Invention
The application provides a method, a server and a storage medium for document characterization and document retrieval, which are used to solve the problem that the low quality of the document representations obtained by existing document characterization methods results in low accuracy of retrieval, recommendation and classification based on those representations.
In a first aspect, the present application provides a document characterization method, including:
Acquiring a plurality of documents to be processed and a document association graph of the plurality of documents, wherein the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents; inputting the content information of the plurality of documents and the document association graph into a document representation model, performing graph feature learning based on the document association graph initialized by using the semantic representation of each document through the document representation model, and updating the feature representation of each node; the characteristic representation of each node after updating is used as the document representation of the corresponding document.
In a second aspect, the present application provides a document retrieval method, including:
responding to a document retrieval request, and acquiring query information input by a user; mapping the query information into a vector representation, and performing similarity matching between the vector representation and the document representations of the documents in a document retrieval library to obtain at least one target document matched with the query information; outputting information of the at least one target document; wherein the document representation of each document in the document retrieval library is determined by: acquiring a plurality of documents to be characterized in the document retrieval library and a document association graph of the plurality of documents, wherein the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents; inputting the plurality of documents and the document association graph into a document representation model, performing graph representation learning based on the document association graph initialized using the semantic representation of each document through the document representation model, and updating the feature representation of each node; and taking the updated feature representation of each node as the document representation of the corresponding document.
In a third aspect, the present application provides a server comprising:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the server to perform the method of the first or second aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions which, when executed by a processor, implement the method according to the first or second aspect.
According to the method, the server and the storage medium for document characterization and document retrieval, a plurality of documents to be processed and a document association graph of the plurality of documents are obtained, and the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents. The content information of the plurality of documents and the document association graph are input into a document representation model; graph representation learning is performed through the document representation model based on the document association graph initialized with the semantic representations of the documents, the feature representations of the nodes are updated, and the updated feature representation of each node is taken as the document representation of the corresponding document. Thus, the semantic information of single documents can be learned, the correlation information among the documents can be learned based on the document association graph of the plurality of documents, the quality of the document representations is improved, and the accuracy of document retrieval, recommendation and classification based on the document representations can further be improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic diagram of an exemplary system architecture to which the present application is applicable;
FIG. 2 is a schematic diagram of another example system architecture to which the present application applies;
FIG. 3 is a flowchart of a document characterization method provided in an exemplary embodiment of the present application;
FIG. 4 is an exemplary architecture diagram of a document representation model provided in an exemplary embodiment of the present application;
FIG. 5 is a diagram illustrating an exemplary architecture of a semantic information learning module according to an exemplary embodiment of the present application;
FIG. 6 is a diagram illustrating an exemplary structure of a document characterization model provided in an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a training method for a text representation model provided in an exemplary embodiment of the present application;
FIG. 8 is an exemplary diagram of a training framework for a document characterization model provided in an exemplary embodiment of the present application;
FIG. 9 is a flowchart of a method for document characterization according to an exemplary embodiment of the present application;
FIG. 10 is a flowchart of a document retrieval method provided in another exemplary embodiment of the present application;
FIG. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
Specific embodiments thereof have been shown by way of example in the drawings and will herein be described in more detail. These drawings and the written description are not intended to limit the scope of the inventive concepts in any way, but to illustrate the concepts of the present application to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the accompanying claims.
It should be noted that, the user information (including but not limited to user equipment information, user attribute information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with related laws and regulations and standards, and provide corresponding operation entries for the user to select authorization or rejection.
The terms referred to in this application are explained first:
document: computers are used to generally refer to files produced by various types of text editors as documents. In this embodiment, a document refers to a file in which text content is recorded, and document data contains file content, and related information of creator, creation time, type, use, belonging field, and the like of the file.
Graph data (Graph): i.e. a graph, an ordered binary set denoted G = (V, E), where V = {v1, ..., vN} represents the node set and E ⊆ V × V represents the edge set. The nodes in the graph have feature representations, and an edge represents the association relationship between the two nodes it connects.
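Illustratively, a minimal sketch of this graph structure in Python; the document identifiers, edges and feature values below are made up for illustration and are not part of the claimed method:

    # Minimal sketch of G = (V, E); all names and values are illustrative.
    nodes = ["doc_1", "doc_2", "doc_3"]               # V: one node per document
    edges = [("doc_1", "doc_2"), ("doc_2", "doc_3")]  # E: association relations

    # Feature representations attached to the nodes (toy 4-dim vectors).
    features = {
        "doc_1": [0.1, 0.3, 0.0, 0.5],
        "doc_2": [0.2, 0.1, 0.4, 0.0],
        "doc_3": [0.0, 0.6, 0.1, 0.2],
    }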
Characterization learning (Representation Learning, RL for short): is a subtask in natural language processing, focusing mainly on how to map semantic information etc. in text into a vector representation that can be used for downstream tasks.
Graph Neural Network (GNN for short): a general term for algorithms that use neural networks to learn from graph-structured data, extracting and discovering features and patterns in the data to meet the requirements of graph learning tasks such as clustering, classification, prediction, segmentation and generation. It mainly focuses on how to apply the learning paradigm of neural networks to graph data structures, thereby learning the information of the nodes and of the topological structure of the graph.
Wasserstein distance (Wasserstein Distance, WD for short): a distance used to measure the similarity of two statistical distributions.
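Illustratively, the one-dimensional Wasserstein distance between two empirical distributions can be computed with SciPy; the sample values below are made up:

    # Hedged example: 1-D Wasserstein distance between two empirical
    # distributions; smaller values indicate more similar distributions.
    from scipy.stats import wasserstein_distance

    u = [0.1, 0.4, 0.5, 0.9]
    v = [0.2, 0.3, 0.7, 0.8]
    print(wasserstein_distance(u, v))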
Graph pooling: an operation for condensing a graph into a small set of nodes. Node information representing the entire graph can be selected by global pooling operations such as max pooling or average pooling. A pooling operation reduces the amount of data and the number of nodes through some algorithm, thereby realizing layer-by-layer extraction. Global pooling involves only a readout layer, which reads out the representation of the entire graph using a pooling operation; global pooling is often used in tasks such as graph classification.
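Illustratively, a minimal sketch of a global-pooling readout over a graph's node feature matrix; the toy features stand in for the learned node representations a real model would pool:

    import numpy as np

    node_feats = np.array([[0.1, 0.3],   # one row of features per node
                           [0.2, 0.1],
                           [0.0, 0.6]])

    graph_repr_mean = node_feats.mean(axis=0)  # average pooling readout
    graph_repr_max = node_feats.max(axis=0)    # max pooling readout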
Pre-trained large language model: a pre-trained model obtained by pre-training a large-scale language model (Large Language Model, LLM for short).
In conventional document characterization learning, technicians typically focus on acquiring the semantic information within a single document. However, in practical application scenarios many documents do not exist independently, such as documents containing hyperlinks, scientific papers with citation relations, and bidding documents with common tenderers/bidders. Most practical applications involve multi-document scenarios: the retrieval, recommendation, classification and summarization of documents such as scientific papers and bidding documents all involve multiple documents with association relations.
The traditional document representation learning scheme mainly uses a language model to generate document representations based on the word- and sentence-level semantic information of single documents. Correlation information among documents cannot be learned, the quality of the document representations is low, and therefore the accuracy of retrieval, recommendation and classification based on the document representations is low.
Aiming at these technical problems, the application provides a document characterization method: a plurality of documents to be processed and a document association graph of the plurality of documents are acquired, wherein the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents; the content information of the plurality of documents and the document association graph are input into a document representation model, semantic representations of the documents are generated based on the content information of the documents through the document representation model, the feature representations of the nodes in the document association graph are initialized using the semantic representations of the documents, graph feature learning is performed based on the initialized document association graph, the feature representations of the nodes are updated, and the updated feature representation of each node is taken as the document representation of the corresponding document. Thus, the semantic information of single documents can be learned, the correlation information among the documents can be learned based on the document association graph of the plurality of documents, the quality of the document representations is improved, and the accuracy of document retrieval, recommendation and classification based on the document representations can further be improved.
FIG. 1 is a schematic diagram of an example system architecture to which the present application is applicable. As shown in fig. 1, the system architecture includes a server and an end-side device. The server and the end side equipment are provided with a communication link capable of communicating, so that communication connection between the server and the end side equipment can be realized.
The server is a device with computing capability deployed in the cloud or locally, such as a cloud cluster. The server stores a trained document representation model, and can implement a document characterization function based on the document representation model. In addition, the server may also be responsible for implementing training of the document representation model.
The end-side device may be an electronic device running a downstream application, and specifically may be a hardware device having network communication, computing and information display functions, including but not limited to a smart phone, a tablet computer, a desktop computer, a local server, a cloud server, and the like. The end-side device needs to use the document characterization capabilities of the document representation model when running downstream applications. The downstream application run by the end-side device may be an application system that performs at least one document processing task such as document retrieval, document recommendation or document classification, for example bidding-document retrieval, bidding-document classification, literature retrieval, literature classification, and the like. When implementing the functionality of the downstream application, the document characterization capabilities of the document representation model are needed to generate high-quality document representations of the individual documents based on a plurality of documents in a given document set, and the document processing tasks of the downstream application are then implemented based on the document representations of the individual documents.
Based on the system architecture shown in fig. 1, the end-side device transmits a plurality of documents to be processed to the server. The server receives a plurality of documents to be processed and constructs a document association diagram of the plurality of documents, wherein the document association diagram comprises nodes corresponding to the documents and edges representing association relations among the documents. Further, the server inputs content information of a plurality of documents and a document association diagram into a document representation model, performs graph feature learning based on the document association diagram initialized by using semantic representations of the documents through the document representation model, updates feature representations of the nodes, and takes the feature representations of the nodes after updating as document representations of the corresponding documents. The server returns document representations of the plurality of documents to the end-side device.
The terminal side equipment receives the document representations of the plurality of documents returned by the server, continues to execute the document processing logic of the downstream application according to the document representations of the plurality of documents, realizes the document processing task of the downstream application, and obtains the document processing result.
In an example scenario, taking a downstream application as an example of a document classification system, when classifying a given plurality of documents, the end-side device needs to obtain the document representations of the respective documents, and then classifies based on the document representations of the respective documents to obtain a document classification result. The end-side device sends a document set to be classified to the server, wherein the document set comprises a plurality of documents, at least two of which have association relations. The server receives the document set sent by the end-side device and constructs a document association graph based on the association relations among the documents. The server inputs the content information of the plurality of documents in the document set and the document association graph into a document representation model, performs graph feature learning based on the document association graph initialized using the semantic representations of the documents through the document representation model, updates the feature representations of the nodes, and takes the updated feature representation of each node as the document representation of the corresponding document. The server returns the document representations of the plurality of documents in the document set to the end-side device. The end-side device receives the document representations of the plurality of documents in the document set returned by the server, and classifies the documents according to their document representations to obtain the classification results of the documents.
Based on the system architecture shown in fig. 1, other document processing tasks such as document clustering and multi-document summarization can also be implemented, and these are not particularly limited herein. For example, classification of bidding documents or literature, clustering of bidding documents or literature, generating a summary of a specified plurality of documents, and the like.
Fig. 2 is a schematic diagram of another example system architecture to which the present application applies. As shown in fig. 2, the system architecture includes a server and an end-side device. The server and the end side equipment are provided with a communication link capable of communicating, so that communication connection between the server and the end side equipment can be realized.
The server is a device with computing capability deployed in the cloud or locally, such as a cloud cluster. The server stores a trained document representation model, and can implement a document characterization function based on the document representation model. The server may characterize a plurality of documents in a given document retrieval library using a document representation model, obtain a document representation of each document in the document retrieval library, and store the document representation of each document in the document retrieval library. Further, the server can realize tasks such as document retrieval, recommendation and the like based on a document retrieval library containing document representations of the documents. In addition, the server may also be responsible for implementing training of the document representation model.
The terminal device may be an electronic device used by a user, and specifically may be a hardware device having a network communication function, an operation function, and an information display function, which includes, but is not limited to, a smart phone, a tablet computer, a desktop computer, and the like.
Based on the system architecture shown in fig. 2, a user sends a document processing request to the server through the end-side device, wherein the document processing request contains input information related to the user, including at least one of the following: query information entered by the user, a user portrait, and information of documents related to the user's historical behavior (including but not limited to browsing, clicking, searching, bookmarking, citing and following). The server receives the input information related to the user, maps it into a vector representation, performs similarity matching between the vector representation and the document representations of the documents in the document retrieval library to obtain at least one target document matched with the input information, and returns information of the target document to the end-side device. The end-side device receives the information of the target document returned by the server and displays it to the user.
In an example scenario, taking a document retrieval scenario as an example, a user sends a document retrieval request to the server through the end-side device, the document retrieval request containing query information entered by the user. The server obtains the query information input by the user, maps the query information into a vector representation, and performs similarity matching between the vector representation and the document representations of the documents in the document retrieval library to obtain at least one target document matched with the query information, i.e. the document retrieval result. Further, the server returns information of the target document to the end-side device. The end-side device receives the information of the target document returned by the server and displays it to the user, completing the document retrieval. The document retrieval scenario may specifically be bidding-document retrieval, literature retrieval, retrieval of knowledge documents in various technical fields, and the like, and is not particularly limited herein.
Based on the system architecture shown in fig. 2, document recommendation can also be implemented: according to a first document related to the user's behavior, target documents whose document representations have a high similarity with that of the first document are searched for in the document retrieval library and recommended to the user, realizing personalized document recommendation based on user behavior. For example, bidding-document recommendation, literature recommendation, recommendation of knowledge documents in various technical fields, and the like are not particularly limited herein.
The following describes the technical solutions of the present application and how the technical solutions of the present application solve the above technical problems in detail with specific embodiments. The following embodiments may be combined with each other, and the same or similar concepts or processes may not be described in detail in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
FIG. 3 is a flowchart of a document characterization method provided in an exemplary embodiment of the present application. The execution body of the embodiment may be a server running a document representation model, and specifically may be a server in any of the foregoing system architectures. As shown in fig. 3, the method specifically comprises the following steps:
step S301, a plurality of documents to be processed and document association diagrams of the plurality of documents are obtained, wherein the document association diagrams comprise nodes corresponding to the documents and edges representing association relations among the documents.
The method is suitable for performing multi-document characterization learning on a plurality of documents with certain relevance, and obtaining high-quality document representations of all the documents, and can be particularly applied to application scenes such as document classification, clustering, retrieval, recommendation, multi-document summarization and the like. When applied to different application scenarios, the plurality of documents to be processed may be different. For example, the plurality of documents to be processed may be documents in a set of documents given by the user/downstream application to be categorized or clustered, or all or part of the documents in a particular document retrieval library. The content of the document to be processed may include information such as text, images, program code, hyperlinks, etc., without limitation.
The association relationship between documents represents the correlation between documents, and may specifically be citation, the same author, the same target object, the same described object, and the like. For example, documents of the paper class may have citation relationships between them, documents of the bidding class may belong to the same tenderer/signer, documents of the knowledge class may describe the same item, etc. In different application scenarios and for different document sets, the association relationships between documents may differ. The association relations between the documents in a given document set of a specific application scenario can be determined by data analysis and mining, using existing techniques, and are not particularly limited here.
And for a given plurality of documents, constructing nodes corresponding to the documents, and establishing edges between any two corresponding nodes of the documents with the association relation according to the association relation between the documents to obtain a document association diagram of the plurality of documents. It should be noted that the same document association graph may include one or more edges of different types, where the edges of different types represent different association relationships.
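Illustratively, the following is a minimal sketch of constructing such a document association graph from mined pairwise association relations; the function name, document identifiers and relation labels are illustrative, not part of the claimed method:

    from collections import defaultdict

    def build_association_graph(doc_ids, relations):
        """relations: (doc_a, doc_b, relation_type) triples, e.g. from
        citation analysis or shared-author mining."""
        adjacency = defaultdict(list)
        for a, b, rel_type in relations:
            # Undirected edges; distinct rel_type values model the
            # different edge types that may coexist in one graph.
            adjacency[a].append((b, rel_type))
            adjacency[b].append((a, rel_type))
        return set(doc_ids), adjacency

    nodes, adj = build_association_graph(
        ["p1", "p2", "p3"],
        [("p1", "p2", "cites"), ("p1", "p3", "same_author")],
    )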
In this step, the end-side device optionally provides a plurality of documents to be processed to the server. The server acquires a plurality of documents to be processed from the terminal side equipment, analyzes the association relation among the plurality of documents, and constructs a document association diagram of the plurality of documents based on the association relation among the plurality of documents. The document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents.
Optionally, the end-side device provides the server with a plurality of documents to be processed, and a document association graph of the plurality of documents. The server receives a plurality of documents transmitted by the terminal side device and a document association diagram of the plurality of documents. The end-side device is responsible for constructing a document association graph of a plurality of documents.
Step S302, inputting content information of a plurality of documents and a document association graph into a document representation model, performing graph feature learning based on the document association graph initialized by using semantic representations of the documents through the document representation model, and updating feature representations of the nodes.
In this embodiment, the server inputs content information of each document and a document association graph into a document representation model, the document representation model generates semantic representations of each document based on the content information of each document, initializes feature representations of each node in the document association graph by using the semantic representations of each document obtained based on single document representation, performs graph feature learning on the initialized document association graph, updates the feature representations of each node, and outputs an updated document association graph. Therefore, on the basis of semantic representation of each document obtained based on single document representation, the topological structure information of the document association graph is learned through the graph feature learning, namely the correlation information among a plurality of documents is learned, and the quality of the feature representation of each node in the document association graph is improved.
The content information of the document input to the document representation model may include the complete content information of the document, or may be part of content information of a pre-specified document, such as a document body, a abstract, a key paragraph, etc., which is not specifically limited herein.
In an alternative embodiment, the server may obtain semantic representations of each document using other models or algorithms, initialize document association graphs of a plurality of documents using the semantic representations of each document, input the initialized document association graphs into the document representation model, perform graph feature learning based on the input initialized document association graphs, update feature representations of each node, and output updated document association graphs.
Step S303, the characteristic representation of each updated node is used as the document representation of the corresponding document.
And taking the high-quality characteristic representation of each node in the updated document association diagram as the document representation of the document corresponding to the node, so that the high-quality document representation of each document can be obtained.
Further, based on the high-quality document representation of each document, document processing tasks such as document retrieval, recommendation, classification, clustering, multi-document summarization and the like can be realized, so that the accuracy of document processing is improved.
By taking document clustering as an example, the method of the embodiment can obtain high-quality document representations of a plurality of documents to be clustered, and cluster the documents based on the obtained document representations to obtain a document clustering result, so that the accuracy of document clustering can be improved.
Illustratively, taking document retrieval as an example, a high-quality document representation of each document in the document retrieval library can be obtained by the method of the present embodiment. When the document retrieval is carried out, the vector representation of the input information related to the user is subjected to similarity matching with the high-quality document representation in the document retrieval library, the target document matched with the input information is determined, the document retrieval result is obtained, and the accuracy and recall rate of the document retrieval can be improved.
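Illustratively, a minimal sketch of the similarity-matching step, assuming cosine similarity (the embodiment does not fix a specific similarity measure) and toy vectors:

    import numpy as np

    def top_k_matches(query_vec, doc_reprs, k=3):
        """doc_reprs: document id -> stored document representation."""
        q = query_vec / np.linalg.norm(query_vec)
        scores = {d: float(q @ (v / np.linalg.norm(v)))
                  for d, v in doc_reprs.items()}
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

    reprs = {"d1": np.array([1.0, 0.0]), "d2": np.array([0.6, 0.8])}
    print(top_k_matches(np.array([1.0, 0.2]), reprs, k=1))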
According to the method, a plurality of documents to be processed and a document association graph of the plurality of documents are obtained, and the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents. The content information of the plurality of documents and the document association graph are input into a document representation model; semantic representations of the documents are generated based on the content information of the documents through the document representation model, the feature representations of the nodes in the document association graph are initialized with the semantic representations of the documents, graph feature learning is performed based on the initialized document association graph, the feature representations of the nodes are updated, and the updated feature representation of each node is taken as the document representation of the corresponding document. Thus, the semantic information of single documents can be learned, the correlation information among the documents can be learned based on the document association graph of the plurality of documents, the quality of the document representations is improved, and the accuracy of document retrieval, recommendation and classification based on the document representations can further be improved.
FIG. 4 is an exemplary architecture diagram of a document representation model provided in an exemplary embodiment of the present application. In an alternative embodiment, as shown in FIG. 4, in the foregoing method embodiment, the document representation model includes a semantic information learning module and a correlation information learning module. The semantic information learning module is used for: performing semantic information characterization learning on the content information of each document to obtain the semantic representation of each document. The correlation information learning module is used for: initializing the feature representation of each node in the document association graph using the semantic representation of each document, performing graph feature learning based on the initialized document association graph, and updating the feature representation of each node.
Based on the model architecture shown in fig. 4, the foregoing step S302 is specifically implemented as follows:
Step S3021, inputting the content information of the plurality of documents into the semantic information learning module, performing semantic information characterization learning on the content information of each document through the semantic information learning module to obtain the semantic representation of each document, and inputting the semantic representations of the documents into the correlation information learning module.
In the embodiment, semantic information characterization learning is performed on the content information of the single document through the semantic information learning module, so that semantic information of the level of words, sentences and the like in the document can be well learned, and high-quality semantic representation of the single document is obtained. The obtained high-quality semantic representation of the single document can be output to a subsequent relevance learning module for initializing the feature representation of each node in the document association graph.
Step S3022, inputting the document association graph into a correlation information learning module, initializing the feature representation of each node in the document association graph by using the semantic representation of each document through the correlation information learning module, and performing graph feature learning based on the initialized document association graph to update the feature representation of each node.
In this embodiment, the document association graph is used as an input of the relevance information learning module, and the relevance information learning module uses the high-quality semantic representation of each document output by the semantic information learning module to initialize the feature representation of each node in the document association graph, so that the feature representation of each node in the initialized document association graph is the high-quality semantic representation of the corresponding document. Further, the correlation information learning module performs graph feature learning based on the initialized document association graph, updates the feature representation of each node, and can learn the topological structure information of the document association graph based on the document association graph of multiple documents, wherein the topological structure information of the document association graph contains the correlation information among the documents, so that the correlation information among the documents can be learned in the process of updating the feature representation of the nodes, the feature representation of the updated nodes contains the semantic information of a single document and the correlation information among the documents, and the feature representation of the updated nodes is used as the document representation of the corresponding document, thereby improving the quality of the document representation.
In another alternative embodiment, the document representation model may also include only a relevance information learning module, which is implemented using the graph neural network GNN. When multi-document characterization is carried out, semantic representation of a single document is carried out on content information of a plurality of documents through an existing document characterization method, semantic representation of each document is obtained, characteristic representation of each node in a document associated graph of the plurality of documents is initialized through the semantic representation of each document, an initialized document associated graph is obtained, the initialized document associated graph is input into a correlation information learning module to carry out graph characterization learning, and the characteristic representation of each node in the document associated graph is updated.
Fig. 5 is a diagram illustrating a structural example of a semantic information learning module according to an exemplary embodiment of the present application, and in an alternative embodiment, as shown in fig. 5, based on the model architecture shown in fig. 4, the semantic information learning module includes: the system comprises an entity relation diagram construction module, a first diagram neural network, a text representation model and a semantic representation fusion module.
The entity relationship graph construction module is used for: constructing an Entity Relationship graph (E-R graph for short) of each document according to the content information of each document. The first graph neural network is used for: performing graph representation learning on the entity relationship graph of each document respectively to obtain the feature representations of the entities contained in each document, and fusing the feature representations of the entities contained in each document to obtain a semantic representation of the entity level of each document. The first graph neural network may be implemented by a graph convolutional network (Graph Convolutional Network, GCN for short), the Neural Network for Graphs (NN4G for short), a graph attention network (Graph Attention Network, GAT for short), a graph isomorphism network (Graph Isomorphism Network, GIN for short), or other graph neural networks capable of representation learning on graph data, which is not specifically limited herein.
The entity relation graph (E-R graph) of the document is a graph structure constructed by taking named entities contained in the document as nodes and taking the relation among the named entities as edges. Specifically, based on content information of any document, named entities contained in the document and relations among the named entities are extracted, corresponding nodes are created based on the named entities contained in the document, edges among the corresponding nodes are built based on the relations among the named entities, and an entity relation graph of the document is built. The feature vector of the node in the entity relationship graph may be a vector representation of the corresponding named entity, and may specifically be determined according to a word vector of the named entity. For example, word segmentation is performed for any named entity. If the named entity contains only one word after word segmentation, the word vector of the word is used as the vector representation of the named entity and is used as the characteristic representation of the corresponding node of the named entity. If the named entity comprises a plurality of words, the word vectors of the words are averaged to obtain the vector representation of the named entity, and the vector representation is used as the characteristic representation of the corresponding node of the named entity. In addition, the feature vectors of the nodes in the entity-relationship graph may also be determined by random initialization or other methods, which are not specifically limited herein.
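Illustratively, a minimal sketch of the node-feature rule described above, assuming a pre-built word-vector lookup table (the name word_vectors is hypothetical):

    import numpy as np

    def entity_feature(entity_words, word_vectors, dim=100):
        """entity_words: the words of one named entity after segmentation."""
        vecs = [word_vectors[w] for w in entity_words if w in word_vectors]
        if not vecs:
            return np.random.randn(dim)  # fallback: random initialization
        return np.mean(vecs, axis=0)     # one word => its own word vector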
Optionally, the first graph neural network may include a graph coding layer and a graph pooling layer, where the graph coding layer is configured to perform graph feature learning on the entity relationship graph of each document, code feature representations of each node in the entity relationship graph, update feature representations of nodes in the entity relationship graph, and output the updated entity relationship graph. And taking the characteristic representation of each node in the updated entity relation diagram as the characteristic representation of the corresponding entity, and obtaining the characteristic representation of the entity contained in each document. The updated entity relation graph is input to a graph pooling layer, and the graph pooling layer is used for performing graph pooling operation on the updated entity relation graph to realize fusion of characteristic representations of all nodes and output graph representations of the entity relation. The graph representation of the entity relationship is used as the semantic representation of the corresponding document, wherein the semantic representation is obtained by fusing the characteristic representations of the entities contained in the document and is the semantic representation of the entity level of the document.
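Illustratively, a minimal numpy sketch of the "graph coding layer plus graph pooling layer" idea: one GCN-style propagation step followed by a mean readout. This is a sketch under toy shapes; in practice the weights are learned and several layers are stacked:

    import numpy as np

    def gcn_layer(adj, feats, weights):
        a_hat = adj + np.eye(adj.shape[0])            # add self-loops
        d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
        a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt      # symmetric normalization
        return np.maximum(a_norm @ feats @ weights, 0.0)  # ReLU activation

    def readout(node_feats):
        return node_feats.mean(axis=0)  # graph-level (entity-level) representation

    A = np.array([[0., 1.], [1., 0.]])   # toy 2-node entity relationship graph
    H = gcn_layer(A, np.random.randn(2, 4), np.random.randn(4, 4) * 0.1)
    entity_level_repr = readout(H)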
Optionally, the first graph neural network may include a graph coding layer, where the graph coding layer is configured to perform graph feature learning on the entity relationship graph of each document, code feature representations of each node in the entity relationship graph, update feature representations of nodes in the entity relationship graph, and output the updated entity relationship graph. And taking the characteristic representation of each node in the updated entity relation diagram as the characteristic representation of the corresponding entity, and obtaining the characteristic representation of the entity contained in each document. Further, feature representations of entities contained in any document are averaged or spliced to achieve fusion of feature representations of all nodes, and semantic representations of entity levels of the document are obtained.
In this embodiment, the text representation model is used to: perform text characterization on the content information of each document respectively to obtain a semantic representation of the text level of each document. The text representation model may be implemented using any existing text representation model/algorithm. Illustratively, the text representation model may be implemented using various types of language models (Language Model, LM for short), such as Bidirectional Encoder Representations from Transformers (BERT for short), a large language model, a pre-trained language model, and the like.
The semantic representation fusion module is used for: and fusing the semantic representation of the entity level and the semantic representation of the text level of each document to obtain the semantic representation of each document. Specifically, the semantic representation fusion module obtains semantic representations of all documents by averaging semantic representations of entity levels and semantic representations of text levels of all documents; or, splicing the semantic representation of the entity level and the semantic representation of the text level of each document to obtain the semantic representation of each document.
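Illustratively, the two fusion options named above can be sketched as follows (averaging assumes the two representations have equal dimensions):

    import numpy as np

    def fuse_average(entity_repr, text_repr):
        return (entity_repr + text_repr) / 2.0   # element-wise average

    def fuse_concat(entity_repr, text_repr):
        return np.concatenate([entity_repr, text_repr])  # splicing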
Based on the structural example of the semantic information learning module shown in fig. 5, the foregoing step S3021 may be specifically implemented as follows:
Inputting the content information of the plurality of documents into the entity relationship graph construction module, and constructing the entity relationship graph of each document according to the content information of each document through the entity relationship graph construction module; inputting the entity relationship graph of each document into the first graph neural network, and performing graph representation learning on the entity relationship graph corresponding to each document respectively through the first graph neural network to obtain the semantic representation of the entity level of each document; inputting the content information of the plurality of documents into the text representation model, and performing text characterization on the content information of each document respectively through the text representation model to obtain the semantic representation of the text level of each document; and fusing the semantic representation of the entity level and the semantic representation of the text level of each document through the semantic representation fusion module to obtain the semantic representation of each document. The semantic representations of the documents are output to the correlation information learning module.
In another alternative embodiment, the semantic information learning module may be implemented using a text representation model, which can extract the semantic information contained in the content text of a single document. Illustratively, the semantic information learning module may be implemented using a language model (Language Model, LM for short), such as Bidirectional Encoder Representations from Transformers (BERT for short), a large language model, a pre-trained language model, etc.
In the aforementioned step S3021, a plurality of documents are input into a text representation model, and each document is text-represented by the text representation model, so that a text-level semantic representation of each document is obtained as a semantic representation of each document.
In another alternative embodiment, the semantic information learning module may be implemented based on a graph neural network GNN, and construct an entity relationship graph of the document by extracting named entities and relationships between named entities contained in the document based on content information of the single document. The nodes in the entity relation graph represent named entities contained in the content information of the document, and the feature vectors of the nodes can be vector representations of the corresponding named entities, and the edges in the entity relation graph represent the relations among the named entities. And inputting the entity relation diagram of the document into a graph neural network GNN of a semantic information learning module for graph characterization learning, updating the feature vectors of the nodes in the entity relation diagram, obtaining the encoded entity relation diagram, and taking the feature vectors of the nodes in the encoded entity relation diagram as the encoding vectors of the named entities. And integrating the code vectors of the named entities contained in the document to obtain the semantic representation of the document. Specifically, when the coding vectors of the named entities contained in the document are integrated to obtain the semantic representation of the document, the coding vectors of the named entities contained in the document can be averaged to obtain the semantic representation of the document. Or, the coded entity relation graph is subjected to graph pooling, and the graph representation of the whole entity relation graph is read out and used as the semantic representation of the corresponding document.
In the aforementioned step S3021, the entity relationship graph of each document is input into a graph neural network for semantic information learning, and graph feature learning is performed on the entity relationship graph of each document through the graph neural network, so as to obtain the semantic representation of the entity hierarchy of each document, which is used as the semantic representation of each document.
In any of the foregoing method embodiments, the entity-relationship diagram of any document is obtained by: named entity recognition (Named Entity Recognition, NER for short) and relation extraction (Relation Extraction, RE for short) are performed on the content text of the document, and the entities (named entities) contained in the document and the relation between the entities are obtained. And constructing nodes corresponding to the entities contained in the document, and constructing edges according to the relation among the entities to obtain an entity relation diagram of the document. The named entity recognition NER and the relation extraction RE for the content text of the document may be implemented by using the existing named entity recognition method and the existing relation extraction method, which are not limited herein. Note that the content of the document may include information such as text, images, program code, hyperlinks, and the like. When the named entity identification and relation extraction are performed on the content text of the document, the named entity identification and relation extraction can be performed on the basis of the text content (including program codes, hyperlinks and the like) in the document, so that the entities (namely named entities) contained in the document and the relation among the entities can be obtained. Or, image recognition can be performed on the image in the document to obtain text information contained in the image in the document, the text information contained in the image in the document is also used as the content text of the document, and the named entity recognition and relation extraction are performed according to the text content in the document and the text information contained in the image to obtain entities (namely named entities) contained in the document and the relation among the entities.
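Illustratively, once NER and RE have produced (entity, relation, entity) triples, the entity relationship graph itself can be assembled as in the following sketch; the extraction step is left abstract since any existing method may be used, and the example triple is made up:

    def build_er_graph(triples):
        """triples: iterable of (head_entity, relation, tail_entity)."""
        nodes, edges = set(), []
        for head, relation, tail in triples:
            nodes.update([head, tail])
            edges.append((head, tail, relation))  # typed edge
        return nodes, edges

    nodes, edges = build_er_graph(
        [("graph neural network", "applied_to", "document retrieval")]
    )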
In an alternative embodiment, based on the model architecture shown in fig. 4, the correlation information learning module includes: an initialization module and a second graph neural network. Wherein, the initialization module is used for: and initializing the characteristic representation of each node in the document association graph by using the semantic representation of each document. The second graph neural network is used for: and performing graph feature learning based on the initialized document association graph, and updating the feature representation of each node.
In this embodiment, the foregoing step S3022 may be specifically implemented as follows:
inputting the semantic representations of the documents and the document association graph into the initialization module, initializing the feature representations of the nodes in the document association graph with the semantic representations of the documents through the initialization module, inputting the initialized document association graph into the second graph neural network, performing graph feature learning based on the initialized document association graph through the second graph neural network, and updating the feature representations of the nodes.
The second graph neural network comprises a graph coding layer, the graph coding layer is used for performing graph feature learning on the initialized document association graph, coding feature representations of all nodes in the document association graph, updating the feature representations of the nodes in the document association graph, and outputting the updated document association graph. And taking the characteristic representation of each node in the updated document association graph as the document representation of the corresponding document, and obtaining the document representation of each document.
The second graph neural network may be specifically implemented by using a graph convolutional network (GCN), a neural network for graphs (NN4G), a graph attention network (GAT), a graph isomorphism network (GIN), or another graph neural network capable of representation learning on graph data, which is not specifically limited herein. Alternatively, the second graph neural network and the graph coding layer of the first graph neural network of the semantic information learning module may adopt the same structure, but they do not share parameters.
In another alternative embodiment, the correlation information learning module may include only the second graph neural network. In this case, before the document association graph is input into the correlation information learning module, the semantic representation of each document is used to initialize the feature representation of each node in the document association graph to obtain the initialized document association graph, which is then input into the correlation information learning module.
Illustratively, fig. 6 is a diagram illustrating a structure of a document representation model according to an exemplary embodiment of the present application. As shown in fig. 6, the document representation model includes a semantic information learning module and a correlation information learning module. The semantic information learning module includes an entity relation graph construction module, a first graph neural network, a language model, and a semantic representation fusion module; the correlation information learning module includes an initialization module and a second graph neural network. When multi-document characterization is performed, the content information of the plurality of documents is input into the language model, which performs text representation on the content information of each document to obtain the semantic representation of the text level of each document; these representations are input into the semantic representation fusion module. The content information of the plurality of documents is also input into the entity relation graph construction module, which constructs the entity relation graph of each document according to its content information. The entity relation graph of each document is input into the first graph neural network, which performs graph feature learning on the entity relation graph of each document to obtain the feature representations of the entities contained in each document; the feature representations of the entities contained in each document are fused to obtain the semantic representation of the entity level of each document, which is input into the semantic representation fusion module. The semantic representation fusion module fuses the semantic representation of the entity level and the semantic representation of the text level of each document to obtain the semantic representation of each document. The semantic representation of each document is input into the initialization module of the correlation information learning module, which uses it to initialize the feature representation of each node in the document association graph, obtaining the initialized document association graph; the initialized document association graph is then input into the second graph neural network, which performs graph feature learning based on the initialized document association graph and updates the feature representation of each node. The updated feature representation of each node is used as the document representation of the corresponding document.
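Illustratively, the fusion performed by the semantic representation fusion module can be sketched as follows; mean pooling of the entity features and concatenation followed by a linear layer are assumptions made for illustration, since the embodiment leaves these choices open.

import torch

def fuse_semantic_representation(text_level, entity_reps, fuse_layer):
    # text_level: [d] text-level representation from the language model
    # entity_reps: [num_entities, d] feature representations of the entities
    entity_level = entity_reps.mean(dim=0)                  # entity-level representation
    fused = torch.cat([text_level, entity_level], dim=-1)   # combine both levels
    return fuse_layer(fused)                                # semantic representation, [d]

fuse_layer = torch.nn.Linear(2 * 128, 128)                  # d = 128 chosen arbitrarily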
FIG. 7 is a flowchart of a training method for a document representation model according to an exemplary embodiment of the present application. In an alternative embodiment, as shown in FIG. 7, the training process of the document representation model used in the foregoing method embodiments is as follows:
step S701, constructing a document representation model.
In this embodiment, a document representation model based on the model architecture shown in fig. 4 is constructed, and specifically, the model architecture in any of the foregoing method embodiments may be used, for example, fig. 4 may be combined with other alternative embodiments to determine a model structure, or the model structure shown in fig. 6, which is not limited herein specifically.
Step S702, inputting a plurality of document samples in a document set for training and document association graphs of the plurality of document samples into a document representation model, masking initial edges in the document association graphs of the plurality of document samples through the document representation model, initializing feature representations of nodes in the document association graphs of the plurality of document samples by using semantic representations of the document samples, obtaining an initialized document association graph, performing graph feature learning based on the initialized document association graph, and updating the feature representations of the nodes.
In this embodiment, the training data for training the document representation model includes a plurality of document sets, where any document set includes a plurality of document samples with a certain degree of correlation, i.e., association relations exist between some of the document samples. The plurality of document sets can be obtained by grouping documents in existing document characterization training data, the documents in each group forming a document set. The association relations among documents are intrinsic characteristics of the documents; they are determined by analyzing the content of the documents or their related attribute information (such as author, time, purpose, field, and source), or obtained by manual labeling. In addition, the division of document sets may be performed automatically by the server according to related attributes such as the field and source of the documents, performed randomly by the server, or performed manually by relevant technicians, which is not specifically limited herein.
In this step, the process of obtaining the document association graphs of the plurality of document samples based on the plurality of document samples in one document set is consistent with the implementation manner of obtaining the document association graphs of the plurality of documents in the foregoing step S301, and details of the related content in the foregoing embodiment are specifically referred to and will not be described herein.
The specific implementation of step S702 is similar to the foregoing step S302, except for the initialization of the document association graph: in step S302 only the feature representations of the nodes are initialized, whereas in step S702 the feature representations of the nodes are initialized and, in addition, the initial edges in the document association graph are masked. The other processing procedures are consistent with step S302 and are not repeated here.
In this embodiment, a training task of the document representation model is to predict the edges between nodes in the document association graph. To construct this training task, in the process of initializing the document association graph of the plurality of document samples, the semantic representations of the document samples are used to initialize the feature representations of the nodes, and the initial edges in the document association graph of the plurality of document samples are additionally masked, so as to obtain the initialized document association graph. In the subsequent step S703, the association relations between the nodes, i.e., whether edges exist between the nodes, are predicted based on the updated feature representations of the nodes.
For example, a graph G may be represented as an ordered pair G = (V, E), where V = {v_1, ..., v_N} is the node set and E ⊆ V × V is the edge set. Graph G may be stored as a feature matrix X ∈ R^{N×d} together with an adjacency matrix A ∈ R^{N×N}. The feature matrix X is formed by the feature representations of the nodes in the graph, d is the dimension of a node's feature representation, and N is the number of nodes contained in G. The adjacency matrix A records whether any two nodes in the graph are connected by an edge: A_ij = 1 indicates that there is an edge between nodes v_i and v_j, and A_ij = 0 indicates that there is no edge between them. Masking the initial edges in the document association graph of the plurality of document samples may be done by covering the 1s in the adjacency matrix with a specified mask value (e.g., 0 or NULL), where NULL indicates uncertainty as to whether an edge exists.
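Illustratively, storing a small graph as (X, A) and masking its initial edges can be sketched as follows; the choice of 0 as the mask value is one of the options mentioned above.

import numpy as np

N, d = 4, 8
X = np.random.randn(N, d)         # feature matrix X in R^{N x d}
A = np.zeros((N, N), dtype=int)   # adjacency matrix A in R^{N x N}
A[0, 1] = A[1, 0] = 1             # an initial edge between documents 0 and 1
A_target = A.copy()               # kept aside as supervision for the loss
A_masked = np.zeros_like(A)       # the 1s are covered with the mask value 0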
Step S703, predicting the association relation among the nodes in the document association graph according to the updated characteristic representation of the nodes.
After the updated feature representations of the nodes are obtained, they can be input into a first classifier, which performs classification prediction on whether association relations exist between the nodes in the document association graph, thereby determining the association relations between the nodes and obtaining a prediction result. The prediction result indicates whether edges exist between the nodes in the document association graph (i.e., whether association relations exist).
The first classifier for predicting the association relationship between the nodes in the document association graph may be implemented by using a Multi-Layer Perceptron (MLP for short), or other classification models or algorithms, for example, classification algorithms based on a support vector machine, a decision tree, etc., which are not specifically limited herein.
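Illustratively, a minimal MLP link predictor can be sketched as follows; pairing two node representations by concatenation is an illustrative choice, since the embodiment only requires an MLP-style classifier.

import torch

class EdgeClassifier(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(2 * dim, dim),
            torch.nn.ReLU(),
            torch.nn.Linear(dim, 1))

    def forward(self, h_i, h_j):
        # h_i, h_j: updated feature representations of a pair of nodes
        return self.mlp(torch.cat([h_i, h_j], dim=-1))   # logit: edge vs. no edge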
Step S704, calculating a first loss according to the prediction result of the association relation among the nodes in the document association diagram and the initial edge in the document association diagram, and updating the parameters of the document representation model according to the first loss.
In this step, a cross-entropy (Cross-Entropy, CE for short) loss is calculated as the first loss according to the prediction result of the association relations between the nodes in the document association graph, determined based on the updated feature representations of the nodes, and the initial edges in the document association graph. Further, the parameters of the document representation model are updated by back-propagation according to the first loss. Specifically, gradient descent or another common parameter updating method can be used. After training is completed, a trained document representation model is obtained.
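Illustratively, the first loss and the parameter update can be sketched as follows, reusing the EdgeClassifier above; scoring all node pairs is an illustrative simplification, as the embodiment does not specify a pair-sampling strategy.

import torch

def compute_first_loss(h, A_target, edge_clf):
    # h: [N, dim] updated node representations
    # A_target: initial adjacency matrix as a torch tensor (before masking)
    n = h.size(0)
    idx_i, idx_j = torch.triu_indices(n, n, offset=1)   # all node pairs (i < j)
    logits = edge_clf(h[idx_i], h[idx_j]).squeeze(-1)
    labels = A_target[idx_i, idx_j].float()             # 1 where an initial edge existed
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, labels)

# first_loss = compute_first_loss(h, A_target, edge_clf)
# first_loss.backward(); optimizer.step()               # back-propagation update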
It should be noted that, the first classifier used in step S703 may be a pre-trained first classifier, and parameters of the first classifier remain unchanged during the training process of the document characterization model; or, in the training process of the document representation model, the parameters of the first classifier and the document representation model are updated based on the first loss, so that the classification accuracy of the first classifier can be continuously improved in the training process, and the training effect of the document representation model can be improved.
According to the method of this embodiment, semantic representation is performed on each individual document to obtain a semantic representation containing rich semantic information, and the association relations among multiple documents are modeled by taking the documents as nodes and the association relations between them as edges to build a document association graph, whose topology contains the correlation information of the documents. In the training process of the document representation model, the semantic representations of the documents are used to initialize the feature representations of the nodes in the document association graph; the edges in the document association graph are masked according to a mask strategy, and the graph is input into the graph neural network for graph feature learning; the feature representations of the nodes are updated; the association relations between the nodes are predicted from the updated feature representations; and the loss is calculated from the prediction result and the initial edges of the document association graph to update the model parameters. The document representation model thereby acquires both the capability of learning document semantic information and the capability of learning document correlation information, which improves the quality of multi-document characterization.
In an alternative embodiment, based on the structure of the semantic information learning module shown in fig. 5, the semantic information learning module includes: an entity relation graph construction module, a first graph neural network, a text representation model, and a semantic representation fusion module. For the specific structure, refer to the related content of the foregoing embodiments, which is not repeated here. The generation of the semantic representation of each document sample by the document representation model in the foregoing step S702 is specifically implemented in the following manner:
The content information of each document sample in the document set is input into the entity relation graph construction module, which constructs the entity relation graph of each document sample according to its content information. The entity relation graph of each document sample is input into the first graph neural network, which performs graph feature learning on the entity relation graph of each document sample to obtain the semantic representation of the entity level of each document sample. The key content information of the document samples in the document set is input into the text representation model, which performs text representation on the key content information of each document sample to obtain the semantic representation of the text level of each document sample. The semantic representation of the entity level and the semantic representation of the text level of each document sample are then fused to obtain the semantic representation of each document sample.
Ideally, the entity-level information helps the text-level information attend more to useful information and reduces the weight on useless information, while the text-level information enriches the entity-level information. In the actual training process, however, the applicant found that training the text representation model and the first graph neural network independently makes the two levels of information difficult to balance.
In this embodiment, in the training process of the document representation model, in order to balance learning between the text representation model and the first graph neural network, a second loss is added to restrict learning of two levels of information. The method is realized by the following steps:
predicting class labels of all the document samples according to semantic representations of entity levels of all the document samples to obtain a first prediction result, and predicting class labels of all the document samples according to semantic representations of text levels of all the document samples to obtain a second prediction result; calculating the distance between the first predicted result and the second predicted result to obtain a second loss; and updating parameters of the semantic information learning module according to the second loss.
The category label of the document sample represents the category of the document sample, specifically, the field, the theme, the type and the like of the document sample belong to, and the actual category label of the document sample is determined by pre-marking.
Illustratively, predicting the category label of each document sample according to the semantic representation of the entity level of each document sample may be implemented using a second classifier. The semantic representations of the entity level of the document samples are input into the second classifier, which performs classification prediction on the categories of the document samples to obtain the first prediction result. The first prediction result includes the distribution of the category labels of each document sample. The distribution of the category labels of any document sample is a discrete distribution and can be represented as a vector of dimension K, where K is the total number of category labels; a 1 in the vector indicates that the document has the corresponding category label, and a 0 indicates that it does not. Illustratively, the first prediction result may be a complete discrete distribution formed by concatenating the category label distributions of the individual document samples. The second classifier may be implemented by using a Multi-Layer Perceptron (MLP for short), or other classification models or algorithms, such as classification algorithms based on a support vector machine or a decision tree, which are not specifically limited herein.
Illustratively, predicting the category label of each document sample according to the semantic representation of the text level of each document sample may be implemented using a third classifier. The semantic representation of the text level of each document sample is input into the third classifier, which performs classification prediction on the categories of the document samples to obtain the second prediction result. The second prediction result contains the distribution of the category labels of each document sample, similar to the information items contained in the first prediction result. The third classifier may be implemented by using a Multi-Layer Perceptron (MLP for short), or other classification models or algorithms, such as classification algorithms based on a support vector machine or a decision tree, which are not specifically limited herein.
The second classifier and the third classifier may be implemented by using the same classifier or different classifiers, which are not specifically limited herein. In addition, the second classifier and the third classifier may be pre-trained classifiers, and parameters of the second classifier and the third classifier remain unchanged during training of the document characterization model.
Optionally, in the training process of the document characterization model, calculating a first cross entropy loss based on the first prediction result and an actual class label of the document sample, and updating parameters of the second classifier based on the first cross entropy loss; and calculating a second cross entropy loss based on the second prediction result and the actual class label of the document sample, and updating parameters of the third classifier based on the second cross entropy loss, so that the classification accuracy of the two classifiers is continuously improved in the training process, and the training effect of the document representation model can be improved.
Alternatively, the distance between the first prediction result and the second prediction result may specifically be the Wasserstein distance between the two discrete distributions, the KL (Kullback-Leibler) divergence, the Manhattan distance, or another measure of the similarity between two discrete distributions, which is not specifically limited herein. In a preferred embodiment, the Wasserstein distance between the first prediction result and the second prediction result is calculated and used as the second loss to balance the learning between the first graph neural network and the text representation model, which can improve the effect of model training.
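Illustratively, the 1-D Wasserstein distance between two discrete label distributions can be computed as follows with SciPy; treating the K label indices as the support points is an illustrative simplification, and a differentiable approximation would be substituted when the distance is used as a training loss.

import numpy as np
from scipy.stats import wasserstein_distance

K = 5
support = np.arange(K)                            # label indices as support points
mu_entity = np.array([0.1, 0.6, 0.1, 0.1, 0.1])   # first prediction result (entity level)
mu_text = np.array([0.2, 0.3, 0.3, 0.1, 0.1])     # second prediction result (text level)
second_loss = wasserstein_distance(support, support,
                                   u_weights=mu_entity, v_weights=mu_text)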
Optionally, a first cross-entropy loss can be calculated according to the first prediction result and the actual category labels of the document samples, and/or a second cross-entropy loss can be calculated according to the second prediction result and the actual category labels of the document samples. A third loss is then determined according to the first cross-entropy loss and/or the second cross-entropy loss, and the parameters of the semantic information learning module are updated according to the second loss and the third loss, which improves the performance of the semantic information learning module and thus the quality of the generated semantic representations of the documents.
Alternatively, the first cross entropy loss may be taken as a third loss, or the second cross entropy loss may be taken as a third loss, or the sum of the first cross entropy loss and the second cross entropy loss may be taken as a third loss.
Alternatively, when the parameters of the semantic information learning module are updated according to the second loss and the third loss, the sum of the second loss and the third loss may be used as the integrated semantic learning loss, or the weighted sum of the second loss and the third loss may be used as the integrated semantic learning loss, and the parameters of the semantic information learning module may be updated according to the integrated semantic learning loss. The weight coefficients of the second loss and the third loss may be configured and adjusted according to the needs of the actual application scenario, which is not specifically limited herein.
Alternatively, after the first loss, the second loss, and the third loss are calculated, a comprehensive loss may be calculated according to the first loss, the second loss, and the third loss, and parameters of the entire document characterization model may be updated according to the comprehensive loss. For example, a sum of the first loss, the second loss, and the third loss is calculated as a comprehensive loss; alternatively, the first, second, and third losses are weighted and summed as a composite loss. The weight coefficients of the first loss, the second loss and the third loss can be configured and adjusted according to the needs of the actual application scene, and are not particularly limited herein.
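Illustratively, the weighted combination can be sketched as follows; the stand-in scalar values and the unit weights are arbitrary.

import torch

l1 = torch.tensor(0.7)            # first loss (edge prediction)
l2 = torch.tensor(0.3)            # second loss (Wasserstein distance)
l3 = torch.tensor(0.5)            # third loss (cross entropy)
w1, w2, w3 = 1.0, 1.0, 1.0        # weight coefficients, tuned per scenario
comprehensive_loss = w1 * l1 + w2 * l2 + w3 * l3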
Illustratively, fig. 8 is an exemplary diagram of a training framework of a document representation model according to an exemplary embodiment of the present application. Based on the document representation model shown in fig. 6, a second classifier corresponding to the first graph neural network, a third classifier corresponding to the language model, and a first classifier corresponding to the second graph neural network are added during training; consistent with the foregoing embodiments, the first classifier predicts edges, the second classifier predicts entity-level category labels, and the third classifier predicts text-level category labels. During the training of the document representation model, the entity relation graph of each document sample (P_A, ..., P_B in fig. 8) is input into the first graph neural network for graph feature learning, obtaining the feature representations of the entities contained in each document sample (e_1, e_2, e_5 and e_5, e_6, e_9 in fig. 8), from which the semantic representation of the entity level of each document sample is determined (marked in fig. 8 with the superscript E, which denotes the entity level). The semantic representations of the entity level of the document samples are input into the second classifier for classification prediction, yielding the first prediction result, denoted μ^E in fig. 8.
Meanwhile, the abstract information of each document sample is input into the language model for text representation, obtaining the semantic representation of the text level of each document sample (marked in fig. 8 with the superscript T, which denotes the text level). The semantic representations of the text level of the document samples are input into the third classifier for classification prediction, yielding the second prediction result, denoted μ^T in fig. 8.
The Wasserstein distance d between the first prediction result μ^E and the second prediction result μ^T is calculated as the second loss to constrain the learning of the first graph neural network and the language model; cross-entropy losses are calculated in combination with the actual category labels (not shown in fig. 8); and the comprehensive semantic information learning loss l_1 is then determined (not shown in fig. 8).
The semantic representation of the entity level and the semantic representation of the text level of each document are fused, the feature representation of each node in the document association graph is initialized with the fused semantic representation of each document, and the edges in the document association graph are masked (in fig. 8, broken lines represent masked edges and solid lines represent actually existing or predicted edges) to obtain the initialized document association graph. The initialized document association graph is input into the second graph neural network for graph feature learning, and the feature representation of each node is updated. The first classifier then predicts, from the updated feature representations of the nodes, whether edges exist between the nodes, obtaining the edge prediction result. A cross-entropy loss is calculated as the first loss based on the prediction result of the association relations between the nodes and the initial edges in the document association graph, denoted l_2 in fig. 8.
Here, l_1 is used to update the parameters of the first graph neural network and the language model, and l_2 is used to update the parameters of the second graph neural network; alternatively, the parameters of the first graph neural network, the language model, and the second graph neural network are all updated based on l_1 + l_2.
It should be noted that, when the structures of the document characterization models are different, the process flow of performing multi-document characterization or model training based on the document characterization models may be different, and the description of the various optional model structures in the foregoing embodiments is specifically referred to, and various possible implementations may be obtained through the combination of the embodiments, which are not described in detail.
In the foregoing model training scheme, the semantic information learning module and the correlation information learning module of the document representation model are obtained by joint training. In other embodiments, the two modules may be trained separately. When training the semantic information learning module, only the second loss and the third loss are calculated, and the parameters of the semantic information learning module are updated according to them; the first loss need not be calculated. When training the correlation information learning module, the first loss is calculated, and the parameters of the correlation information learning module are updated according to it; the second loss and the third loss need not be calculated. For the calculation of each loss, refer to the relevant content of the foregoing model training embodiment, which is not repeated here.
Fig. 9 is a flowchart of a document characterization method according to an exemplary embodiment of the present application, where the embodiment is based on the system architecture shown in fig. 1, and the execution body of the embodiment is a server in the system architecture shown in fig. 1. As shown in fig. 9, the method specifically comprises the following steps:
Step S901, receiving a document set submitted by an end-side device, where the document set includes a plurality of documents to be processed.
In this embodiment, when the end device needs to characterize a plurality of documents in the document set to obtain a document representation in the process of running the downstream application, the document set is submitted to the server. The server receives a document set submitted by the terminal side device, so that a plurality of documents to be processed contained in the document set are obtained.
Illustratively, when the end-side device submits the document set to the server, the document set may be uploaded through a front-end interface; or the end-side device submits a request containing the document set to the server, so that the server acquires the data contained in the request and returns the result. For example, the end-side device sends a call request containing the document set to the server through a call interface of the document characterization service provided by the server. The document set may also be submitted to the server in other manners, which are not described herein.
Step S902, constructing a document association diagram corresponding to a document set, wherein the document association diagram comprises nodes corresponding to all documents and edges representing association relations among the documents.
After obtaining a document set containing a plurality of documents to be processed, the server can automatically construct a document association diagram corresponding to the document set, namely, construct the document association diagram of the plurality of documents in the document set. Specifically, for a plurality of documents contained in a document set, constructing nodes corresponding to the documents; and establishing edges between nodes corresponding to any two documents with the association relation according to the association relation among the documents to obtain a document association diagram of a plurality of documents in the document set. It should be noted that the same document association graph may include one or more edges of different types, where the edges of different types represent different association relationships.
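Illustratively, the server-side construction of the document association graph can be sketched as follows; related is a hypothetical predicate standing in for whatever relevance analysis, attribute matching, or manual labels determine the association relations.

import itertools
import networkx as nx

def build_document_association_graph(documents):
    g = nx.Graph()
    g.add_nodes_from(range(len(documents)))              # one node per document
    for i, j in itertools.combinations(range(len(documents)), 2):
        rel = related(documents[i], documents[j])        # e.g. "citation", "same_author", or None
        if rel:
            g.add_edge(i, j, relation=rel)               # typed edge per association relation
    return g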
The association relationship between the documents represents the relativity between the documents, and concretely can be a reference, the same author, the same target object, the same description object and the like. For example, documents of the paper class may have a reference relationship between them, documents of the bidding class may belong to the same bidder/signer, documents of the knowledge class may describe the same item, etc. In different application scenarios, the association relationship between documents may be different for different document sets. The association relation among different documents in the document set of the specific application scene can be determined in a data analysis and mining mode, and the association relation can be obtained by adopting the prior art, and is not particularly limited.
In the present embodiment, the processing procedure of acquiring a plurality of documents to be processed and document association diagrams of the plurality of documents in step S301 is realized by steps S901 to S902. In another embodiment, the document association graph corresponding to the document set may also be constructed by an end-side device, which submits the document set and the document association graph corresponding to the document set to the server.
Step S903, inputting content information of a plurality of documents and a document association graph into a document representation model, generating semantic representations of the documents based on the content information of the documents through the document representation model, initializing feature representations of the nodes in the document association graph by using the semantic representations of the documents, performing graph feature learning based on the initialized document association graph, and updating the feature representations of the nodes.
This step is consistent with the specific implementation of step S302, and the details of the foregoing embodiment are not described herein.
Step S904, the characteristic representation of each updated node is used as the document representation of the corresponding document.
This step is consistent with the specific implementation of step S303, and the details of the foregoing embodiment are not described herein.
Step S905, a document representation of each document in the document set is transmitted to the end-side apparatus.
According to the method, a server receives a document set submitted by end side equipment, a document association diagram corresponding to the document set is constructed, content information of a plurality of documents and the document association diagram are input into a document representation model, semantic representations of the documents are generated based on the content information of the documents through the document representation model, feature representations of nodes in the document association diagram are initialized by the semantic representations of the documents, graph feature learning is conducted based on the initialized document association diagram, feature representations of the nodes are updated, end-to-end multi-document representation can be achieved, and the end side equipment is not required to provide complex data such as the document association diagram.
In this embodiment, after the document representation of each document obtained by the document characterization, the server returns the document representation of each document in the document set to the end-side device. After the terminal side device receives the document representations of the documents in the document set, the terminal side device can continue to execute the document processing logic of the downstream application according to the document representations of the documents, so that the document processing task of the downstream application is realized, and the document processing result is obtained.
Illustratively, taking a document classification scenario as an example, an end-side device may submit a document set containing a plurality of documents to be classified to a server. The server characterizes a plurality of documents in the document set to obtain document representations of the respective documents, and returns the document representations of the respective documents to the end-side device. And the terminal side equipment classifies the documents according to the document representations of the documents to obtain a document classification result.
Illustratively, taking a document retrieval scenario as an example, an end-side device may submit a document set containing a plurality of documents in a retrieval library to a server. The server characterizes the plurality of documents contained in the document set to obtain document representations of the plurality of documents in the retrieval library and returns them to the end-side device. The end-side device updates/stores the document representations of the plurality of documents in the retrieval library. Further, the end-side device performs document retrieval based on the updated retrieval library. For example, the end-side device maps the query information input by the user into a vector representation, performs similarity matching between the vector representation and the document representations of the documents in the retrieval library, obtains one or more documents matching the query information as the retrieval result, and outputs the retrieval result.
When the method of the present embodiment is applied to each downstream application, after obtaining the document representations of each document, the server may continue to execute the set processing logic according to the document representations of the plurality of documents, obtain the document processing result, and return the document processing result to the end-side device.
For example, in a document classification scenario, an end-side device may submit a document set containing a plurality of documents to be classified to a server. After the server characterizes a plurality of documents in the document set to obtain document representations of the documents, the server classifies the documents according to the document representations of the documents to obtain document classification results, and sends the document classification results to the terminal equipment.
For example, in a document association prediction scenario, an end-side device may submit a document set containing a plurality of documents to be processed to a server. After the server characterizes a plurality of documents in the document set to obtain document representations of the documents, the server predicts association relations among the documents according to the document representations of the documents and sends the association relations among the documents to the terminal side equipment. Alternatively, the server characterizes a plurality of documents in the document set to obtain document representations of the respective documents, and then returns the document representations of the respective documents in the document set to the end-side device. And the terminal equipment predicts the association relation among the documents according to the document representation of the documents to obtain the association relation among the documents.
Illustratively, in a document retrieval scenario, the server performs document characterization on a plurality of documents in a document retrieval library based on the method of the foregoing method embodiment, to obtain a document representation of each document; further, the server updates the document representations of the documents in the document retrieval library based on the document representations of the plurality of documents.
Further, the server can also realize the processing of online document retrieval based on the updated document retrieval library. Specifically, the server receives a document retrieval request sent by the terminal-side device, the document retrieval request containing input query information. The server maps the query information into vector representations, performs similarity matching on the vector representations and document representations of all documents in the document retrieval library to obtain at least one target document matched with the query information, and sends the at least one target document to the terminal side equipment.
Fig. 10 is a flowchart of a document retrieval method according to another exemplary embodiment of the present application, where an execution body of the present embodiment may operate a server of a local or cloud end of a document retrieval system, where a document retrieval library is stored, and document representations of a plurality of documents in the document retrieval library are obtained by the method of the foregoing document representation embodiment. As shown in fig. 10, the method specifically comprises the following steps:
In step S1001, a document retrieval request sent by an end-side device is received, where the document retrieval request contains input query information.
Wherein the input query information may include at least one of: keywords entered by the user through the input box of the front-end interface, at least one information item selected by the user from the front-end interface (e.g., optional keywords, screening criteria, etc.).
Step S1002, mapping the query information into vector representations, and performing similarity matching on the vector representations and document representations of all documents in a document retrieval library to obtain at least one target document matched with the query information.
The mapping of the query information to the vector representation may be specifically implemented by any method in the prior art for converting text information to the vector representation, which is not specifically limited herein.
When the vector representation is subjected to similarity matching with the document representations of the documents in the document retrieval library, the similarity between the vector representation of the query information and the document representations of the documents in the document retrieval library can be calculated, and at least one document with higher similarity is selected as a target document matched with the query information.
Alternatively, the documents in the document retrieval library may be ranked in order of high-to-low similarity, and a preset number of documents ranked first may be selected as target documents matching the query information according to the ranking result. The preset number can be configured and adjusted according to the needs and experience values of the actual application scene, and is not particularly limited herein.
Alternatively, a document with a similarity greater than a preset similarity threshold may be selected as the target document matching the query information according to the similarity. The preset similarity threshold value can be configured and adjusted according to the needs and experience values of the actual application scene, and is not particularly limited herein.
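Illustratively, the similarity matching with both selection strategies can be sketched as follows; cosine similarity is an assumption, as the embodiment does not fix the similarity measure.

import numpy as np

def retrieve(query_vec, doc_reps, top_k=10, threshold=None):
    # query_vec: [d] vector representation of the query
    # doc_reps: [M, d] document representations in the retrieval library
    sims = doc_reps @ query_vec / (
        np.linalg.norm(doc_reps, axis=1) * np.linalg.norm(query_vec) + 1e-12)
    order = np.argsort(-sims).tolist()                     # rank documents high-to-low
    if threshold is not None:
        order = [i for i in order if sims[i] > threshold]  # threshold strategy
    return order[:top_k]                                   # top-k (preset number) strategy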
In this embodiment, the solution of the foregoing embodiment is adopted to perform document characterization on a plurality of documents in a document retrieval library, so as to obtain document representations of the documents. Specifically, a plurality of documents to be characterized in a document retrieval library and document association graphs of the plurality of documents are obtained, wherein each document association graph comprises nodes corresponding to each document and edges representing association relations among the documents; inputting a plurality of documents and document associated graphs into a document representation model, performing graph feature learning based on the document associated graph initialized by using the semantic representation of each document through the document representation model, and updating the feature representation of each node; the characteristic representation of each node after updating is used as the document representation of the corresponding document. Further, the document representations of the documents in the document retrieval library are updated based on the document representations of the plurality of documents. The specific implementation process of obtaining the document representation of each document by characterizing a plurality of documents in the document retrieval library is referred to the relevant content in the foregoing embodiment, which is not described herein.
Step S1003, transmitting at least one target document to the end-side device.
The server returns the retrieved at least one target document matching the query information to the end-side device, so that the end-side device outputs the related information of the target document to the user. In addition, according to the configuration of the retrieval system, the server may send to the end-side device one or more items of key information of the target document to be output to the user, so that the end-side device outputs the one or more items of key information of the target document.
The embodiment provides a document retrieval method, which adopts the scheme of the embodiment to perform document characterization on a plurality of documents in a document retrieval library to obtain document representations of the documents, and updates the document representations of the documents in the document retrieval library according to the document representations of the documents, so that the quality of the document representations in the document retrieval library can be improved, and the accuracy and recall rate of document retrieval can be improved.
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present application. As shown in fig. 11, the server of the present embodiment may include: at least one processor 1101; and a memory 1102 communicatively coupled to the at least one processor.
The memory 1102 stores instructions executable by the at least one processor 1101, and the instructions are executed by the at least one processor 1101 to cause the server to perform the method according to any one of the embodiments, and the specific functions and the technical effects that can be achieved are similar, and are not repeated herein.
Alternatively, the memory 1102 may be separate or integrated with the processor 1101. Optionally, as shown in fig. 11, the server further includes: firewall 1103, load balancer 1104, communication component 1105, power component 1106, and other components. Only some of the components are schematically shown in fig. 11, which does not mean that the server only comprises the components shown in fig. 11.
The embodiment of the present application further provides a computer readable storage medium, in which computer executable instructions are stored, and when a processor executes the computer executable instructions, the method of any one of the foregoing embodiments is implemented, and specific functions and technical effects that can be implemented are not described herein.
Embodiments of the present application also provide a computer program product comprising a computer program which, when executed by a processor, implements the method of any of the preceding embodiments. The computer program is stored in a readable storage medium, and the computer program can be read from the readable storage medium by at least one processor of the server, where execution of the computer program by at least one processor causes the server to execute the technical solution provided in any one of the method embodiments, and specific functions and technical effects that can be achieved are not repeated herein.
The embodiment of the application provides a chip, which comprises: the processing module and the communication interface, the processing module can execute the technical scheme of the server in the foregoing method embodiment. Optionally, the chip further includes a storage module (e.g. a memory), where the storage module is configured to store the instructions, and the processing module is configured to execute the instructions stored in the storage module, and execution of the instructions stored in the storage module causes the processing module to execute the technical solution provided in any one of the foregoing method embodiments.
The integrated modules, which are implemented in the form of software functional modules, may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or processor to perform some steps of the methods of the various embodiments of the present application.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU for short), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of a method disclosed in connection with the present application may be embodied directly as being executed by a hardware processor, or executed by a combination of hardware and software modules in a processor. The memory may comprise a high-speed random access memory (Random Access Memory, RAM for short), and may further comprise a nonvolatile memory (NVM) such as at least one magnetic disk memory, and may also be a U-disk, a removable hard disk, a read-only memory, a magnetic disk, an optical disk, or the like.
The memory may be an object store (Object Storage Service, OSS for short).
The memory may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The communication component is configured to facilitate wired or wireless communication between the device in which it is located and other devices. The device where the communication component is located may access a wireless network based on a communication standard, such as wireless fidelity (WiFi), a mobile communication network of the second generation (2G), third generation (3G), fourth generation (4G)/Long Term Evolution (LTE), or fifth generation (5G) mobile communication system, or a combination thereof. In one exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
The power supply component provides power for various components of equipment where the power supply component is located. The power components may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the devices in which the power components are located.
The storage medium may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short). It is also possible that the processor and the storage medium reside as discrete components in an electronic device or a master device.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The order of the embodiments of the present application described above is merely for description and does not represent the advantages or disadvantages of the embodiments. In addition, in some of the flows described in the above embodiments and the drawings, a plurality of operations appearing in a particular order are included, but it should be clearly understood that the operations may be performed out of order or performed in parallel in the order in which they appear herein, merely for distinguishing between the various operations, and the sequence number itself does not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types. The meaning of "a plurality of" is two or more, unless specifically defined otherwise.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method of the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
The foregoing description is only of the preferred embodiments of the present application, and is not intended to limit the scope of the claims, and all equivalent structures or equivalent processes using the descriptions and drawings of the present application, or direct or indirect application in other related technical fields are included in the scope of the claims of the present application.

Claims (16)

1. A method of document characterization, comprising:
acquiring a plurality of documents to be processed and a document association graph of the plurality of documents, wherein the document association graph comprises nodes corresponding to the documents and edges representing association relations among the documents;
inputting the content information of the plurality of documents and the document association graph into a document representation model, performing graph feature learning based on the document association graph initialized by using the semantic representation of each document through the document representation model, and updating the feature representation of each node;
the characteristic representation of each node after updating is used as the document representation of the corresponding document.
2. The method of claim 1, wherein the document representation model comprises: a semantic information learning module and a correlation information learning module,
the semantic information learning module is used for: carrying out semantic information characterization learning on the content information of each document to obtain semantic representation of each document;
the correlation information learning module is used for: initializing the feature representation of each node in the document association graph by using the semantic representation of each document, and carrying out graph feature learning based on the initialized document association graph to update the feature representation of each node.
3. The method of claim 2, wherein the semantic information learning module comprises: an entity relation graph construction module, a first graph neural network, a text representation model and a semantic representation fusion module,
the entity relation graph construction module is used for: constructing an entity relation graph of each document according to the content information of each document;
the first graph neural network is used for: respectively performing graph feature learning on the entity relation graph of each document to obtain feature representations of the entities contained in each document, and fusing the feature representations of the entities contained in each document to obtain the semantic representation of the entity level of each document;
the text representation model is used for: respectively carrying out text representation on the content information of each document to obtain semantic representation of a text level of each document;
the semantic representation fusion module is used for: fusing the semantic representation of the entity level and the semantic representation of the text level of each document to obtain the semantic representation of each document.
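(Illustrative only: a sketch of the fusion step recited in claim 3. Mean pooling of the entity feature representations and concatenation followed by a linear projection are assumptions; the claim does not fix the fusion operator.)

```python
import torch
import torch.nn as nn

class SemanticFusion(nn.Module):
    """Fuses entity-level and text-level semantic representations of one document."""
    def __init__(self, entity_dim: int, text_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(entity_dim + text_dim, out_dim)

    def forward(self, entity_feats: torch.Tensor, text_repr: torch.Tensor) -> torch.Tensor:
        # entity_feats: (num_entities, entity_dim) from graph feature learning
        #               over the document's entity relation graph
        # text_repr:    (text_dim,) from the text representation model
        entity_level = entity_feats.mean(dim=0)  # fuse entity features by mean pooling
        return self.proj(torch.cat([entity_level, text_repr]))

fusion = SemanticFusion(entity_dim=32, text_dim=128, out_dim=64)
doc_sem = fusion(torch.randn(5, 32), torch.randn(128))  # (64,) semantic representation
```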
4. The method of claim 2, wherein the correlation information learning module comprises: an initialization module and a second graph neural network,
the initialization module is used for: initializing the feature representations of the nodes in the document association graph by using the semantic representations of the documents;
the second graph neural network is used for: performing graph feature learning based on the initialized document association graph and updating the feature representation of each node.
5. The method of claim 2, wherein said inputting the content information of the plurality of documents and the document association graph into a document representation model, through the document representation model, performing graph feature learning based on a document association graph initialized using semantic representations of the respective documents, updating feature representations of the respective nodes, comprises:
inputting the content information of the plurality of documents into the semantic information learning module, performing semantic information characterization learning on the content information of each document through the semantic information learning module to obtain the semantic representation of each document, and inputting the semantic representation of each document into the correlation information learning module;
inputting the document association graph into the correlation information learning module, initializing the feature representation of each node in the document association graph by using the semantic representation of each document through the correlation information learning module, and carrying out graph feature learning based on the initialized document association graph to update the feature representation of each node.
6. The method of claim 5, wherein the semantic information learning module comprises: an entity relation graph construction module, a first graph neural network and a text representation model,
and wherein the inputting of the content information of the plurality of documents into the semantic information learning module and the performing of semantic information characterization learning on the content information of each document through the semantic information learning module to obtain the semantic representation of each document comprises the following steps:
inputting the content information of the plurality of documents into the entity relation graph construction module, and constructing the entity relation graph of each document according to the content information of each document through the entity relation graph construction module;
inputting the entity relation graph of each document into the first graph neural network, and respectively performing graph feature learning on the entity relation graph of each document through the first graph neural network to obtain the semantic representation of the entity level of each document;
inputting the content information of the documents into the text representation model, and respectively carrying out text representation on the key content information of each document through the text representation model to obtain semantic representation of the text level of each document;
and fusing the semantic representation of the entity level and the semantic representation of the text level of each document to obtain the semantic representation of each document.
7. The method of claim 5, wherein the correlation information learning module comprises: an initialization module and a second graph neural network,
and wherein the inputting of the document association graph into the correlation information learning module, the initializing of the feature representations of the nodes in the document association graph by using the semantic representations of the documents through the correlation information learning module, and the performing of graph feature learning based on the initialized document association graph to update the feature representations of the nodes comprises the following steps:
inputting the semantic representation of each document and the document association graph into the initialization module, and initializing the feature representation of each node in the document association graph by using the semantic representation of each document through the initialization module;
inputting the initialized document association graph into the second graph neural network, performing graph feature learning based on the initialized document association graph through the second graph neural network, and updating the feature representation of each node.
8. The method of claim 2, wherein the training process of the document representation model comprises:
inputting a plurality of document samples in a document set for training and the document association graphs of the plurality of document samples into the document representation model, masking initial edges in the document association graphs of the plurality of document samples through the document representation model, initializing the feature representations of the nodes in the document association graphs of the plurality of document samples by using the semantic representations of the document samples to obtain an initialized document association graph, performing graph feature learning based on the initialized document association graph, and updating the feature representations of the nodes;
predicting the association relations among the nodes in the document association graph according to the updated feature representation of each node;
and calculating a first loss according to the predicted result of the association relations among the nodes in the document association graph and the initial edges in the document association graph, and updating the parameters of the document representation model according to the first loss.
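(Illustrative only: a sketch of the self-supervised training step recited in claim 8, computing the first loss by masking initial edges and predicting them back. The dot-product edge scorer, random negative sampling, and mask ratio are assumptions; encoder stands for any node-feature encoder such as the DocGraphEncoder sketch after claim 1.)

```python
import torch
import torch.nn.functional as F

def first_loss(encoder, sem, adj, mask_ratio=0.15):
    # Collect the initial (undirected) edges of the document association graph.
    edges = (torch.triu(adj, diagonal=1) > 0).nonzero()
    n_mask = max(1, int(mask_ratio * len(edges)))
    masked = edges[torch.randperm(len(edges))[:n_mask]]

    # Mask the selected edges before graph feature learning.
    adj_masked = adj.clone()
    adj_masked[masked[:, 0], masked[:, 1]] = 0
    adj_masked[masked[:, 1], masked[:, 0]] = 0

    # Update node features on the masked graph.
    h = encoder(sem, adj_masked)

    # Score node pairs by dot product: held-out edges are positives,
    # random pairs are negatives (collisions with true edges ignored for brevity).
    pos = (h[masked[:, 0]] * h[masked[:, 1]]).sum(-1)
    neg_pairs = torch.randint(0, h.size(0), masked.shape)
    neg = (h[neg_pairs[:, 0]] * h[neg_pairs[:, 1]]).sum(-1)

    scores = torch.cat([pos, neg])
    labels = torch.cat([torch.ones_like(pos), torch.zeros_like(neg)])
    return F.binary_cross_entropy_with_logits(scores, labels)
```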
9. The method of claim 8, wherein the semantic information learning module comprises: an entity relation graph construction module, a first graph neural network, a text representation model and a semantic representation fusion module,
and wherein the using of the semantic representations of the document samples by the document representation model comprises:
inputting the content information of each document sample in the document set into the entity relation graph construction module, and constructing an entity relation graph of each document sample according to the content information of each document sample through the entity relation graph construction module;
inputting the entity relation graph of each document sample in the document set into the first graph neural network, and respectively performing graph feature learning on the entity relation graph of each document sample through the first graph neural network to obtain the semantic representation of the entity level of each document sample;
inputting key content information of a plurality of document samples of the document set into the text representation model, and respectively carrying out text representation on the key content information of each document sample through the text representation model to obtain semantic representation of a text level of each document sample;
and fusing the semantic representation of the entity level and the semantic representation of the text level of each document sample to obtain the semantic representation of each document sample.
10. The method of claim 8, wherein the training process of the document representation model further comprises:
predicting class labels of the document samples according to semantic representations of entity levels of the document samples to obtain first prediction results, and predicting class labels of the document samples according to semantic representations of text levels of the document samples to obtain second prediction results;
calculating the distance between the first prediction result and the second prediction result to obtain a second loss; calculating a cross entropy loss according to the first prediction result and/or the second prediction result and the actual category labels of the document samples to obtain a third loss;
and updating parameters of the semantic information learning module according to the second loss and the third loss.
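(Illustrative only: a sketch of the second and third losses recited in claim 10. Using MSE between the two heads' predicted distributions as the distance, and applying cross entropy to both prediction results, are assumptions; the claim permits either or both.)

```python
import torch
import torch.nn.functional as F

def semantic_losses(entity_logits, text_logits, labels, alpha=1.0, beta=1.0):
    # entity_logits, text_logits: (batch, num_classes) class-label predictions
    # from the entity-level and text-level semantic representations; labels: (batch,)
    second = F.mse_loss(entity_logits.softmax(-1), text_logits.softmax(-1))  # second loss
    third = F.cross_entropy(entity_logits, labels) + F.cross_entropy(text_logits, labels)
    return alpha * second + beta * third

e, t = torch.randn(8, 10), torch.randn(8, 10)   # hypothetical 10-class prediction heads
loss = semantic_losses(e, t, torch.randint(0, 10, (8,)))
```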
11. The method of any of claims 1-10, wherein the obtaining a plurality of documents to be processed, and a document association graph of the plurality of documents, comprises:
receiving a document set submitted by a terminal side device, wherein the document set comprises a plurality of documents to be processed;
and constructing a document association graph corresponding to the document set, wherein the document association graph comprises nodes corresponding to the documents and edges representing the association relations among the documents.
12. The method of claim 11, wherein, after the step of using the updated feature representation of each node as the document representation of the corresponding document, the method further comprises:
classifying the plurality of documents according to the document representation of each document to obtain a document classification result, and sending the document classification result to the terminal side equipment;
or,
and predicting the association relation among the documents according to the document representation of the documents, and sending the association relation among the documents to the terminal side equipment.
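(Illustrative only: one way the document representations of claim 12 could feed document classification; the linear classifier head and the class count are assumptions.)

```python
import torch
import torch.nn as nn

classifier = nn.Linear(64, 10)                   # 10 hypothetical document classes
doc_reprs = torch.randn(4, 64)                   # representations from the document representation model
pred_labels = classifier(doc_reprs).argmax(-1)   # per-document classification result
```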
13. The method of any of claims 1-10, wherein the obtaining a plurality of documents to be processed, and a document association graph of the plurality of documents, comprises:
acquiring a plurality of documents contained in a document retrieval library and a document association graph of the plurality of documents;
after the updated feature representation of each node is used as the document representation of the corresponding document, the method further comprises the following steps:
updating the document representation of each document in the document retrieval library according to the document representations of the plurality of documents.
14. A document retrieval method, comprising:
responding to a document retrieval request, and acquiring query information input by a user;
mapping the query information into vector representations, and performing similarity matching on the vector representations and document representations of all documents in a document retrieval library to obtain at least one target document matched with the query information;
outputting information of the at least one target document;
wherein the document representation of each document in the document retrieval library is determined by: acquiring a plurality of documents to be characterized in the document retrieval library and a document association graph of the plurality of documents, wherein the document association graph comprises nodes corresponding to the documents and edges representing the association relations among the documents; inputting the plurality of documents and the document association graph into a document representation model, performing graph feature learning, through the document representation model, based on the document association graph initialized with the semantic representation of each document, and updating the feature representation of each node; and using the updated feature representation of each node as the document representation of the corresponding document.
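(Illustrative only: a sketch of the similarity matching step recited in claim 14. Cosine similarity and top-k ranking are assumptions; the claim does not fix the similarity measure.)

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec: torch.Tensor, doc_reprs: torch.Tensor, k: int = 5):
    # query_vec: (dim,) vector representation of the query information
    # doc_reprs: (num_docs, dim) precomputed document representations in the library
    sims = F.cosine_similarity(query_vec.unsqueeze(0), doc_reprs, dim=-1)
    scores, idx = sims.topk(min(k, len(doc_reprs)))
    return idx.tolist(), scores.tolist()  # top-k target documents and their scores

library = torch.randn(100, 64)            # stand-in document retrieval library
ids, scores = retrieve(torch.randn(64), library, k=3)
```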
15. A server, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor to cause the server to perform the method of any one of claims 1-14.
16. A computer readable storage medium having stored therein computer executable instructions which when executed by a processor implement the method of any of claims 1-14.
CN202311378060.2A 2023-10-23 2023-10-23 Method, server and storage medium for document characterization and document retrieval Pending CN117725220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311378060.2A CN117725220A (en) 2023-10-23 2023-10-23 Method, server and storage medium for document characterization and document retrieval

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311378060.2A CN117725220A (en) 2023-10-23 2023-10-23 Method, server and storage medium for document characterization and document retrieval

Publications (1)

Publication Number Publication Date
CN117725220A true CN117725220A (en) 2024-03-19

Family

ID=90198630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311378060.2A Pending CN117725220A (en) 2023-10-23 2023-10-23 Method, server and storage medium for document characterization and document retrieval

Country Status (1)

Country Link
CN (1) CN117725220A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117950906A (en) * 2024-03-27 2024-04-30 西南石油大学 Method for deducing fault cause of server based on neural network of table graph
CN117950906B (en) * 2024-03-27 2024-06-04 西南石油大学 Method for deducing fault cause of server based on neural network of table graph

Similar Documents

Publication Publication Date Title
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
CN105210064B (en) Classifying resources using deep networks
CN113627447B (en) Label identification method, label identification device, computer equipment, storage medium and program product
CN112307762B (en) Search result sorting method and device, storage medium and electronic device
CN110968695A (en) Intelligent labeling method, device and platform based on active learning of weak supervision technology
CN111611488B (en) Information recommendation method and device based on artificial intelligence and electronic equipment
CN111582409A (en) Training method of image label classification network, image label classification method and device
CN112749326A (en) Information processing method, information processing device, computer equipment and storage medium
CN112989212B (en) Media content recommendation method, device and equipment and computer storage medium
CN117725220A (en) Method, server and storage medium for document characterization and document retrieval
CN113011126A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN114357151A (en) Processing method, device and equipment of text category identification model and storage medium
CN114692007A (en) Method, device, equipment and storage medium for determining representation information
CN114330476A (en) Model training method for media content recognition and media content recognition method
CN113254649A (en) Sensitive content recognition model training method, text recognition method and related device
CN116881462A (en) Text data processing, text representation and text clustering method and equipment
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
CN116956183A (en) Multimedia resource recommendation method, model training method, device and storage medium
CN116955591A (en) Recommendation language generation method, related device and medium for content recommendation
CN116977701A (en) Video classification model training method, video classification method and device
CN113656560B (en) Emotion category prediction method and device, storage medium and electronic equipment
CN115186085A (en) Reply content processing method and interaction method of media content interaction content
CN114898184A (en) Model training method, data processing method and device and electronic equipment
CN114996435A (en) Information recommendation method, device, equipment and storage medium based on artificial intelligence
CN111552827A (en) Labeling method and device, and behavior willingness prediction model training method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination