CN114281990A - Document classification method and device, electronic equipment and medium - Google Patents

Info

Publication number
CN114281990A
CN114281990A (application CN202111552308.3A)
Authority
CN
China
Prior art keywords
document
classified
graph
vector representation
content elements
Prior art date
Legal status
Pending
Application number
CN202111552308.3A
Other languages
Chinese (zh)
Inventor
李薿
骆金昌
王海威
陈坤斌
和为
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111552308.3A
Publication of CN114281990A
Legal status: Pending

Abstract

The disclosure provides a document classification method and apparatus, an electronic device, and a medium, relating to the technical field of artificial intelligence, and in particular to cloud services, natural language processing, knowledge graphs, and deep learning. The implementation scheme is as follows: obtaining document relation information of a document to be classified, wherein the document relation information represents the association relation between the document to be classified and a plurality of classified documents; generating a vector representation of the document to be classified based on the document relation information; and determining a target category to which the document to be classified belongs based on the vector representation.

Description

Document classification method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to cloud services, natural language processing, knowledge graphs, and deep learning technologies, and specifically to a document classification method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
Artificial intelligence is the discipline of making computers simulate certain human mental processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), covering technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
A knowledge graph is a structured semantic knowledge base that can be represented as a network topology composed of nodes and edges, where nodes represent entities and edges between nodes represent relationships between entities. A knowledge graph has strong knowledge expression capability and flexibility, and can provide knowledge support for different application scenarios such as information retrieval, content recommendation, and machine question answering.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The disclosure provides a document classification method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
According to an aspect of the present disclosure, there is provided a document classification method including: obtaining document relation information of a document to be classified, wherein the document relation information represents the association relation between the document to be classified and a plurality of classified documents; generating a vector representation of the document to be classified based on the document relation information; and determining a target category to which the document to be classified belongs based on the vector representation.
According to an aspect of the present disclosure, there is provided a document classification apparatus including: an acquisition module configured to obtain document relation information of a document to be classified, wherein the document relation information represents the association relation between the document to be classified and a plurality of classified documents; a representation module configured to generate a vector representation of the document to be classified based on the document relation information; and a classification module configured to determine a target category to which the document to be classified belongs based on the vector representation.
According to an aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor, the memory storing instructions executable by the at least one processor to enable the at least one processor to perform the method.
According to an aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the above-described method.
According to an aspect of the disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described method.
According to one or more embodiments of the present disclosure, the efficiency and accuracy of document classification can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;
FIG. 2 shows a flow diagram of a document classification method according to an embodiment of the present disclosure;
FIGS. 3A-3C illustrate schematic diagrams of document relationship graphs according to embodiments of the present disclosure;
FIG. 4 illustrates a block diagram of a document classification model according to an embodiment of the present disclosure;
FIGS. 5A and 5B show schematic diagrams of a knowledge graph according to embodiments of the present disclosure;
FIG. 6 shows a block diagram of a document classification apparatus according to an embodiment of the present disclosure; and
FIG. 7 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information all comply with the provisions of relevant laws and regulations and do not violate public order and good customs.
A knowledge graph is a structured semantic knowledge base that can be represented in the form of a network topology composed of nodes and edges, where nodes represent entities and edges between nodes represent relationships between entities. An entity may be anything, such as a person, place, company, event, abstract concept, technical term, etc., and a relationship is used to express some kind of relationship between different entities. According to the type of the entity and the relation between the entities, the knowledge graph can store different information and be applied to different scenes.
With the development of information technology, a large number of documents continuously emerge in each application scenario. By establishing associations between documents and entities in the knowledge graph of the relevant application scenario, documents can be quickly retrieved and relevant information quickly obtained.
For example, in an application scenario of enterprise knowledge management, an enterprise knowledge graph may be established, where entities in the enterprise knowledge graph may be people (e.g., employees, job positions of employees, job levels, departments, etc.), things (e.g., projects, development teams, platforms, tools, etc.), knowledge (e.g., documents, technical terms, product nouns, etc.), and so forth. There are a large number of documents in an enterprise, such as project requirement documents, project development documents, news documents, technical manuals, administrative documents, product help documents, and the like. By associating the documents with the entities in the enterprise knowledge graph, the user can be helped to quickly find related documents from different entity perspectives and acquire related information from the related documents, so that the working efficiency and the user experience are improved. For example, by associating project requirement documents and project development documents with project entities, a product manager can quickly find relevant documents of a certain project so as to manage the project. As another example, by associating a technical manual with a technical term entity, a technician can be enabled to quickly obtain documents relevant to a certain technology in order to learn a professional skill. For example, by associating the product help document with the product entity, the related information of the product can be acquired from the product help document so as to provide services such as product information question answering, knowledge recommendation and the like for the user.
In the related art, the entity corresponding to the document is usually determined by means of keyword matching. That is, a keyword is set for each entity, and an entity corresponding to a document is determined by matching the title or content of the document with the keyword of the entity. This method requires manual setting of keywords for each entity. The knowledge graph is often in an incremental state, and when a new entity is added to the knowledge graph, keywords matched with the new entity need to be mined and set, so that the labor cost is high, and the efficiency is low.
In other related techniques, a vector matching approach is used to determine the entity corresponding to a document. That is, the document and the entity are represented as vectors of the same dimension, and the entity corresponding to the document is determined by calculating vector distances. This method generates the document vector from the information of the document itself only. When a document is long (i.e., contains more text), it may cover multiple different semantic topics, so the document vector cannot accurately express the subject matter of the document, and the accuracy of determining the corresponding entity based on that vector is low.
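The vector matching approach described here can be sketched minimally as follows. All names and vectors are illustrative; a real system would use learned embeddings of the same dimension for documents and entities rather than these toy values:

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_entity(doc_vec, entity_vecs):
    # Return the name of the entity whose vector is most similar
    # to the document vector.
    return max(entity_vecs,
               key=lambda name: cosine_similarity(doc_vec, entity_vecs[name]))

# Toy example: two entities embedded in a 2-dimensional space.
entity_vecs = {"entity_a": [1.0, 0.0], "entity_b": [0.0, 1.0]}
```

A document vector such as `[0.9, 0.1]` would be matched to `entity_a` here; as the surrounding text notes, a single vector built only from the document's own text works poorly when a long document mixes several topics.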
Therefore, the embodiment of the disclosure provides a document classification method, which can improve the efficiency and accuracy of document classification. In the document classification method of the embodiment of the disclosure, each category of the document may correspond to one entity in the knowledge graph, and accordingly, the document classification method based on the embodiment of the disclosure can accurately and efficiently dig out the corresponding relationship between the document and the knowledge graph entity.
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120. Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.
In embodiments of the present disclosure, server 120 may run one or more services or software applications that enable the document classification method to be performed.
In some embodiments, the server 120 may also provide other services or software applications that may include non-virtual environments and virtual environments. In certain embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.
In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating a client device 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with the server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein and is not intended to be limiting.
The user may navigate using client devices 101, 102, 103, 104, 105, and/or 106. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.
Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptops), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various Mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), Short Message Service (SMS) applications, and may use a variety of communication protocols.
Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an ethernet-based network, a token ring, a Wide Area Network (WAN), the internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., bluetooth, Wi-Fi), and/or any combination of these and/or other networks.
The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, mid-end servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.
The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.
In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and 106.
In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the defects of high management difficulty and weak service expansibility in traditional physical host and virtual private server (VPS) services.
The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as music files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and in communication with the server 120 via a network-based or dedicated connection. The databases 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.
In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.
The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.
FIG. 2 shows a flow diagram of a document classification method 200 according to an embodiment of the disclosure. Method 200 may be performed at a server (e.g., server 120 shown in fig. 1) or may be performed at a client device (e.g., client devices 101, 102, 103, 104, 105, and 106 shown in fig. 1). That is, the execution subject of each step of the method 200 may be the server 120 shown in fig. 1, or may be the client devices 101, 102, 103, 104, 105, and 106 shown in fig. 1.
As shown in fig. 2, the method 200 includes:
step 210, obtaining document relation information of the document to be classified, wherein the document relation information is used for representing the incidence relation between the document to be classified and a plurality of classified documents;
step 220, generating vector representation of the documents to be classified based on the document relation information; and
step 230, based on the vector representation, determining the target category to which the document to be classified belongs.
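The three steps above can be sketched as a minimal pipeline. All helper names here are hypothetical: `encode` stands in for the vector-generation step described later (e.g., a graph-based model), and `classify` stands in for the category-determination step:

```python
# Toy relation info: doc1 shares content elements with doc2 and doc3.
relation_info = {"doc1": ["doc2", "doc3"]}

def classify_document(doc_id, relation_info, encode, classify):
    relations = relation_info[doc_id]   # step 210: obtain document relation info
    vector = encode(doc_id, relations)  # step 220: generate a vector representation
    return classify(vector)             # step 230: determine the target category
```

For instance, with a trivial `encode` that counts related documents and a threshold-based `classify`, `classify_document("doc1", ...)` runs the whole pipeline end to end.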
According to the embodiments of the present disclosure, the vector representation of the document to be classified is generated based on the document relation information between the document to be classified and the classified documents, and the target category to which the document to be classified belongs is determined based on that vector representation. On the one hand, there is no need to set keywords for each category, which improves document classification efficiency. On the other hand, the document relation information expresses the association relation between the document to be classified and the classified documents, so the vector representation generated from it can incorporate information from the classified documents. The classified documents thus assist in classifying the document to be classified, avoiding the limited classification effect of relying only on the information of the document to be classified itself and improving classification accuracy.
The various steps of method 200 are described in detail below.
In step 210, document relation information of the document to be classified is obtained, and the document relation information is used for representing the association relationship between the document to be classified and a plurality of classified documents.
Document classification is a computational task in the field of natural language processing, and refers to mapping a document into a preset class space, i.e., determining a class corresponding to the document. The category space may be set according to an actual application scenario.
For example, in an application scenario where correspondences between documents and knowledge-graph entities are mined, a category space may be a set of multiple entities in the knowledge-graph, with each entity in the set corresponding to a category. It should be noted that the category space of the document may include all entities in the knowledge-graph or may include some entities in the knowledge-graph. The number of entities in the knowledge-graph is typically very large, and in order to improve the classification efficiency, documents are generally classified only for a limited number of entities required by the current business. That is, the "entities" in the category space are part, but not all, of the entities included in the knowledge-graph.
For another example, in an application scenario where the sentiment polarity of a document is determined, the category space may be a set of sentiment polarities (e.g., positive, neutral, negative, etc.), each corresponding to a category. In the application scenario of library book classification, the document may be a book, and the category space may be a set consisting of a plurality of book classification numbers, each corresponding to a category. In an application scenario in which a search intention of a user is recognized, a document may be a search word input by the user or a search log of the user, and a category space may be a set composed of a plurality of content fields (e.g., news, question answering, sports, etc.), each of which corresponds to a category.
The following description will be given of a document classification method according to an embodiment of the present disclosure, taking an application scenario (that is, a category of a document is an entity in a knowledge graph) for mining a correspondence between the document and a knowledge graph entity as an example. However, it should be understood that the document classification method of the embodiments of the present disclosure may be applicable to any application scenario, i.e., the category of the document may be any value and is not limited to entities in the knowledge graph.
In step 210, the classified documents refer to documents to which the category has been determined. Accordingly, in an application scenario where correspondences between documents and knowledgegraph entities are mined, a classified document refers to a document for which entity category mapping has been completed, i.e., a document for which the corresponding entity has been determined. It should be noted that the number of entities corresponding to the classified documents may be one or more.
According to some embodiments, classified documents may be determined by information matching on high-quality documents. A high-quality document is a document with standardized content that has a title structure or attribute information; accordingly, such a document can be classified based on its title and/or attribute information. If the document is successfully matched to an entity, it becomes a classified document. For example, the title of a document (e.g., including multiple titles at different levels) and/or its attribute information (e.g., author, unit, category, etc.) may be matched against the description information of entities (e.g., name, alias, content description, etc.). In response to determining that the title or attribute information of the document matches the description information of an entity, the document is marked as a classified document, and the matched entity is recorded as the entity corresponding to that classified document.
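A minimal sketch of this information-matching step, assuming documents and entities are plain dictionaries (the field names `title`, `attributes`, `name`, and `aliases` are illustrative, not part of the patent). Matching here is simple substring containment in either direction; a production system would likely use normalization and fuzzier matching:

```python
def match_entity(doc, entities):
    # Match a document's title/attribute strings against each entity's
    # description strings; return the first matching entity name, else None.
    doc_strings = [doc["title"]] + list(doc.get("attributes", {}).values())
    for entity in entities:
        descriptions = [entity["name"]] + entity.get("aliases", [])
        if any(d in s or s in d for s in doc_strings for d in descriptions):
            return entity["name"]
    return None
```

A document whose title or attributes mention an entity's name or alias is marked as classified with that entity; documents that match nothing remain unclassified and fall through to the graph-based method.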
The document relation information is used to represent the association relation between the document to be classified and a plurality of classified documents. It is to be understood that the document relation information may cover a plurality of documents to be classified. In practice, there are a large number of documents whose content is irregular and of low quality, so the entities corresponding to them cannot be determined by information matching. Such documents can serve as the documents to be classified of the embodiments of the present disclosure: they are modeled in the document relation information and classified by the method 200.
As described above, since there are a large number of documents with irregular contents and low quality, a plurality of documents to be classified are generally included in the document relation information, and the number of documents to be classified is generally much larger than that of classified documents. For each document to be classified in the document relationship information, it may be classified by performing the method 200 of the embodiments of the present disclosure. In other words, the document to be classified in step 210 may be any one of the documents to be classified included in the document relationship information.
The document relation information is used for representing the incidence relation between the document to be classified and a plurality of classified documents. According to some embodiments, the document to be classified and each classified document respectively include a plurality of content elements. The document to be classified and the classified document may be associated by the same content element included. Therefore, the association relation between the documents can be quickly established by using the same content in different documents.
The content elements may be any granularity of textual content included in the document, such as words, sentences, paragraphs, and the like.
It will be appreciated that because of the small granularity of words and phrases, even different documents will typically have multiple identical words or phrases. The granularity of sentences and paragraphs is large, and the repetition of sentences or paragraphs rarely occurs between different documents. Therefore, words or phrases are often chosen to establish associations between documents. Furthermore, words and phrases can be simultaneously selected to establish the association relationship between the documents, so that the relational expression between the documents is more detailed and accurate.
According to some embodiments, the document relationship information may be represented by a plurality of triples of the form (A, B, C), where A and C are two different documents (each of which may be a document to be classified or a classified document) and B is a content element included in both A and C. For example, if the document to be classified Doc1 and the classified document Doc2 both include the content element Word1, the association between Doc1 and Doc2 may be represented by the triple (Doc1, Word1, Doc2). As another example, if classified documents Doc2 and Doc3 both include the content element Word2, their association may be represented by the triple (Doc2, Word2, Doc3).
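The triple construction can be sketched as follows, assuming each document is reduced to a set of its content elements (function and variable names are illustrative):

```python
def build_triples(docs):
    # docs: mapping doc_id -> set of content elements (e.g. words).
    # Emit one (doc_a, element, doc_b) triple per shared content
    # element, for every unordered pair of distinct documents.
    triples = []
    ids = sorted(docs)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            for element in sorted(docs[a] & docs[b]):
                triples.append((a, element, b))
    return triples
```

With `docs = {"Doc1": {"Word1"}, "Doc2": {"Word1", "Word2"}, "Doc3": {"Word2"}}`, this reproduces the two example triples from the text: (Doc1, Word1, Doc2) and (Doc2, Word2, Doc3).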
According to other embodiments, the document relationship information may be represented by a document relationship graph. The document relationship graph may be constructed before step 210 is performed, i.e., the method 200 may further include: prior to step 210, constructing the document relationship graph. Accordingly, when step 210 is executed, the constructed document relationship graph may be acquired, thereby acquiring the document relationship information.
The document relationship graph may be constructed in different ways, with different structures (e.g., different node types, edge weight settings, etc.).
According to some embodiments, the document relationship graph may be an undirected weighted graph comprising only document nodes, each node corresponding to a document (which may be a document to be classified or a classified document), as shown in FIG. 3A. A connecting edge between two nodes represents an association between the two documents, and the weight of the connecting edge represents the degree of association between the two documents.
Accordingly, the document relationship graph may be constructed as follows. Firstly, the plurality of content elements included in the document to be classified and in each classified document are obtained; for example, a document may be divided into a plurality of words, or into phrases by a word segmentation algorithm. Whether a connecting edge exists between two document nodes is determined by checking whether the two documents include the same content element. The weight of the connecting edge between two documents is determined by calculating the number or proportion of identical content elements; for example, the Jaccard distance of two documents may be used as the weight of the connecting edge between them. In FIG. 3A, Jaccard(i, j) represents the Jaccard distance between document i and document j. Alternatively, the weights in FIG. 3A may be removed, that is, the document relationship graph is represented as an undirected unweighted graph.
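As a rough illustration of this construction, the following sketch uses the Jaccard coefficient of two documents' content-element sets as the edge weight (the disclosure refers to a Jaccard distance; the coefficient variant shown here grows with the degree of association). All names and data are illustrative:

```python
def jaccard(a, b):
    """Jaccard coefficient of two documents' content-element sets:
    |intersection| / |union|."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

def build_doc_graph(docs):
    """Undirected weighted graph over document nodes: an edge exists
    when two documents share at least one content element; its weight
    is the Jaccard coefficient of their element sets."""
    edges = {}
    names = list(docs)
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            w = jaccard(docs[names[i]], docs[names[j]])
            if w > 0:
                edges[(names[i], names[j])] = w
    return edges

docs = {"Doc1": ["a", "b", "c"], "Doc2": ["b", "c", "d"], "Doc3": ["x"]}
print(build_doc_graph(docs))  # {('Doc1', 'Doc2'): 0.5}
```

Doc3 shares nothing with the other documents, so it remains an isolated node with no connecting edges.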
According to other embodiments, the document relationship graph may be an undirected weighted graph including both document nodes and content element nodes. Each node corresponds to a document (which may be a document to be classified or a classified document) or a content element (which may be a word or a phrase). A connecting edge between two content elements indicates that the two content elements appear in the same document, and its weight indicates the degree of correlation between the two content elements (hereinafter referred to as the "first degree of correlation"). A connecting edge between a document and a content element indicates that the document includes the content element, and its weight indicates the degree of correlation between the document and the content element (hereinafter referred to as the "second degree of correlation"). Alternatively, the weights in the above embodiments may be removed, that is, the document relationship graph is represented as an undirected unweighted graph.
Accordingly, the document relationship graph may be constructed as follows. Firstly, the plurality of content elements included in the document to be classified and in each classified document are obtained. Subsequently, a first degree of correlation between any two content elements, and a second degree of correlation between any content element and any document, are determined. Then, the document relationship graph is constructed based on the first and second degrees of correlation: the graph comprises a plurality of nodes and connecting edges between them, each node corresponds to a document or a content element, the weight of the connecting edge between two content elements indicates their first degree of correlation, and the weight of the connecting edge between a content element and a document indicates their second degree of correlation. In this embodiment, the document relationship graph expresses richer information, thereby improving the document classification effect.
The first degree of correlation may be, for example, positive pointwise mutual information (PPMI), and the second degree of correlation may be, for example, term frequency-inverse document frequency (TF-IDF).
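Minimal reference implementations of the two relevance measures, under the usual textbook definitions of TF-IDF and PPMI over co-occurrence windows (the exact formulas used in the disclosure may differ, e.g., in smoothing or in how co-occurrence statistics are collected):

```python
import math

def tf_idf(word, doc, docs):
    """TF-IDF of `word` in `doc` over the collection `docs`:
    term frequency times log inverse document frequency."""
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in docs if word in d)
    idf = math.log(len(docs) / df)
    return tf * idf

def ppmi(w1, w2, windows):
    """Positive PMI of w1 and w2, estimated from co-occurrence in a
    list of windows (each window is a list of content elements)."""
    n = len(windows)
    p1 = sum(1 for w in windows if w1 in w) / n
    p2 = sum(1 for w in windows if w2 in w) / n
    p12 = sum(1 for w in windows if w1 in w and w2 in w) / n
    if p12 == 0:
        return 0.0
    return max(0.0, math.log(p12 / (p1 * p2)))

docs = [["a", "b"], ["a", "c"]]
print(round(tf_idf("b", docs[0], docs), 4))  # 0.3466  (tf=1/2, idf=log 2)

windows = [["a", "b"], ["c"], ["a", "b"]]
print(round(ppmi("a", "b", windows), 4))     # 0.4055  (log 1.5)
```

A word that appears in every document gets idf = log(1) = 0, so its TF-IDF weight vanishes; likewise PPMI clips negatively correlated pairs to zero.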
FIG. 3B shows an example of a document relationship graph corresponding to the above embodiment, in which the content elements are words. The first degree of correlation is implemented as PPMI, where PPMI(i, j) represents the PPMI between word i and word j. The second degree of correlation is implemented as TF-IDF, where TF-IDF(i, j) represents the TF-IDF of word i in document j.
FIG. 3C shows another example of a document relationship graph corresponding to the above-described embodiment, in which the content elements include both words and phrases. The first degree of correlation between two words, or between two phrases, is implemented as PPMI. The second degree of correlation between a word and a document, or between a phrase and a document, is implemented as TF-IDF. According to some embodiments, as shown in FIG. 3C, a third degree of correlation between a word and a phrase may further be set, which may also be implemented as TF-IDF, for example.
Further, according to some embodiments, the first degree of correlation between any two content elements may be determined as follows: divide each document into at least one document block, obtaining a plurality of document blocks; for each document block, determine the local relevance of the two content elements within that block; and take the average of the local relevance values over the plurality of document blocks as the first degree of correlation of the two content elements. The document is thus analyzed locally, which prevents two words that are far apart and concern different topics from being counted as related merely because they co-occur in the same long document, thereby improving the accuracy of the document relationship graph and of the document classification.
A document block is a sequence of consecutive content elements within the same document. It should be noted that different document blocks of the same document may have overlapping content elements, and each document block may include the same number of content elements. According to some embodiments, a sliding-window approach may be employed to divide a document into a plurality of document blocks: a window is slid through the document according to a set window size (the number of content elements in each window) and sliding step (the number of content elements advanced per slide), and the content elements within the window form one document block. For example, with the window size set to 50 and the step size set to 2, the document is divided into document blocks as follows: content elements 1 to 50 form one block, content elements 3 to 52 form the next, content elements 5 to 54 the next, and so on.
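The sliding-window division above can be sketched as follows; the function name and the handling of documents shorter than one window are assumptions made for this sketch:

```python
def split_blocks(elements, window=50, step=2):
    """Split a document (a list of content elements) into overlapping
    document blocks by sliding a fixed-size window with a given step."""
    if len(elements) <= window:
        return [elements]  # assumed: a short document is one block
    return [elements[i:i + window]
            for i in range(0, len(elements) - window + 1, step)]

doc = list(range(1, 55))  # content elements 1..54
blocks = split_blocks(doc, window=50, step=2)
print(len(blocks), blocks[0][0], blocks[0][-1], blocks[1][0])
# 3 1 50 3  -> blocks cover elements 1-50, 3-52, and 5-54
```

With these parameters, consecutive blocks overlap in 48 of their 50 elements, matching the example in the text.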
According to some embodiments, an initial vector representation of each node in the document relationship graph may further be generated, and the initial vector representations of different node types may be generated differently. For example, for a document node (which may be a document to be classified or a classified document), a pre-trained coding model (e.g., an Ernie model, a BERT model, etc.) may be used to encode the document to obtain its initial vector representation. For content element nodes (e.g., words), the initial vector representation may be initialized with random values drawn from a zero-mean normal distribution. It is understood that, in subsequent steps, the coding model and the initial vector of each node may be updated based on labeled data (e.g., the classified documents and their corresponding categories) to obtain a coding model that more accurately extracts document features, and vector representations that more accurately express the semantics of each node.
In step 220, a vector representation of the documents to be classified is generated based on the document relationship information.
According to some embodiments, the document relationship information may be represented by a document relationship graph, and accordingly step 220 may include: acquiring an initial vector representation of the document to be classified; and updating the initial vector representation through a graph neural network based on the document relationship graph to obtain the vector representation of the document to be classified. The graph neural network can comprehensively express the structure and propagation characteristics of the document relationship graph and better mine the associations among documents, thereby improving the feature (i.e., vector representation) extraction effect and the accuracy of document classification. The graph neural network may be, for example, a Graph Convolutional Network (GCN), a Graph Attention Network (GAT), a GraphSAGE (Graph SAmple and aggreGatE) network, or the like, but is not limited thereto.
According to some embodiments, the initial vector representation of the document to be classified is obtained by encoding the document to be classified with a pre-trained coding model, for example an Ernie model or a BERT model, but not limited thereto. Generating the initial vector representation with a coding model pre-trained on a large corpus improves both the efficiency and the accuracy of obtaining the initial vector representation.
According to some embodiments, the step of updating the initial vector representation of the document to be classified through the graph neural network further comprises: determining at least one neighbor node of the document to be classified based on the document relationship graph; and inputting the adjacency matrix of the document relationship graph, the initial vector representation of the document to be classified, and the vector representation of the at least one neighbor node into the graph neural network to obtain the updated vector representation of the document to be classified output by the graph neural network. In this way, the features of the neighbor nodes (including classified documents) are fused into the features of the document to be classified, assisting its classification and improving classification accuracy.
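As an illustration of how neighbor features are fused into a document's representation, the following deliberately simplified aggregation step uses plain averaging in place of the learned, weighted GCN propagation described by the disclosure; all node names and vectors are illustrative:

```python
def aggregate(node, vectors, neighbors):
    """One simplified aggregation step: replace a node's vector with the
    mean of its own vector and its neighbors' vectors (self-loop
    included). An unweighted stand-in for GCN propagation."""
    pool = [vectors[node]] + [vectors[n] for n in neighbors[node]]
    dim = len(vectors[node])
    return [sum(v[d] for v in pool) / len(pool) for d in range(dim)]

# Doc1 is connected only to the shared content element node "word1".
vectors = {"Doc1": [1.0, 0.0], "Doc2": [0.0, 1.0], "word1": [1.0, 1.0]}
neighbors = {"Doc1": ["word1"]}
print(aggregate("Doc1", vectors, neighbors))  # [1.0, 0.5]
```

After aggregation, Doc1's vector has absorbed part of the content element's features; stacking two such steps lets information from Doc2 reach Doc1 through the shared word node.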
According to some embodiments, the graph neural network may be obtained, for example, by training on the classified documents and their corresponding entities (i.e., class labels). According to other embodiments, the graph neural network may also be used directly, without training, to aggregate and update the vector representations of the nodes.
In step 230, a target class to which the document to be classified belongs is determined based on the vector representation of the document to be classified.
Where the category space is a set of knowledge-graph entities, the target category corresponds to a target entity in the knowledge graph. The document classification method of the embodiments of the disclosure can therefore accurately and efficiently mine the correspondence between documents and knowledge-graph entities. According to some embodiments, the vector representation of a document to be classified may be input into a trained classifier to obtain the target category (i.e., target entity) output by the classifier. The classifier may be obtained, for example, by training on the classified documents and their corresponding class labels (i.e., entity labels). During training, the input of the classifier is the vector representation of a classified document and the output is a predicted category for that document. A loss value is calculated based on the predicted category and the true class label (the loss function may be, for example, a cross-entropy loss function), and the parameters of the classifier are adjusted based on the loss value. The steps of calculating the loss value and adjusting the parameters may be repeated until the loss value is smaller than a threshold, at which point training of the classifier is complete.
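The cross-entropy loss mentioned above can be illustrated on raw classifier scores (logits); this is a generic sketch of the standard loss, not the disclosure's exact training code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, true_class):
    """Cross-entropy between the predicted distribution and a one-hot
    true class label: -log(probability assigned to the true class)."""
    return -math.log(softmax(logits)[true_class])

# A confident correct prediction yields a small loss; a wrong one, a large loss.
print(round(cross_entropy([5.0, 0.0, 0.0], 0), 4))  # 0.0134
print(round(cross_entropy([5.0, 0.0, 0.0], 1), 4))  # 5.0134
```

Driving this loss below a threshold by adjusting the classifier's parameters is exactly the stopping criterion described above.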
According to some embodiments, the above-mentioned coding model, graph neural network, and classifier may be trained by the classified documents and their corresponding class labels (i.e., entity labels) as a whole document classification model.
FIG. 4 shows a block diagram of a document classification model 400 according to an embodiment of the disclosure. As shown in FIG. 4, document classification model 400 includes an encoding model 410, a graph neural network 420, and a classifier 430.
The coding model 410 takes a document as input and outputs the document's initial vector representation Embedding1. The initial vector representation Embedding2 of a word and the initial vector representation Embedding3 of a phrase are randomly generated. The initial vector representation of each document, word, and phrase is input into the graph neural network 420.
The graph neural network 420 further includes a computation unit 422, an aggregation unit 424, and an update unit 426.
The computing unit 422 is a linear layer for mapping the initial vector representations to extract more meaningful node feature information (as mentioned earlier, a node may be a document, a word, or a phrase). The calculation of the linear layer can be expressed as H1 = σ(W1·X + b1), where X is the matrix formed by the initial vector representations of all nodes and is the input to the linear layer; W1 and b1 are the weight and bias of the linear layer, both trainable parameters; σ(·) is the sigmoid activation function; and H1 is the output of the linear layer.
The aggregation unit 424 is configured to aggregate the vector representation of each node with the vector representations of its neighbor nodes, enriching a node's information with that of its neighbors. The aggregation unit 424 may, for example, include two aggregation layers, i.e., perform two aggregations of the node vector representations over the document relationship graph. The two aggregations can be expressed as

H2 = D^(-1/2)·(A + I)·D^(-1/2)·H1·W2

H3 = D^(-1/2)·(A + I)·D^(-1/2)·H2·W3

where H2 and H3 are the outputs of the first and second aggregation layers, respectively; A is the adjacency matrix of the document relationship graph; I is the identity matrix; D is the degree matrix of the document relationship graph (computable from the adjacency matrix A); and W2 and W3 are the weights of the first and second aggregation layers, which may be trainable parameters or fixed random values. The output H3 of the aggregation unit 424 is the matrix formed by the aggregated vector representations of all nodes.
The updating unit 426 is configured to update the aggregated vector representation of each node output by the aggregation unit 424, so that each node's aggregated vector representation carries a richer information representation. The update unit 426 may adopt an MLP (Multilayer Perceptron) structure, for example two linear layers. Its calculation can be expressed as H4 = W5·(W4·H3 + b2) + b3, where H4 is the output of the update unit 426; W4 and b2 are the weight and bias of the first linear layer, and W5 and b3 are the weight and bias of the second linear layer, all four of which are trainable parameters. The output H4 of the update unit 426 is the matrix formed by the updated vector representations of all nodes. As shown in FIG. 4, the updated vectors are denoted Embedding1', Embedding2', and Embedding3'.
The classifier 430 is configured to map the updated vector representation Embedding1' of a document node into the classification space and classify the document node through a softmax layer, associating the document with a certain entity (i.e., category). The classifier 430 may be implemented as a combination of a linear (fully connected) layer and a softmax layer, and its calculation can be expressed as Y = argmax(softmax(W6·H4 + b4)), where W6 and b4 are the weight and bias of the linear layer, both trainable parameters, and argmax(·) returns the index at which softmax(W6·H4 + b4) takes its maximum value, i.e., the identifier of the highest-scoring entity. That is, the output of the classifier 430 is the entity identifier Y corresponding to the document.
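The classifier's calculation Y = argmax(softmax(W6·H4 + b4)) can be traced numerically for a single document vector; the dimensions and values below are illustrative (note that the argmax of the softmax equals the argmax of the raw logits, since softmax is monotonic):

```python
import math

def classify(h, W, b):
    """Final classification step: map the updated document vector h into
    the class space with a linear layer (logits = W.h + b), apply
    softmax, and return the argmax as the entity identifier Y."""
    logits = [sum(w * x for w, x in zip(row, h)) + bk
              for row, bk in zip(W, b)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    probs = [e / s for e in exps]
    return probs.index(max(probs))  # entity identifier Y

h = [0.2, 0.9]                            # updated document vector (2-dim)
W = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # 3 candidate entities
b = [0.0, 0.0, 0.0]
print(classify(h, W, b))  # 1 -> the document maps to entity 1
```

Here the logits are [0.2, 0.9, 0.55], so the document is assigned to the second entity regardless of whether the softmax is actually applied before the argmax.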
For the model 400 shown in FIG. 4, the classified documents may be used to train it and adjust its trainable parameters (e.g., the weights W1-W6 and biases b1-b4 described above). After training of the model 400 is completed, a document to be classified may be input into the model to obtain the entity identifier Y output by the model, that is, to determine the target entity corresponding to the document to be classified.
According to some embodiments, in the event that the target categories correspond to target entities in the knowledge-graph, after the target categories to which the documents to be classified belong are determined by step 230, the documents to be classified may be associated with the respective target entities in the knowledge-graph. There are a variety of ways to associate a document to be classified with a target entity.
According to some embodiments, the document to be classified may be added to the knowledge graph as an entity, and a relationship edge may be established between that entity and the target entity determined in step 230. For example, FIG. 5A illustrates a partial view of a knowledge graph. Suppose that after step 230 the target entity corresponding to the document to be classified is determined to be "entity 5". The document to be classified may then be associated with the target entity by adding it to the knowledge graph as an entity, i.e., "entity 8" in FIG. 5A, and establishing a relationship edge between entity 8 and entity 5 (shown as a dashed line in FIG. 5A).
It is to be understood that in fig. 5A, different node shapes represent different entity types. For example, entity 5 is of a different type than entity 8, entity 5 may be of a type such as "technical term" and entity 8 may be of a type such as "document".
According to other embodiments, each entity in the knowledge-graph has associated document attributes for recording documents associated with the entity. Accordingly, after the target entity corresponding to the document to be classified is determined through step 230, the identifier of the document to be classified may be added to the relevant document attribute of the target entity, so as to associate the document to be classified with the target entity. For example, via step 230, the target entity corresponding to the document to be classified is determined to be "entity 5". As shown in FIG. 5B, the entity 5 has a number of attributes, such as type, name, related documents, and so forth. The identification of the document to be classified "Doc 9" may be added to the relevant document attributes of entity 5, thereby associating the document to be classified with the target entity.
According to the embodiment of the disclosure, a document classification device is also provided. Fig. 6 shows a block diagram of the structure of a document sorting apparatus 600 according to an embodiment of the present disclosure. As shown in fig. 6, the apparatus 600 includes:
an obtaining module 610 configured to obtain document relation information of a document to be classified, where the document relation information is used to represent an association relationship between the document to be classified and a plurality of classified documents;
a representation module 620 configured to generate a vector representation of the document to be classified based on the document relation information; and
a classification module 630 configured to determine a target class to which the document to be classified belongs based on the vector representation.
According to the embodiment of the disclosure, the vector representation of the document to be classified is generated based on the document relation information of the document to be classified and the classified document, and the target class to which the document to be classified belongs is determined based on the vector representation. On one hand, keywords do not need to be set for all categories, and therefore document classification efficiency is improved. On the other hand, the document relation information can express the incidence relation between the documents to be classified and the classified documents, the vector representation of the documents to be classified is generated based on the document relation information, and the generated vector representation can contain the information of the classified documents, so that the classification of the documents to be classified can be assisted by the information of the classified documents, the problem of limited classification effect caused by only depending on the information of the documents to be classified is avoided, and the classification accuracy is improved.
According to some embodiments, the document to be classified and each classified document respectively comprise a plurality of content elements, the document to be classified and the classified document being associated by the same content elements that the classified document comprises.
According to some embodiments, the document relationship information is represented by a document relationship graph, the apparatus 600 further comprising: an acquisition unit configured to acquire a plurality of content elements included in the document to be classified and each classified document, respectively; a determination unit configured to determine a first degree of correlation between any two content elements, and a second degree of correlation between any one content element and any one document; and a construction unit configured to construct the document relationship graph based on the first relevance and the second relevance, wherein the document relationship graph comprises a plurality of nodes and connecting edges between the plurality of nodes, each node corresponds to one document or one content element, the weight of the connecting edge between two content elements indicates the first relevance between the two content elements, and the weight of the connecting edge between a content element and a document indicates the second relevance between the content element and the document.
According to some embodiments, the determining unit is further configured to: dividing each document into at least one document block to obtain a plurality of document blocks; for any document block of the plurality of document blocks, determining a local relevance of two content elements within the document block; and taking an average of local relevance of the two content elements within the plurality of document blocks as a first relevance of the two content elements.
According to some embodiments, the document relationship information is represented by a document relationship graph, and the representing module 620 further comprises: an initialization unit configured to obtain an initial vector representation of the document to be classified; and the updating unit is configured to update the initial vector representation through a graph neural network based on the document relation graph to obtain the vector representation.
According to some embodiments, the initialization unit is further configured to: and coding the document to be classified by adopting a pre-trained coding model to obtain the initial vector representation.
According to some embodiments, the update unit is further configured to: determining at least one neighbor node of the document to be classified based on the document relation graph; and inputting the adjacency matrix of the document relational graph, the initial vector representation and the vector representation of each of the at least one neighbor node into the graph neural network to obtain the vector representation of the document to be classified output by the graph neural network.
According to some embodiments, the target category corresponds to a target entity in the knowledge-graph.
It should be understood that the various modules or units of the apparatus 600 shown in fig. 6 may correspond to the various steps in the method 200 described with reference to fig. 2. Thus, the operations, features and advantages described above with respect to the method 200 are equally applicable to the apparatus 600 and the modules and units comprised thereby. Certain operations, features and advantages may not be described in detail herein for the sake of brevity.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided into multiple modules and/or at least some of the functionality of multiple modules may be combined into a single module. For example, the presentation module 620 and the classification module 630 described above may be combined into a single module in some embodiments.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to FIG. 6 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the modules 610 to 630 may be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip (which includes one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry), and may optionally execute received program code and/or include embedded firmware to perform functions.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to FIG. 7, a block diagram of an electronic device 700, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The term electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the electronic device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote controller. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, Wi-Fi devices, WiMAX devices, cellular communication devices, and/or the like.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), system on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems, and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. It is to be appreciated that, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims (19)

1. A method of document classification, comprising:
acquiring document relation information of a document to be classified, wherein the document relation information represents an association relationship between the document to be classified and a plurality of classified documents;
generating a vector representation of the document to be classified based on the document relation information; and
determining a target class to which the document to be classified belongs based on the vector representation.
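The three steps of claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the association relation is folded in by simple averaging with the related classified documents' vectors (standing in for the graph-based update of later claims), and the final step assumes a nearest-centroid classifier by cosine similarity.

```python
import math

def mean_vector(vectors):
    # element-wise mean of a list of equal-length vectors
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def classify(doc_vector, related_vectors, class_centroids):
    # step 2: generate the vector representation from the association
    # relation (here: averaging with related classified documents)
    vec = mean_vector([doc_vector] + related_vectors)
    # step 3: target class = the class whose centroid is most similar
    return max(class_centroids, key=lambda c: cosine(vec, class_centroids[c]))
```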
2. The method of claim 1, wherein the document to be classified and each classified document each include a plurality of content elements, and the document to be classified is associated with a classified document through the same content elements that both documents include.
3. The method of claim 1 or 2, wherein the document relationship information is represented by a document relationship graph, the method further comprising:
respectively acquiring a plurality of content elements included in the document to be classified and each classified document;
determining a first degree of correlation between any two content elements and a second degree of correlation between any content element and any document; and
constructing the document relation graph based on the first relevance and the second relevance, wherein the document relation graph comprises a plurality of nodes and connecting edges among the plurality of nodes, each node corresponds to one document or one content element, the weight of a connecting edge between two content elements indicates the first relevance between the two content elements, and the weight of a connecting edge between a content element and a document indicates the second relevance between the content element and the document.
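The heterogeneous graph of claim 3 can be sketched as below. This is a hedged illustration, not the patented construction: the graph is a plain adjacency dictionary, the relevance measures are passed in as callables (the claim does not fix them), and, for simplicity, document-element edges are only added between a document and the elements it actually contains.

```python
from itertools import combinations

def build_relation_graph(docs, element_relevance, doc_relevance):
    """docs: {document_id: [content elements]}.
    Nodes: one per document and one per distinct content element.
    Edge weights: element-element edges carry the first relevance,
    element-document edges carry the second relevance."""
    graph = {}  # node -> {neighbor: weight}

    def add_edge(u, v, w):
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w

    elements = {e for elems in docs.values() for e in elems}
    for e1, e2 in combinations(sorted(elements), 2):
        w = element_relevance(e1, e2)      # first degree of correlation
        if w > 0:
            add_edge(e1, e2, w)
    for doc_id, elems in docs.items():
        for e in set(elems):
            add_edge(doc_id, e, doc_relevance(e, doc_id))  # second degree
    return graph
```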
4. The method of claim 3, wherein determining a first degree of correlation between any two content elements comprises:
dividing each document into at least one document block to obtain a plurality of document blocks;
for any document block of the plurality of document blocks, determining a local relevance of two content elements within the document block; and
taking the average value of the local relevance of the two content elements over the plurality of document blocks as the first relevance of the two content elements.
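The averaging scheme of claim 4 can be sketched as follows. The claim leaves the local relevance measure open; the co-occurrence indicator used here (1.0 when both elements appear in a block, else 0.0) is an assumption for illustration only, as are the function names and the fixed block size.

```python
def split_into_blocks(doc, block_size):
    # divide a document (a list of content elements) into blocks
    return [doc[i:i + block_size] for i in range(0, len(doc), block_size)]

def local_cooccurrence(block, e1, e2):
    # assumed local relevance: co-occurrence of both elements in a block
    return 1.0 if e1 in block and e2 in block else 0.0

def first_relevance(docs, e1, e2, block_size=2):
    """First relevance of (e1, e2) = the mean of their local relevance
    over every block of every document."""
    blocks = [b for doc in docs for b in split_into_blocks(doc, block_size)]
    return sum(local_cooccurrence(b, e1, e2) for b in blocks) / len(blocks)
```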
5. The method of any of claims 1-4, wherein the document relationship information is represented by a document relationship graph, and wherein generating the vector representation of the document to be classified based on the document relationship information comprises:
acquiring an initial vector representation of the document to be classified; and
updating the initial vector representation through a graph neural network based on the document relational graph to obtain the vector representation.
6. The method of claim 5, wherein the initial vector representation is obtained by encoding the document to be classified using a pre-trained encoding model.
7. The method of claim 5 or 6, wherein updating the initial vector representation by a graph neural network based on the document relationship graph comprises:
determining at least one neighbor node of the document to be classified based on the document relation graph; and
inputting the adjacency matrix of the document relational graph, the initial vector representation and the vector representation of each of the at least one neighbor node into the graph neural network to obtain the vector representation of the document to be classified output by the graph neural network.
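The update of claim 7 can be sketched as one message-passing step. This is a simplification under stated assumptions: the new vector of the document to be classified is a weighted average of its own initial vector (with an assumed self-loop weight of 1.0) and its neighbors' vectors, weighted by the adjacency matrix; a real graph neural network would additionally apply learned weight matrices and a nonlinearity.

```python
def gnn_update(adj, features, doc_idx, neighbor_idx):
    """adj: edge-weight adjacency matrix (list of lists);
    features: one vector per node; doc_idx: node of the document to be
    classified; neighbor_idx: its neighbor nodes in the relation graph.
    Returns the updated vector representation of the document."""
    dim = len(features[doc_idx])
    nodes = [doc_idx] + list(neighbor_idx)
    # self-loop weight 1.0 (assumed), neighbor weights from the matrix
    weights = [1.0] + [adj[doc_idx][j] for j in neighbor_idx]
    total = sum(weights)
    return [sum(w * features[n][d] for w, n in zip(weights, nodes)) / total
            for d in range(dim)]
```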
8. The method of any of claims 1-7, wherein the target category corresponds to a target entity in a knowledge-graph.
9. A document classification apparatus, comprising:
an acquisition module configured to acquire document relation information of a document to be classified, wherein the document relation information represents an association relationship between the document to be classified and a plurality of classified documents;
a representation module configured to generate a vector representation of the document to be classified based on the document relation information; and
a classification module configured to determine a target class to which the document to be classified belongs based on the vector representation.
10. The apparatus of claim 9, wherein the document to be classified and each classified document each include a plurality of content elements, and the document to be classified is associated with a classified document through the same content elements that both documents include.
11. The apparatus of claim 9 or 10, wherein the document relationship information is represented by a document relationship graph, the apparatus further comprising:
an acquisition unit configured to acquire a plurality of content elements included in the document to be classified and each classified document, respectively;
a determination unit configured to determine a first degree of correlation between any two content elements, and a second degree of correlation between any one content element and any one document; and
a construction unit configured to construct the document relation graph based on the first relevance and the second relevance, wherein the document relation graph comprises a plurality of nodes and connecting edges among the plurality of nodes, each node corresponds to one document or one content element, the weight of a connecting edge between two content elements indicates the first relevance between the two content elements, and the weight of a connecting edge between a content element and a document indicates the second relevance between the content element and the document.
12. The apparatus of claim 11, wherein the determining unit is further configured to:
dividing each document into at least one document block to obtain a plurality of document blocks;
for any document block of the plurality of document blocks, determining a local relevance of two content elements within the document block; and
taking the average value of the local relevance of the two content elements over the plurality of document blocks as the first relevance of the two content elements.
13. The apparatus of any of claims 9-12, wherein the document relationship information is represented by a document relationship graph, and wherein the representation module further comprises:
an initialization unit configured to obtain an initial vector representation of the document to be classified; and
an updating unit configured to update the initial vector representation through a graph neural network based on the document relation graph to obtain the vector representation.
14. The apparatus of claim 13, wherein the initialization unit is further configured to: encode the document to be classified using a pre-trained encoding model to obtain the initial vector representation.
15. The apparatus of claim 13 or 14, wherein the updating unit is further configured to:
determining at least one neighbor node of the document to be classified based on the document relation graph; and
inputting the adjacency matrix of the document relational graph, the initial vector representation and the vector representation of each of the at least one neighbor node into the graph neural network to obtain the vector representation of the document to be classified output by the graph neural network.
16. The apparatus of any of claims 9-15, wherein the target category corresponds to a target entity in a knowledge-graph.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program, wherein the computer program realizes the method of any one of claims 1-8 when executed by a processor.
CN202111552308.3A 2021-12-17 2021-12-17 Document classification method and device, electronic equipment and medium Pending CN114281990A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111552308.3A CN114281990A (en) 2021-12-17 2021-12-17 Document classification method and device, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111552308.3A CN114281990A (en) 2021-12-17 2021-12-17 Document classification method and device, electronic equipment and medium

Publications (1)

Publication Number Publication Date
CN114281990A true CN114281990A (en) 2022-04-05

Family

ID=80872868

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111552308.3A Pending CN114281990A (en) 2021-12-17 2021-12-17 Document classification method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114281990A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115879515A (en) * 2023-02-20 2023-03-31 江西财经大学 Document network theme modeling method, variation neighborhood encoder, terminal and medium


Similar Documents

Publication Publication Date Title
CN113836333A (en) Training method of image-text matching model, method and device for realizing image-text retrieval
US10891322B2 (en) Automatic conversation creator for news
US11182557B2 (en) Driving intent expansion via anomaly detection in a modular conversational system
WO2019118007A1 (en) Domain-specific natural language understanding of customer intent in self-help
WO2022141968A1 (en) Object recommendation method and apparatus, computer device, and medium
CN113989593A (en) Image processing method, search method, training method, device, equipment and medium
CN113656587B (en) Text classification method, device, electronic equipment and storage medium
CN114595686B (en) Knowledge extraction method, and training method and device of knowledge extraction model
CN113407850B (en) Method and device for determining and acquiring virtual image and electronic equipment
CN115168545A (en) Group searching method, device, electronic equipment and medium
CN112506864B (en) File retrieval method, device, electronic equipment and readable storage medium
CN114281990A (en) Document classification method and device, electronic equipment and medium
JP7369228B2 (en) Method, device, electronic device, and storage medium for generating images of user interest
CN114048383A (en) Information recommendation method and device, electronic equipment and medium
CN113806541A (en) Emotion classification method and emotion classification model training method and device
CN113407579A (en) Group query method and device, electronic equipment and readable storage medium
CN112954025B (en) Information pushing method, device, equipment and medium based on hierarchical knowledge graph
CN115809364B (en) Object recommendation method and model training method
CN115033782B (en) Object recommendation method, training method, device and equipment of machine learning model
CN113378781B (en) Training method and device of video feature extraction model and electronic equipment
CN113901314A (en) Method, apparatus, device and medium for processing user query content
CN115829653A (en) Method, device, equipment and medium for determining relevancy of advertisement text
CN114596574A (en) Text recognition method and device, electronic equipment and medium
WO2024035416A1 (en) Machine-learned models for multimodal searching and retrieval of images
CN114169440A (en) Model training method, data processing method, device, electronic device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination