CN115422358A - Document classification method, apparatus, electronic device, medium, and computer program product - Google Patents

Document classification method, apparatus, electronic device, medium, and computer program product

Info

Publication number
CN115422358A
CN115422358A (application CN202211068142.2A)
Authority
CN
China
Prior art keywords
document
event
keywords
word segmentation
event document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211068142.2A
Other languages
Chinese (zh)
Inventor
李铖
邱琳
钟其
陈睿丰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211068142.2A
Publication of CN115422358A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method, apparatus, electronic device, medium, and computer program product for knowledge-graph-based document classification, which can be used in the technical field of artificial intelligence. The knowledge-graph-based document classification method includes: performing word segmentation on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document, where the m×nᵢ event documents come from m event document sets, each event document set includes nᵢ event documents, m is an integer greater than or equal to 1, nᵢ is an integer greater than or equal to 0, i is an integer from 1 to m, nⱼ is an integer greater than or equal to 1, and j is an integer from 1 to m×nᵢ; determining keywords of each event document according to the word segmentation results; calculating, according to the keywords, the sentence similarity of every two event documents from two different event document sets; and constructing, according to the sentence similarities, document knowledge graphs each based on the same event.

Description

Document classification method, apparatus, electronic device, medium, and computer program product
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, an electronic device, a medium, and a computer program product for classifying documents based on a knowledge graph.
Background
At present, enterprises commonly establish independent information systems in each department and subordinate unit, which achieve effective electronic processing within each department. However, as informatization advances, more and more information needs to flow across platforms and departments and to be managed in a unified way at a higher level. In practice, however, because the systems were built at different times by different developers, and because the technical platforms and functions of different departments or business units differ, the information system of each department or business unit has its own independent event numbers. Moreover, since the information is entered manually, the titles of the same matter usually have similar semantics but different wording. Information therefore cannot circulate in a unified format, and information silos form. In actual work, a workflow often requires several different departments to cooperate, and when information cannot circulate in a unified format, the work efficiency of each department or business unit drops greatly and the work progress suffers.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for classifying documents based on a knowledge graph, which are efficient and easy to manage.
One aspect of the present disclosure provides a method for classifying documents based on a knowledge graph, including: performing word segmentation on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document, where the m×nᵢ event documents come from m event document sets, each event document set includes nᵢ event documents, m is an integer greater than or equal to 1, nᵢ is an integer greater than or equal to 0, i is an integer from 1 to m, nⱼ is an integer greater than or equal to 1, and j is an integer from 1 to m×nᵢ; determining keywords of each event document according to the word segmentation results; calculating, according to the keywords, the sentence similarity of every two event documents from two different event document sets; and constructing, according to the sentence similarities, document knowledge graphs each based on the same event.
According to the knowledge-graph-based document classification method of the embodiments of the present disclosure, word segmentation is performed on each of the obtained m×nᵢ event documents, so that nⱼ word segmentation results can be obtained for each event document; the keywords of each event document are determined according to the word segmentation results; according to the keywords, the sentence similarity of every two event documents from two different event document sets can be calculated; and according to the sentence similarities, document knowledge graphs each based on the same event can be constructed. In this way, a plurality of event documents covering a plurality of business matters can be built into document knowledge graphs corresponding one-to-one to the business matters, so that a workflow requiring the cooperation of several different departments can circulate in a unified format, improving the work efficiency of each department or business unit, accelerating work progress, and facilitating management by enterprise managers.
In some embodiments, performing word segmentation on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document includes: obtaining m event document sets, each event document set including nᵢ event documents; merging the m event document sets to obtain a document text set; and performing word segmentation on each event document in the document text set to obtain nⱼ word segmentation results for each event document.
In some embodiments, determining the keywords of each event document according to the word segmentation results includes: calculating a weight value for each word segmentation result of each event document; and comparing the weight value with a preset weight threshold, and taking the word segmentation results whose weight values satisfy the weight threshold as the keywords of the corresponding event document.
In some embodiments, one of the two event documents from two different event document sets includes a keywords and is taken as the first document, the other includes b keywords and is taken as the second document, and a and b are both integers greater than or equal to 1. Calculating the sentence similarity of every two event documents from two different event document sets according to the keywords then includes: when a is greater than or equal to b, calculating the word contribution degree between each of the a keywords in the first document and each of the b keywords in the second document; for each of the a keywords, taking the first-ranked (largest) of its b word contribution degrees and summing these largest values; and dividing the sum by a to obtain the average, which is taken as the sentence similarity of the two event documents from the two different event document sets.
In some embodiments, calculating the word contribution degree between a keyword in the first document and a keyword in the second document includes: when the keyword in the first document is the same as the keyword in the second document, the word contribution degree between them is the product of the weight values of the two keywords; when the keyword in the first document differs from the keyword in the second document and the cosine similarity between them can be found in a pre-training dictionary, the word contribution degree between them is the product of the cosine similarity and the weight values of the two keywords; and when the keyword in the first document differs from the keyword in the second document and the cosine similarity between them cannot be found in the pre-training dictionary, the word contribution degree between them is 0.
In some embodiments, constructing document knowledge graphs based on the same event according to the sentence similarities includes: step one, traversing all the sentence similarities, sorting those greater than 0 in descending order, and constructing a knowledge graph with the first-ranked sentence similarity as a connecting edge and the two event documents corresponding to it as nodes; step two, sorting in descending order the sentence similarities greater than 0 that relate to the edge nodes of the knowledge graph, taking the first-ranked sentence similarity as a connecting edge and the other event document corresponding to it as a new edge node, thereby expanding the knowledge graph; step three, executing step two in a loop until no new edge node can be found, and taking the resulting knowledge graph as a document knowledge graph based on the same event; and step four, executing steps one to three in a loop until every sentence similarity greater than 0 has been used as a connecting edge in a document knowledge graph.
In some embodiments, constructing document knowledge graphs based on the same event according to the sentence similarities further includes: step five, constructing a document knowledge graph with an event document whose sentence similarities are all equal to 0 as its node.
Another aspect of the present disclosure provides a knowledge-graph-based document classification apparatus, including: a word segmentation module configured to perform word segmentation on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document, where the m×nᵢ event documents come from m event document sets, each event document set includes nᵢ event documents, m is an integer greater than or equal to 1, nᵢ is an integer greater than or equal to 0, i is an integer from 1 to m, nⱼ is an integer greater than or equal to 1, and j is an integer from 1 to m×nᵢ; a determining module configured to determine the keywords of each event document according to the word segmentation results; a calculation module configured to calculate, according to the keywords, the sentence similarity of every two event documents from two different event document sets; and a construction module configured to construct, according to the sentence similarities, document knowledge graphs each based on the same event.
Another aspect of the present disclosure provides an electronic device comprising one or more processors and one or more memories, wherein the memories are used for storing executable instructions, which when executed by the processors, implement the method as described above.
Another aspect of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of the disclosure provides a computer program product comprising a computer program comprising computer executable instructions for implementing the method as described above when executed.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of the embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an exemplary system architecture to which the method, apparatus, electronic device, medium, and computer program product may be applied, in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of knowledge-graph based document classification in accordance with an embodiment of the present disclosure;
FIG. 3 schematically shows a flowchart of performing word segmentation on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document, according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a flow chart for determining keywords for each event document according to the word segmentation result according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart for calculating sentence similarity for every two event documents from two different sets of event documents according to keywords according to an embodiment of the present disclosure;
FIG. 6 schematically shows a flow diagram for calculating a word contribution between a keyword in a first document and a keyword in a second document according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a flow chart of separately constructing document knowledge-graphs based on the same event according to sentence similarity, according to an embodiment of the present disclosure;
FIG. 8 schematically shows a schematic diagram of a knowledge-graph according to an embodiment of the present disclosure;
FIG. 9 schematically shows a flowchart of constructing document knowledge graphs based on the same event according to sentence similarity, according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a flow diagram of a method of knowledge-graph based document categorization according to an embodiment of the disclosure;
FIG. 11 schematically illustrates a flow diagram of a document categorization method according to an embodiment of the disclosure;
FIG. 12 is a block diagram that schematically illustrates an arrangement of a knowledge-graph based document categorization apparatus according to an embodiment of the present disclosure;
FIG. 13 schematically illustrates a block diagram of the structure of a word segmentation module according to an embodiment of the disclosure;
FIG. 14 schematically illustrates a block diagram of the structure of a determination module according to an embodiment of the present disclosure;
FIG. 15 schematically shows a block diagram of a computing module, according to an embodiment of the present disclosure;
fig. 16 schematically shows a block diagram of a second computing unit according to an embodiment of the present disclosure;
FIG. 17 schematically shows a block diagram of the structure of a construction module according to an embodiment of the disclosure;
FIG. 18 schematically shows a block diagram of the structure of a construction module according to another embodiment of the disclosure;
FIG. 19 schematically shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated. Likewise, the acquisition, collection, storage, use, processing, transmission, provision, disclosure, and application of data all comply with the provisions of relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
In those instances where a convention analogous to "at least one of A, B, or C, etc." is used, such a construction is generally intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together). The terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" or "second" may explicitly or implicitly include one or more of the described features.
At present, enterprises commonly establish independent information systems in each department and subordinate unit, which achieve effective electronic processing within each department. However, as informatization advances, more and more information needs to flow across platforms and departments and to be managed in a unified way at a higher level. In practice, however, because the systems were built at different times by different developers, and because the technical platforms and functions of different departments or business units differ, the information system of each department or business unit has its own independent event numbers. Moreover, since the information is entered manually, the titles of the same matter usually have similar semantics but different wording. Information therefore cannot circulate in a unified format, and information silos form. In actual work, a workflow often requires several different departments to cooperate, and when information cannot circulate in a unified format, the work efficiency of each department or business unit drops greatly and the work progress suffers.
Embodiments of the present disclosure provide a method, apparatus, electronic device, computer-readable storage medium, and computer program product for knowledge-graph-based document classification. The knowledge-graph-based document classification method includes: performing word segmentation on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document, where the m×nᵢ event documents come from m event document sets, each event document set includes nᵢ event documents, m is an integer greater than or equal to 1, nᵢ is an integer greater than or equal to 0, i is an integer from 1 to m, nⱼ is an integer greater than or equal to 1, and j is an integer from 1 to m×nᵢ; determining keywords of each event document according to the word segmentation results; calculating, according to the keywords, the sentence similarity of every two event documents from two different event document sets; and constructing, according to the sentence similarities, document knowledge graphs each based on the same event.
It should be noted that the knowledge-graph-based document classification method, apparatus, electronic device, computer-readable storage medium, and computer program product of the present disclosure may be used in the field of artificial intelligence technology, and may also be used in any field other than artificial intelligence, such as the field of finance; the application field is not limited herein.
FIG. 1 schematically illustrates an exemplary system architecture 100 to which the method, apparatus, electronic device, computer-readable storage medium, and computer program product for knowledge-graph based document classification may be applied, according to embodiments of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The backend management server may analyze and process the received data such as the user request, and feed back a processing result (for example, a web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that the method for classifying documents based on knowledge-graphs provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the knowledge-graph based document classification apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The knowledge-graph based document classification methods provided by embodiments of the present disclosure may also be performed by a server or server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the knowledge-graph based document classification apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The method for classifying documents based on knowledge-graph according to the embodiment of the present disclosure will be described in detail below with reference to fig. 2 to 10 based on the scenario described in fig. 1.
FIG. 2 schematically shows a flowchart of a knowledge-graph based document categorization method according to an embodiment of the disclosure.
As shown in FIG. 2, the method for classifying a document based on a knowledge-graph of the embodiment includes operations S210 to S240.
In operation S210, word segmentation is performed on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document, where the m×nᵢ event documents come from m event document sets, each event document set includes nᵢ event documents, m is an integer greater than or equal to 1, nᵢ is an integer greater than or equal to 0, i is an integer from 1 to m, nⱼ is an integer greater than or equal to 1, and j is an integer from 1 to m×nᵢ.
As one practical way, as shown in fig. 3, operation S210 of performing word segmentation on each of the obtained m×nᵢ event documents to obtain nⱼ word segmentation results for each event document includes operations S211 to S213.
In operation S211, m event document sets are obtained, each event document set including nᵢ event documents. Taking an enterprise with m departments, each having an independent information system, as an example, a corresponding event document set can be obtained from each department. Each event document set contains the business matters of that department; the number of business matters may be 1, several, or 0, depending on whether the department has business activity, and each business matter is one event document. In this way, the m event document sets of the m departments, each including nᵢ event documents, can be obtained.
In operation S212, the m event document sets are merged to obtain a document text set.
In operation S213, word segmentation is performed on each event document in the document text set to obtain nⱼ word segmentation results for each event document. Furthermore, after each event document is segmented, punctuation marks and repeated words are removed for preliminary information concentration, and the concentrated segmentation results are taken as the nⱼ word segmentation results of the event document.
The following is an example of the science and technology department initiating a procurement; it is for illustration only and should not be construed as limiting the present disclosure.
The event document obtained from the information system of the science and technology department is "Fee application for the 2022 renewal of the core network equipment MA of Branch A".

The event document obtained from the information system of the project review department is "Request of the science and technology department concerning the 2022 renewal fee of the core network equipment MA of Branch A".

The event document obtained from the information system of the bidding department is "Core network equipment MA renewal project of Branch A of a certain bank in City R for 2022; shortlisted company: B Network Systems Co., Ltd.".

The event document obtained from the information system of the contract management department is "Professional technical service contract for the 2022 core network equipment MA renewal project of Branch A in City R".

The event document obtained from the information system of the financial department is "Core network equipment MA renewal service fee for the fourth quarter of 2022; amount: RMB 5,000,000".

For example, after word segmentation, the event document of the science and technology department yields 6 word segmentation results: "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application". The event document of the project review department yields 8 word segmentation results: "science and technology department", "about", "2022", "Branch A", "core network equipment MA", "renewal", "fee", and "request". The word segmentation of the bidding department, the contract management department, and the financial department proceeds in the same way as for the science and technology department and the project review department, and is not repeated here.
Operations S211 to S213 make it easy to perform word segmentation on each of the obtained m×nᵢ event documents and to obtain nⱼ word segmentation results for each event document.
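For illustration only, operations S211 to S213 might be sketched in Python as follows. The jieba tokenizer and all identifiers here are assumptions; the disclosure does not name a specific word segmentation engine.

```python
# Minimal sketch of operations S211-S213 (assumptions, not the filed implementation).
import string
import jieba  # a common Chinese tokenizer; any segmentation engine would do

CN_PUNCT = "，。、；：？！（）《》“”‘’"

def segment_event_document(text: str) -> list[str]:
    """Segment one event document, drop punctuation, and de-duplicate tokens
    (the preliminary information concentration of operation S213)."""
    seen, result = set(), []
    for tok in jieba.lcut(text):
        tok = tok.strip()
        if not tok or tok in string.punctuation or tok in CN_PUNCT:
            continue
        if tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result

# Operations S211 and S212: obtain the m event document sets and merge them.
event_document_sets = [
    ["关于2022年A分行核心网络设备MA续保的费用申请"],        # science and technology dept.
    ["科技部关于2022年A分行核心网络设备MA续保费用的请示"],   # project review dept.
]
document_text_set = [doc for s in event_document_sets for doc in s]
segmented = [segment_event_document(doc) for doc in document_text_set]
```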
In operation S220, a keyword of each event document is determined according to the word segmentation result.
As a possible implementation manner, as shown in fig. 4, operation S220 determines a keyword of each event document according to the word segmentation result, including operation S221 and operation S222.
In operation S221, a weight value is calculated for each word segmentation result of each event document. It should be noted that the weight value of each word segmentation result can be calculated from its proportion in the document text set: the higher the proportion, the more general the word, the lower its distinguishing power, and the smaller its weight value; the lower the proportion, the less general the word, the higher its distinguishing power, and the larger its weight value.
In operation S222, the weight value is compared with a preset weight threshold, and the word segmentation results whose weight values satisfy the weight threshold are taken as the keywords of the corresponding event document. For example, the weight threshold may be set to a single value: when a weight value is greater than the threshold, the corresponding word segmentation result is taken as a keyword of the corresponding event document. The weight threshold may also be set to a numerical range: when a weight value falls within the range, the corresponding word segmentation result is taken as a keyword of the corresponding event document. Operations S221 and S222 make it convenient to determine the keywords of each event document according to the word segmentation results.
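A minimal sketch of operations S221 and S222 follows, assuming a plain TF-IDF weighting (the detailed example later in this description names TF-IDF); the exact weighting formula and the threshold value are assumptions and depend on the corpus.

```python
import math
from collections import Counter

def extract_keywords(segmented_docs: list[list[str]],
                     weight_threshold: float = 4.5) -> list[dict[str, float]]:
    """Operation S221: weight each word segmentation result by TF-IDF over the
    merged document text set.  Operation S222: keep the results whose weight
    exceeds the preset threshold as the document's keywords."""
    n_docs = len(segmented_docs)
    doc_freq = Counter(tok for doc in segmented_docs for tok in set(doc))
    keywords = []
    for doc in segmented_docs:
        kw = {}
        for tok, tf in Counter(doc).items():
            weight = tf * math.log((1 + n_docs) / (1 + doc_freq[tok]))
            if weight > weight_threshold:  # the threshold is corpus-dependent
                kw[tok] = weight
        keywords.append(kw)
    return keywords
```

The default threshold of 4.5 mirrors the value quoted in the detailed example below; with this particular TF-IDF variant a different value would likely be appropriate.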
In operation S230, sentence similarity of every two event documents from two different event document sets is calculated according to the keyword.
As a possible implementation, one of the two event documents from two different event document sets includes a keywords and is taken as the first document, the other includes b keywords and is taken as the second document, and a and b are both integers greater than or equal to 1. As shown in fig. 5, operation S230 of calculating the sentence similarity of every two event documents from two different event document sets according to the keywords includes operations S231 to S233.
In operation S231, when a is greater than or equal to b, word contribution degrees between each keyword in the first document and b keywords in the second document are respectively calculated.
Take as an example the event document "Fee application for the 2022 renewal of the core network equipment MA of Branch A" and the event document "Core network equipment MA renewal project of Branch A of a certain bank in City R for 2022; shortlisted company: B Network Systems Co., Ltd.".

Assume that 6 keywords are determined for the event document "Fee application for the 2022 renewal of the core network equipment MA of Branch A": "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application". Assume that 8 keywords are determined for the event document "Core network equipment MA renewal project of Branch A of a certain bank in City R for 2022; shortlisted company: B Network Systems Co., Ltd.": "a certain bank", "City R", "Branch A", "2022", "core network equipment MA", "renewal project", "shortlisted company", and "B Network Systems Co., Ltd.". Since 8 is greater than 6, the latter event document is taken as the first document, and the event document "Fee application for the 2022 renewal of the core network equipment MA of Branch A" is taken as the second document.
The word contribution degrees between "a certain bank" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be α₁, α₂, α₃, α₄, α₅, and α₆, respectively.

The word contribution degrees between "City R" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be β₁, β₂, β₃, β₄, β₅, and β₆, respectively.

The word contribution degrees between "Branch A" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be γ₁, γ₂, γ₃, γ₄, γ₅, and γ₆, respectively.

The word contribution degrees between "2022" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be δ₁, δ₂, δ₃, δ₄, δ₅, and δ₆, respectively.

The word contribution degrees between "core network equipment MA" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be ε₁, ε₂, ε₃, ε₄, ε₅, and ε₆, respectively.

The word contribution degrees between "renewal project" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be η₁, η₂, η₃, η₄, η₅, and η₆, respectively.

The word contribution degrees between "shortlisted company" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be θ₁, θ₂, θ₃, θ₄, θ₅, and θ₆, respectively.

The word contribution degrees between "B Network Systems Co., Ltd." and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" are calculated and assumed to be λ₁, λ₂, λ₃, λ₄, λ₅, and λ₆, respectively.
In operation S232, for each of the a keywords of the first document, its b word contribution degrees are sorted in descending order and the first-ranked (largest) one is taken; these first-ranked contributions are then summed.
The word contribution degrees α₁, α₂, α₃, α₄, α₅, and α₆ can be sorted in descending order; assume the first-ranked one is α₃. Likewise, assume that the first-ranked of β₁ through β₆ is β₃, that the first-ranked of γ₁ through γ₆ is γ₃, and that the first-ranked of δ₁ through δ₆ is δ₂.

Assume further that the first-ranked of ε₁ through ε₆ is ε₄, that the first-ranked of η₁ through η₆ is η₅, that the first-ranked of θ₁ through θ₆ is θ₄, and that the first-ranked of λ₁ through λ₆ is λ₄.
Therefore, the sum T = α₃ + β₃ + γ₃ + δ₂ + ε₄ + η₅ + θ₄ + λ₄ can be obtained.
In operation S233, the sum is averaged, and the average is taken as the sentence similarity of the two event documents from the two different event document sets. It can be appreciated that, since the first document has 8 keywords, the average = T/8. Operations S231 to S233 make it convenient to calculate the sentence similarity of every two event documents from two different event document sets according to the keywords.
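As a sketch (an assumption, not the filed implementation), operations S231 to S233 might read as follows, with `contribution` standing for the word-contribution function of operations S2311 to S2313 described next:

```python
def sentence_similarity(kw_a: dict[str, float],
                        kw_b: dict[str, float],
                        contribution) -> float:
    """S231: take the document with more keywords as the first document.
    S232: for each keyword of the first document keep only its largest
    contribution against the second document's keywords, then sum the maxima.
    S233: average over the number of keywords in the first document."""
    first, second = (kw_a, kw_b) if len(kw_a) >= len(kw_b) else (kw_b, kw_a)
    total = sum(
        max(contribution(w1, h1, w2, h2) for w2, h2 in second.items())
        for w1, h1 in first.items()
    )
    return total / len(first)
```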
According to some embodiments of the present disclosure, as shown in fig. 6, the operation S231 of calculating the word contribution degree between the keyword in the first document and the keyword in the second document includes operations S2311 to S2313.
In operation S2311, when the keyword in the first document is the same as the keyword in the second document, the word contribution degree between them is the product of the weight values of the two keywords.

In operation S2312, when the keyword in the first document differs from the keyword in the second document and the cosine similarity between them can be found in the pre-training dictionary, the word contribution degree between them is the product of the cosine similarity and the weight values of the two keywords.

In operation S2313, when the keyword in the first document differs from the keyword in the second document and the cosine similarity between them cannot be found in the pre-training dictionary, the word contribution degree between them is 0. The pre-training dictionary is a pre-constructed technical dictionary in which the cosine similarities between words are recorded.
In the following, the calculation of the word contribution degrees between "Branch A" and each of "about", "2022", "Branch A", "core network equipment MA", "renewal", and "fee application" is taken as an example. Assume the weight value of "Branch A" is h₁, the weight value of "about" is h₂, the weight value of "2022" is h₃, the weight value of "core network equipment MA" is h₄, the weight value of "renewal" is h₅, and the weight value of "fee application" is h₆.

"Branch A" differs from "about", and it is assumed that the pre-training dictionary does not record the cosine similarity between "Branch A" and "about", so the word contribution degree γ₁ between them is 0.

"Branch A" differs from "2022", and it is assumed that the pre-training dictionary does not record the cosine similarity between "Branch A" and "2022", so the word contribution degree γ₂ between them is 0.

"Branch A" is the same as "Branch A", so the word contribution degree γ₃ between them is h₁².

"Branch A" differs from "core network equipment MA", and it is assumed that the pre-training dictionary records the cosine similarity between "Branch A" and "core network equipment MA" as distance(aᵢ, bⱼ). The word contribution degree γ₄ between "Branch A" and "core network equipment MA" is therefore the cosine similarity distance(aᵢ, bⱼ) acting on the weight values h₁ and h₄; specifically, γ₄ = distance(aᵢ, bⱼ) × h₁ × h₄.

"Branch A" differs from "renewal", and it is assumed that the pre-training dictionary does not record the cosine similarity between "Branch A" and "renewal", so the word contribution degree γ₅ between them is 0.

"Branch A" differs from "fee application", and it is assumed that the pre-training dictionary does not record the cosine similarity between "Branch A" and "fee application", so the word contribution degree γ₆ between them is 0.
Calculating the word contribution degree between the keyword in the first document and the keyword in the second document may be facilitated through operations S2311 to S2313.
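A corresponding sketch of operations S2311 to S2313 might look as follows; the `pretrained` mapping stands in for the pre-training dictionary of cosine similarities and is an assumption:

```python
def word_contribution(w1: str, h1: float, w2: str, h2: float,
                      pretrained: dict[frozenset, float]) -> float:
    """Word contribution degree between one keyword of the first document and
    one keyword of the second document, per operations S2311 to S2313."""
    if w1 == w2:                               # S2311: identical keywords
        return h1 * h2
    cos = pretrained.get(frozenset((w1, w2)))  # S2312: similarity on record
    if cos is not None:
        return cos * h1 * h2
    return 0.0                                 # S2313: no recorded similarity
```

With `functools.partial(word_contribution, pretrained=...)`, this plugs directly into the `sentence_similarity` sketch above.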
In operation S240, document knowledge graphs based on the same event are respectively constructed according to the sentence similarity.
As a possible implementation manner, as shown in fig. 7, operation S240 constructs document knowledge graphs based on the same event according to the sentence similarity, respectively, including operations S241 to S244.
In operation S241, in step one, traversing all the sentence similarities, taking the sentence similarity greater than 0 in descending order, taking the first-ranked sentence similarity as a connecting edge, and taking two event documents corresponding to the sentence similarity as nodes to construct a knowledge graph. It can be understood that, in the document text set, each two event documents have a sentence similarity, the sentence similarity may be greater than or equal to 0, the sentence similarities in the document text set are sorted to obtain a sorting order, the sentence similarity with the first rank is taken as a connecting edge, the two event documents corresponding to the sentence similarity are taken as nodes, and a knowledge graph may be constructed.
For example, as shown in the knowledge graph of fig. 8, assuming that the sentence similarity 1 constructing the connecting edge 1 is the sentence similarity ranked first and the sentence similarity 1 is the sentence similarity between the event document 1 and the event document 2, the knowledge graph may be constructed by connecting the event document 1 and the event document 2 with the sentence similarity 1 as the connecting edge, with the event document 1 and the event document 2 as nodes.
In operation S242, in step two, the sentence similarities greater than 0 related to the edge nodes in the knowledge graph are sorted from large to small, the first ranked sentence similarity is taken as a connecting edge, another event document corresponding to the sentence similarity is taken as a new edge node, and the knowledge graph is expanded. Wherein, an edge node can be understood as a node with only one connecting edge, and in the knowledge graph constructed in the step one, both the event document 1 and the event document 2 are edge nodes.
Therefore, the sentence similarity degrees which are related to the event document 1 and are greater than 0 can be sequenced from large to small, the sentence similarity degree with the first rank is taken as a connecting edge, and the other event document 3 corresponding to the sentence similarity degree is taken as a new edge node; the sentence similarity degrees which are related to the event document 2 and are larger than 0 can be sorted from large to small, the sentence similarity degree with the first rank is taken as a connecting edge, and another event document 4 corresponding to the sentence similarity degree is also taken as a new edge node. The sentence similarity greater than 0 related to the event document 1 referred to herein is a sentence similarity greater than 0 other than the sentence similarity between the event document 1 and the event document 2; the sentence similarity greater than 0 related to the event document 2 is the sentence similarity greater than 0 other than the sentence similarity between the event document 1 and the event document 2.
In operation S243, step three, step two is executed in a loop until no new edge node can be found, and the knowledge graph in step two is taken as the document knowledge graph based on the same event. Repeating step two, for example, the event document 5 can be found as a new edge node having a continuous edge with the event document 3, and the event document 6 can be found as a new edge node having a continuous edge with the event document 4. Referring to fig. 8, this is by way of illustration only and should not be construed to limit the present disclosure.
In operation S244, step four, the first step to the third step are executed in a loop until all the sentence similarities greater than 0 are used as continuous edges in the document knowledge graph. It can be understood that a plurality of event documents are included in the document text set, the plurality of event documents cover a plurality of business matters, a plurality of event documents based on the same business matters can be constructed into a document knowledge graph by performing the steps one to three, and the document knowledge graph corresponding to the business matters one to one can be constructed by performing the steps one to three in a circulating manner.
The document knowledge graph based on the same event can be conveniently constructed according to the sentence similarity through the operations S241 to S244.
Further, as shown in fig. 9, the operation S240 of respectively constructing document knowledge graphs based on the same event according to the sentence similarity further includes an operation S245.
In operation S245, step five, a document knowledge graph is constructed with the event documents corresponding to the sentence similarity equal to 0 as nodes. It is understood that a sentence similarity equal to 0 may indicate that there are no other event documents related to the event document, that the business matters corresponding to the event document may relate to only one department, and thus a document knowledge graph with only one node may be constructed with the event document as a node. The method for constructing the document knowledge graph based on the same event respectively according to the sentence similarity can be further improved through operation S245.
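The graph construction of steps one to five might be sketched as follows. The frontier rule is simplified here (any remaining edge touching exactly one node already in the graph is a candidate), which is an assumption about how the edge nodes are scanned:

```python
def build_document_graphs(similarities: dict[frozenset, float],
                          all_documents: list[str]) -> list[tuple[set, list]]:
    """Steps one to four: greedily grow one knowledge graph per business matter
    from the highest remaining sentence similarity.  Step five: documents with
    no positive similarity become single-node graphs."""
    remaining = {pair: s for pair, s in similarities.items() if s > 0}
    graphs = []
    while remaining:
        seed = max(remaining, key=remaining.get)     # step one
        nodes, edges = set(seed), [seed]
        del remaining[seed]
        while True:                                  # steps two and three
            frontier = [p for p in remaining if len(p & nodes) == 1]
            if not frontier:
                break
            best = max(frontier, key=remaining.get)
            nodes |= best
            edges.append(best)
            del remaining[best]
        graphs.append((nodes, edges))
    placed = set().union(*(n for n, _ in graphs)) if graphs else set()
    for doc in all_documents:                        # step five
        if doc not in placed:
            graphs.append(({doc}, []))
    return graphs
```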
In some embodiments of the present disclosure, as shown in fig. 10, the method of classifying a knowledge-graph-based document may further include operation S250.
In operation S250, each document knowledge graph based on the same event is mapped to a row or a column of a table for presentation. In this way, senior managers can see the work progress and workflow of each department at a glance, which facilitates management.
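Operation S250 then amounts to flattening each graph into one row; a trivial sketch under the same assumptions as above:

```python
def graphs_to_rows(graphs: list[tuple[set, list]]) -> list[list[str]]:
    """One table row per same-event knowledge graph; the columns are the
    member event documents, e.g. sorted for stable presentation."""
    return [sorted(nodes) for nodes, _ in graphs]
```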
According to the knowledge-graph-based document classification method of the embodiments of the present disclosure, word segmentation is performed on each of the obtained m×nᵢ event documents, so that nⱼ word segmentation results can be obtained for each event document; the keywords of each event document are determined according to the word segmentation results; according to the keywords, the sentence similarity of every two event documents from two different event document sets can be calculated; and according to the sentence similarities, document knowledge graphs each based on the same event can be constructed. In this way, a plurality of event documents covering a plurality of business matters can be built into document knowledge graphs corresponding one-to-one to the business matters, so that a workflow requiring the cooperation of several different departments can circulate in a unified format, improving the work efficiency of each department or business unit, accelerating work progress, and facilitating management by enterprise managers.
In addition, the knowledge-graph-based document classification method needs neither a pre-specified number of document classes nor an additional learning process; it can classify event documents by semantic information under unsupervised conditions, and is highly functional. It requires no expensive special-purpose computing equipment and can run quickly on an ordinary terminal, so it is easy to deploy and highly practical. It can also be conveniently integrated into and embedded in current mainstream business processing flows and frameworks, so it generalizes well.
A document classification method according to an embodiment of the present disclosure is described in detail below with reference to fig. 11. It is to be understood that the following description is illustrative only and is not intended as a specific limitation of the disclosure.
Taking a single procurement initiated by the science and technology department as an example, the process shown in Table 1 is involved. These matters are handled in different business systems of different business departments, each system keeps its own independent event numbers, and, because the information is entered manually, the titles usually have similar semantics but different wording. These systems can generally output summary text data in different formats according to their own specifications. In actual work, however, a workflow often requires several different departments to cooperate, while isolated events are formed within each department.
Therefore, the internal events of each independent department can be associated to form a global view, and the working efficiency can be greatly improved. The core of the method is to identify and analyze the correlation of the event titles, restore the workflow of the whole event through the correlation of the event content, and further show the working progress. With the popularization of digital construction and the progress of artificial intelligence technology, it is gradually possible to convert the traditional data strict matching mode into the natural language semantic matching mode.
For example, the data in Table 1 can be screened and matched based on natural language processing technology; through semantic analysis, the related data can be gathered together to form an event-processing progress table, which the relevant managers can conveniently consult in order to supervise the event-processing progress of the departments involved.
TABLE 1
[Table 1 is shown as an image in the original publication; it lists the event records that each department's independent system produces for the example procurement.]
The document classification method can analyze the event description texts output by different independent systems, and classify according to semantic correlation among the texts so as to restore the workflow progress.
The main flowchart of the document classification method of the present disclosure is shown in fig. 11. The document classification method includes steps 1 to 6.
Step 1: collect the event texts output by each independent system; the events of each business unit are generally marked as one data set. These data sets are combined to form the final title text set.
For example, the project application department outputs an application list in xls, from which the subject column is extracted to form an application data set; from a JSON output, the prjName field is extracted to form an approval data set. By analogy, a title data set of its own is formed for each operation step. These title data sets are then merged to form a global title data set for use in step 4.
Step 2: organize the dictionary of the natural language processing engine and add the hot words of the current year, which helps improve recognition accuracy. For example, according to the current year, hot words such as "line", "big data", and "Xin Chuan" are added.
Step 3: perform word segmentation on the event texts, remove punctuation marks and repeated words from the word segmentation result of each sentence, and perform preliminary information concentration.
Step 4: use the TF-IDF (term frequency-inverse document frequency) algorithm to extract information from the event-text word bags, and set a threshold to extract words with high information weight. For example, for the text "Fee application for the renewal of the core network equipment MA of the xx branch in asdf year", word segmentation yields the result "termlist": "[about/p, asdf/m, year/qt, xx/ns, branch/n, core/n, network equipment/gi, MA/nx, renewal/nz, /ude1, fee/n, application/v]".

The word segmentation results of all titles are then weight-analyzed by the TF-IDF algorithm to extract high-weight words; these are the most distinctive words in a title and carry the largest amount of information. Here a weight greater than 4.5 is used as the threshold, giving "keywords": "[renewal=6.3471075307174685, core=5.653960350157523, network equipment=4.642359438479043]".

The second title, "Request of the xxx department concerning the renewal fee of the core network equipment MA of the xx branch in 20xx", yields the word segmentation result "termlist": "[xxx department/n, about/p, 20xx/m, year/n, branch/n, core/n, network equipment/gi, MA/nx, renewal/nz, fee/n, /ude1, request/v]" and the high-weight words "keywords": "[renewal=6.3471075307174685, core=5.653960350157523, network equipment=4.642359438479043]".
It can be seen that the two texts, although worded differently, reduce to the same core vocabulary.
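The disclosure does not spell out the exact TF-IDF variant beyond the 4.5 weight cut-off, so the following is only a plain sketch over the segmented global title set:

```python
import math
from collections import Counter


def extract_keywords(docs: list[list[str]], threshold: float = 4.5) -> list[dict]:
    """Step 4: return, for each document, the tokens whose TF-IDF weight
    exceeds the threshold; `docs` holds the segmentation results of all
    titles produced in step 3."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))          # document frequency per token
    keywords = []
    for tokens in docs:
        term_freq = Counter(tokens)
        weights = {t: term_freq[t] * math.log(n_docs / doc_freq[t])
                   for t in term_freq}        # tf * idf
        keywords.append({t: w for t, w in weights.items() if w > threshold})
    return keywords
```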
Step 5, calculate the similarity between the keywords of every two titles from different data sets. Taking the longer of the two sentences (the one with more keywords) as the reference, compare each word in the first sentence's word bag with every important word in the second sentence's word bag in turn, take the maximum similarity for each word, and average the sum of these maxima. Each comparison forms an edge in the graph, and the attribute of the edge is the similarity of its two nodes, as shown in formula (1).
$$\mathrm{Sim}(S_1, S_2) = \frac{1}{n}\sum_{i=1}^{n}\max_{1\le j\le m}\mathrm{Sim}\left(W_i^1, W_j^2\right)\qquad(1)$$
wherein n is the number of keywords of the sentence containing more keywords, m is the number of keywords of the other sentence, and Sim(W_i^1, W_j^2) is the semantic similarity of the two words, defined as in Table 2.
TABLE 2
$$\mathrm{Sim}(a,b)=\begin{cases}\mathrm{weight}(a)\cdot\mathrm{weight}(b), & a = b\\ \tanh(\mathrm{distance}(a,b))\cdot\mathrm{weight}(a)\cdot\mathrm{weight}(b), & a \ne b,\ \text{word pair found in the pre-training model}\\ 0, & a \ne b,\ \text{word pair not found}\end{cases}$$
where distance(a, b) is the semantic similarity value provided by the external NLP pre-training model, usually the cosine value or the distance of the two word vectors, normalized with a hyperbolic tangent function.
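Combining formula (1) with Table 2, the step 5 computation might look like the following sketch. The function passed as `dist` stands in for the external NLP pre-training model, which the disclosure does not name; it is assumed to return None for word pairs it cannot resolve.

```python
import math
from typing import Callable, Optional

DistanceFn = Callable[[str, str], Optional[float]]


def word_sim(a: str, b: str, wa: float, wb: float, dist: DistanceFn) -> float:
    """Table 2: semantic similarity of two weighted keywords."""
    if a == b:
        return wa * wb                     # identical keywords
    d = dist(a, b)                         # raw score from the external model
    if d is None:
        return 0.0                         # pair unknown to the model
    return math.tanh(d) * wa * wb          # hyperbolic normalization


def sentence_sim(kw1: dict, kw2: dict, dist: DistanceFn) -> float:
    """Formula (1): average, over the title with more keywords, of each
    keyword's best match against the other title's keywords."""
    if len(kw1) < len(kw2):                # reference = longer keyword set
        kw1, kw2 = kw2, kw1
    if not kw1 or not kw2:
        return 0.0
    total = sum(max(word_sim(a, b, wa, wb, dist) for b, wb in kw2.items())
                for a, wa in kw1.items())
    return total / len(kw1)
```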
For example, through calculation, the similarity data of two titles can be expressed as follows:
{ "id 1 keywords": "[ item =7.4457198193855785, p =7.4457198193855785, implementer =7.040254711277414, cybersecurity =6.529429087511423, xxxx bank =5.740971727147153, full line =5.003372784016374 ]", and "id 2": "30260", "precision": "0.09200375173568089", "id 1": "10190", "id 2 keywords": "[ exterior =7.4457198193855785, resource =7.4457198193855785, research and development =6.3471075307174685, general =6.059425458265688, collective =4.704879795460378, unified =4.612506475329362 ]" }
Wherein the "similarity": "0.09200375173568089" means that the similarity of the core words of the two titles is 0.09, which means that the two titles are not basically related.
Take another set of data as an example:
{ "id 1 keywords": "[ cache =6.752572638825633, content =6.752572638825633 ]", "id 2": "50131", "precision": "44.77711575987957", "id 1": "1021", "id 2 keywords": "[ Contents =6.752572638825633, buffer =6.752572638825633 ]" }
The "similarity" of title 50131 and title 1021 in the set of data reached 44.77711575987957 ", indicating that the two titles are two titles with very high correlation.
Step 6, establish connections between titles with high correlation according to the similarity of their phrase groups, gradually aggregating them into trees; the flows of different matters then form a forest. The steps for creating the correlation forest are briefly as follows (a code sketch follows the list):
a. Select the edge with the largest weight among all free edges; if its weight is greater than a predetermined threshold, use this edge as the seed of a new tree; if the weight is less than the threshold, the algorithm ends.
b. Scan all edges connecting the current tree to outside nodes, sort them by weight, and try them in turn.
c. If the average weight between the new edge's outside node and all nodes already on the tree is greater than the threshold, add the edge and the new node to the tree and re-execute step b; if the average weight is smaller than the threshold, try the next edge; if no untried external edge remains, the creation of this tree ends; return to step a.
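A sketch of the greedy tree growing in steps a to c; the similarity threshold is not fixed by the disclosure, so the default below is purely illustrative.

```python
def build_forest(nodes: list[str], sim, threshold: float = 5.0) -> list[set]:
    """Grow a forest of correlation trees. `sim(u, v)` returns the sentence
    similarity of two titles; `threshold` is an assumed cut-off value."""
    unused = set(nodes)
    forest = []
    while True:
        # Step a: the strongest edge between two free nodes seeds a new tree
        weight, u, v = max(((sim(p, q), p, q)
                            for p in unused for q in unused if p < q),
                           default=(0.0, None, None))
        if weight <= threshold:
            return forest                  # no strong free edge: finished
        tree = {u, v}
        unused -= tree
        grew = True
        while grew:                        # steps b and c: expand the tree
            grew = False
            for x in sorted(unused,
                            key=lambda n: max(sim(n, t) for t in tree),
                            reverse=True):
                # Accept x only if its average similarity to every node
                # already on the tree clears the threshold
                if sum(sim(x, t) for t in tree) / len(tree) > threshold:
                    tree.add(x)
                    unused.discard(x)
                    grew = True
                    break                  # rescan candidate edges
        forest.append(tree)
```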
For example, the program scans all currently free edges and finds the following edge between title 2012 and title 4144, with similarity 41.29, to be the edge with the largest weight. This edge becomes the root of a new tree.
2012/4144 [41.28876695359808] [commercial product=6.752572638825633, bulk=6.752572638825633, off-the-shelf=6.752572638825633, xxx=6.059425458265688, liquidation=6.059425458265688] [bulk=6.752572638825633, off-the-shelf=6.752572638825633, commercial product=6.752572638825633, xxx=6.059425458265688, liquidation=6.059425458265688]
The program then scans all edges attached to node 2012 and node 4144 and selects the one with the highest similarity:
Candidate edge = 2012/50176 [34.413102931542156]
Candidate edge = 2012/50527 [12.652388555457167]
Candidate edge = 2012/50251 [10.433903179072201]
Candidate edge = 2012/4129 [9.18752965796729]
Candidate edge = 4144/50176 [34.413102931542156]
Candidate edge = 4144/50527 [12.652388555457167]
Candidate edge = 4144/50251 [10.433903179072201]
Returned max edge = 4144/50176 [34.413102931542156] [bulk=6.752572638825633, off-the-shelf=6.752572638825633, commercial product=6.752572638825633, xxx=6.059425458265688, liquidation=6.059425458265688] [bulk=6.752572638825633, off-the-shelf=6.752572638825633, commercial product=6.752572638825633, xxx=6.059425458265688, liquidation=6.059425458265688, first payment=5.656035039350523]
The program selects the edge connecting nodes 4144 and 50176 and compares it against all nodes already in the tree; the threshold being satisfied, node 50176 is added, forming a tree with three nodes. The program then continues to loop over the edges adjacent to the three nodes; since no further edge with similarity above the threshold is found, the creation of this tree is declared finished and the program returns to step a to attempt to create a new tree.
The final event progress may be presented as shown in table 3.
TABLE 3
(table reproduced as an image in the original publication)
The present disclosure has the following advantages:
(1) Strong functionality: the number of classification types need not be specified and no additional learning process is required; title phrases can be classified according to semantic information without supervision.
(2) Strong practicability: no expensive special-purpose computing equipment is needed; the functions can be realized quickly on an ordinary PC, so the system is easy to deploy.
(3) Strong generality: the method can be conveniently integrated and embedded into current mainstream business processing flows and frameworks.
Based on the above knowledge-graph-based document classification method, the present disclosure also provides a knowledge-graph-based document classification apparatus 10, which is described in detail below in conjunction with FIGS. 12-18.
FIG. 12 schematically shows a block diagram of the knowledge-graph-based document classification apparatus 10 according to an embodiment of the present disclosure.
The knowledge-graph-based document classification apparatus 10 comprises a word segmentation module 1, a determination module 2, a calculation module 3 and a construction module 4.
A word segmentation module 1, configured to perform operation S210: perform word segmentation on each of the obtained m×n_i event documents to obtain the n_j word segmentation results of each event document, wherein the m×n_i event documents come from m event document sets, each event document set comprises n_i event documents, m is an integer greater than or equal to 1, n_i is an integer greater than or equal to 0, i is an integer greater than or equal to 1 and less than or equal to m, n_j is an integer greater than or equal to 1, and j is an integer greater than or equal to 1 and less than or equal to m×n_i.
A determination module 2, configured to perform operation S220: determine the keywords of each event document according to the word segmentation results.
A calculation module 3, configured to perform operation S230: calculate, according to the keywords, the sentence similarity of every two event documents from two different event document sets.
A construction module 4, configured to perform operation S240: construct, according to the sentence similarities, document knowledge graphs each based on the same event.
Fig. 13 schematically shows a block diagram of the structure of the word segmentation module 1 according to an embodiment of the present disclosure. The word segmentation module 1 comprises an acquisition unit 11, a first determination unit 12 and a second determination unit 13.
An acquisition unit 11, configured to obtain the m event document sets, each of which comprises n_i event documents.
A first determination unit 12, configured to merge the m event document sets to obtain a document text set.
A second determination unit 13, configured to perform word segmentation on each event document in the document text set to obtain the n_j word segmentation results of each event document.
Fig. 14 schematically shows a block diagram of the structure of the determination module 2 according to an embodiment of the present disclosure. The determination module 2 comprises a first calculation unit 21 and a third determination unit 22.
A first calculation unit 21, configured to calculate the weight value of each word segmentation result of each event document.
A third determination unit 22, configured to compare the weight value with a preset weight threshold and to take the word segmentation result corresponding to a weight value meeting the weight threshold as a keyword of the event document.
Fig. 15 schematically shows a block diagram of the calculation module 3 according to an embodiment of the present disclosure. Of two event documents from two different event document sets, the one that includes a keywords is taken as the first document and the one that includes b keywords is taken as the second document, a and b being integers greater than or equal to 1. The calculation module 3 comprises a second calculation unit 31, a summation unit 32 and a fourth determination unit 33.
A second calculation unit 31, configured to calculate, when a is greater than or equal to b, the word contribution between each keyword in the first document and each of the b keywords in the second document.
A summation unit 32, configured to take, for each of the a keywords, the largest of its b word contributions and to sum these maxima.
A fourth determination unit 33, configured to average the sum and to take the result as the sentence similarity of the two event documents from the two different event document sets.
Fig. 16 schematically shows a block diagram of the second calculation unit 31 according to an embodiment of the present disclosure. The second calculation unit 31 includes a first determination element 311, a second determination element 312, and a third determination element 313.
A first determination element 311, configured to take, when a keyword in the first document is the same as a keyword in the second document, the word contribution between the keywords as the product of the keywords' weight values.
A second determination element 312, configured to take, when a keyword in the first document differs from a keyword in the second document and the cosine similarity between them can be queried in the pre-training dictionary, the word contribution between the keywords as the product of the cosine similarity and the keywords' weight values.
A third determination element 313, configured to set, when a keyword in the first document differs from a keyword in the second document and the cosine similarity between them cannot be queried in the pre-training dictionary, the word contribution between the keywords to 0.
Fig. 17 schematically shows a block diagram of the construction module 4 according to an embodiment of the present disclosure. The construction module 4 comprises a first construction unit 41, an expansion unit 42, a first loop unit 43 and a second loop unit 44.
A first construction unit 41, configured to traverse all the sentence similarities, sort those greater than 0 in descending order, take the first-ranked sentence similarity as a connecting edge and its two corresponding event documents as nodes, and so construct the knowledge graph.
An expansion unit 42, configured to sort, in descending order, the sentence similarities greater than 0 that involve the edge nodes of the knowledge graph, take the first-ranked sentence similarity as a connecting edge and the other event document corresponding to it as a new edge node, and so expand the knowledge graph.
A first loop unit 43, configured to execute the expansion step cyclically until no new edge node can be found, the resulting knowledge graph serving as a document knowledge graph based on the same event.
A second loop unit 44, configured to execute the above steps cyclically until every sentence similarity greater than 0 has been used as a connecting edge in a document knowledge graph.
Fig. 18 schematically shows a block diagram of the construction module 4 according to an embodiment of the present disclosure. The construction module 4 further comprises a second construction unit 45.
A second construction unit 45, configured to construct a document knowledge graph taking the event documents whose sentence similarities equal 0 as nodes.
The knowledge-graph-based document classification apparatus 10 of the embodiment of the present disclosure performs word segmentation on each of the obtained m×n_i event documents to obtain the n_j word segmentation results of each event document; determines the keywords of each event document according to the word segmentation results; calculates, according to the keywords, the sentence similarity of every two event documents from two different event document sets; and constructs, according to the sentence similarities, document knowledge graphs each based on the same event. In this way, a plurality of event documents covering a plurality of business matters can be organized into document knowledge graphs corresponding one-to-one to those business matters, so that a workflow requiring the cooperation of several different departments can circulate in a unified format, improving the working efficiency of each department or business unit, accelerating the work schedule, and facilitating management by the enterprise's managers.
In addition, the knowledge-graph-based document classification method requires neither a specified number of classification types nor an additional learning process and can classify event documents according to semantic information without supervision, giving it strong functionality; it needs no expensive special-purpose computing equipment and can realize its functions quickly on an ordinary terminal, making it easy to deploy and highly practical; and it can be conveniently integrated and embedded into current mainstream business processing flows and frameworks, making it easy to popularize.
In addition, according to the embodiment of the present disclosure, any plurality of the word segmentation module 1, the determination module 2, the calculation module 3, and the construction module 4 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module.
According to an embodiment of the present disclosure, at least one of the word segmentation module 1, the determination module 2, the calculation module 3 and the construction module 4 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of three implementations of software, hardware and firmware, or in a suitable combination of any of them.
Alternatively, at least one of the segmentation module 1, the determination module 2, the calculation module 3 and the construction module 4 may be at least partially implemented as a computer program module, which, when executed, may perform a corresponding function.
Fig. 19 schematically shows a block diagram of an electronic device adapted to implement the above method according to an embodiment of the present disclosure.
As shown in fig. 19, an electronic apparatus 900 according to an embodiment of the present disclosure includes a processor 901 which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage portion 908 into a Random Access Memory (RAM) 903. Processor 901 may comprise, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 901 may also include on-board memory for caching purposes. The processor 901 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 903, various programs and data necessary for the operation of the electronic apparatus 900 are stored. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. The processor 901 performs various operations of the method flows according to the embodiments of the present disclosure by executing programs in the ROM 902 and/or the RAM 903. Note that the programs may also be stored in one or more memories other than the ROM 902 and the RAM 903. The processor 901 may also perform various operations of the method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 900 may also include an input/output (I/O) interface 905 that is likewise connected to the bus 904. The electronic device 900 may also include one or more of the following components connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output portion 907 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 908 including a hard disk and the like; and a communication portion 909 including a network interface card such as a LAN card or a modem. The communication portion 909 performs communication processing via a network such as the Internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage portion 908 as needed.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, a computer-readable storage medium may include the ROM 902 and/or the RAM 903 described above and/or one or more memories other than the ROM 902 and the RAM 903.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method illustrated in the flow chart. The program code is for causing a computer system to perform the methods of the embodiments of the disclosure when the computer program product is run on the computer system.
The computer program performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure when executed by the processor 901. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted in the form of a signal over a network medium, distributed, and downloaded and installed via the communication section 909 and/or installed from the removable medium 911. The computer program containing program code may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure can be made, even if such combinations or sub-combinations are not expressly recited in the present disclosure. In particular, various combinations and/or sub-combinations of the features recited in the various embodiments and/or claims of the present disclosure may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or sub-combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the disclosure, and these alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (11)

1. A method for classifying documents based on a knowledge graph is characterized by comprising the following steps:
performing word segmentation on each of the obtained m×n_i event documents to obtain the n_j word segmentation results of each event document, wherein the m×n_i event documents come from m event document sets, each event document set comprises n_i event documents, m is an integer greater than or equal to 1, n_i is an integer greater than or equal to 0, i is an integer greater than or equal to 1 and less than or equal to m, n_j is an integer greater than or equal to 1, and j is an integer greater than or equal to 1 and less than or equal to m×n_i;
determining a keyword of each event document according to the word segmentation result;
calculating sentence similarity of every two event documents from two different event document sets according to the keywords; and
respectively constructing document knowledge graphs based on the same event according to the sentence similarities.
2. The method according to claim 1, wherein performing word segmentation on the obtained m×n_i event documents to obtain the n_j word segmentation results of each event document comprises:
obtaining the m event document sets, wherein each event document set comprises n_i event documents;
merging the m event document sets to obtain a document text set; and
performing word segmentation on each event document in the document text set to obtain the n_j word segmentation results of each event document.
3. The method of claim 1, wherein determining the keywords of each event document according to the word segmentation result comprises:
calculating a weight value for each word segmentation result of each event document; and
comparing the weight value with a preset weight threshold, and taking the word segmentation result corresponding to a weight value meeting the weight threshold as a keyword of the event document.
4. The method according to claim 3, wherein, of two event documents from two different event document sets, the one including a keywords is taken as a first document and the one including b keywords is taken as a second document, a and b being integers greater than or equal to 1, and wherein calculating the sentence similarity of every two event documents from two different event document sets according to the keywords comprises:
when a is greater than or equal to b, respectively calculating the word contribution between each keyword in the first document and each of the b keywords in the second document;
taking, for each of the a keywords, the largest of its b word contributions, and summing these maxima; and
averaging the sum, and taking the average as the sentence similarity of the two event documents from the two different event document sets.
5. The method of claim 4, wherein calculating a word contribution between the keywords in the first document and the keywords in the second document comprises:
when the keywords in the first document are the same as the keywords in the second document, the word contribution degree between the keywords is the product of the weighted values of the keywords;
when the key words in the first document are different from the key words in the second document and the cosine similarity among the key words can be inquired in a pre-training dictionary, the word contribution degree among the key words is the product of the cosine similarity and the weight value of the key words; and
and when the key words in the first document are different from the key words in the second document and the cosine similarity between the key words cannot be inquired in the pre-training dictionary, the word contribution degree between the key words is 0.
6. The method according to claim 1, wherein the respectively constructing document knowledge graphs based on the same event according to the sentence similarities comprises:
step one, traversing all the sentence similarities, sorting those greater than 0 in descending order, taking the first-ranked sentence similarity as a connecting edge, and taking the two event documents corresponding to it as nodes to construct a knowledge graph;
step two, sorting, in descending order, the sentence similarities greater than 0 that involve the edge nodes of the knowledge graph, taking the first-ranked sentence similarity as a connecting edge and the other event document corresponding to it as a new edge node, and expanding the knowledge graph;
step three, cyclically executing step two until no new edge node can be found, and taking the knowledge graph of step two as a document knowledge graph based on the same event; and
step four, cyclically executing steps one to three until all sentence similarities greater than 0 have been used as connecting edges in a document knowledge graph.
7. The method according to claim 6, wherein the respectively constructing document knowledge graphs based on the same event according to the sentence similarities further comprises:
step five, constructing a document knowledge graph by taking the event documents corresponding to sentence similarities equal to 0 as nodes.
8. A knowledge-graph-based document classification apparatus, comprising:
a word segmentation module, configured to perform word segmentation on each of the obtained m×n_i event documents to obtain the n_j word segmentation results of each event document, wherein the m×n_i event documents come from m event document sets, each event document set comprises n_i event documents, m is an integer greater than or equal to 1, n_i is an integer greater than or equal to 0, i is an integer greater than or equal to 1 and less than or equal to m, n_j is an integer greater than or equal to 1, and j is an integer greater than or equal to 1 and less than or equal to m×n_i;
a determination module, configured to determine the keywords of each event document according to the word segmentation results;
a calculation module, configured to calculate, according to the keywords, the sentence similarity of every two event documents from two different event document sets; and
a construction module, configured to construct, according to the sentence similarities, document knowledge graphs each based on the same event.
9. An electronic device, comprising:
one or more processors;
one or more memories for storing executable instructions that, when executed by the processor, implement the method of any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the storage medium has stored thereon executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 7.
11. A computer program product comprising a computer program comprising one or more executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
CN202211068142.2A 2022-08-31 2022-08-31 Document classification method, apparatus, electronic device, medium, and computer program product Pending CN115422358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211068142.2A CN115422358A (en) 2022-08-31 2022-08-31 Document classification method, apparatus, electronic device, medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211068142.2A CN115422358A (en) 2022-08-31 2022-08-31 Document classification method, apparatus, electronic device, medium, and computer program product

Publications (1)

Publication Number Publication Date
CN115422358A true CN115422358A (en) 2022-12-02

Family

ID=84203048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211068142.2A Pending CN115422358A (en) 2022-08-31 2022-08-31 Document classification method, apparatus, electronic device, medium, and computer program product

Country Status (1)

Country Link
CN (1) CN115422358A (en)

Similar Documents

Publication Publication Date Title
US20210397980A1 (en) Information recommendation method and apparatus, electronic device, and readable storage medium
US20170235820A1 (en) System and engine for seeded clustering of news events
US20200151155A1 (en) Classifying an unmanaged dataset
US8245135B2 (en) Producing a visual summarization of text documents
CN110489558B (en) Article aggregation method and device, medium and computing equipment
US20160350294A1 (en) Method and system for peer detection
US11023503B2 (en) Suggesting text in an electronic document
US20120303637A1 (en) Automatic wod-cloud generation
US11361030B2 (en) Positive/negative facet identification in similar documents to search context
US10755332B2 (en) Multi-perceptual similarity detection and resolution
US11182540B2 (en) Passively suggesting text in an electronic document
CA2956627A1 (en) System and engine for seeded clustering of news events
US20160085848A1 (en) Content classification
CN116109373A (en) Recommendation method and device for financial products, electronic equipment and medium
US11048711B1 (en) System and method for automated classification of structured property description extracted from data source using numeric representation and keyword search
KR20190081622A (en) Method for determining similarity and apparatus using the same
CN115329207B (en) Intelligent sales information recommendation method and system
CN115329083A (en) Document classification method and device, computer equipment and storage medium
CN115422358A (en) Document classification method, apparatus, electronic device, medium, and computer program product
CN110837525B (en) Data processing method and device, electronic equipment and computer readable medium
CN117056392A (en) Big data retrieval service system and method based on dynamic hypergraph technology
CN113672705A (en) Resume screening method, apparatus, device, medium and program product
Hong et al. An efficient tag recommendation method using topic modeling approaches
Si et al. A conditional random field model for name disambiguation in national natural science foundation of china fund
CN113177116B (en) Information display method and device, electronic equipment, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination