WO2021114810A1 - Graph structure-based official document recommendation method, apparatus, computer device, and medium - Google Patents

Graph structure-based official document recommendation method, apparatus, computer device, and medium

Info

Publication number
WO2021114810A1
WO2021114810A1 PCT/CN2020/116744 CN2020116744W WO2021114810A1 WO 2021114810 A1 WO2021114810 A1 WO 2021114810A1 CN 2020116744 W CN2020116744 W CN 2020116744W WO 2021114810 A1 WO2021114810 A1 WO 2021114810A1
Authority
WO
WIPO (PCT)
Prior art keywords
official document
official
document
topic
preset
Prior art date
Application number
PCT/CN2020/116744
Other languages
French (fr)
Chinese (zh)
Inventor
谢静文
阮晓雯
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021114810A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to data analysis in the field of big data, and in particular to a graph structure-based method, apparatus, computer device, and medium for recommending official documents.
  • A graph structure-based official document recommendation method, including:
  • selecting, according to the text topic-keyword distribution probability matrix, the text topics whose selection probability is greater than or equal to the preset probability, and recording the selected text topics as the topic tags of the corresponding official documents;
  • the text topic-keyword distribution probability matrix includes a plurality of selection probabilities, where a selection probability is the probability that a keyword in the official document belongs to a text topic of the official document;
  • the official document recommendation library contains multiple graph structures; one graph structure corresponds to the official documents of at least one official document type and contains multiple interconnected nodes; each node represents one of the record data, the keyword tags, and the topic tags;
  • receiving the search content input by the user from the official document recommendation library, and outputting the target official documents in order of the similarity calculated by SimRank, where the similarity is the similarity between the search content and a node.
  • A graph structure-based official document recommendation apparatus, including:
  • a first recording module, configured to obtain multiple official documents of different official document types, determine the characteristic words in the obtained official documents through TF-IDF according to preset word statistical characteristics, filter out through TF-IDF the characteristic words whose occurrence frequency is greater than or equal to a preset frequency, and record the filtered characteristic words as the keyword tags of the corresponding official documents;
  • a second recording module, configured to input the official document into a preset LDA topic model, calculate the text topic-keyword distribution probability matrix of the official document through the LDA topic model, obtain the text topics that the LDA topic model selects from that matrix because their selection probability is greater than or equal to the preset probability, and record the selected text topics as the topic tags of the corresponding official documents; the text topic-keyword distribution probability matrix includes a plurality of selection probabilities, where a selection probability is the probability that a keyword in the official document belongs to a text topic of the official document;
  • a first generation module, configured to generate official document attributes according to the keyword tags and the topic tags;
  • an establishment module, configured to obtain the record data of the official documents according to each official document type, and to establish a graph structure-based official document recommendation library through the Neo4j framework according to the record data and the official document attributes; the official document recommendation library contains multiple graph structures, one graph structure corresponds to the official documents of at least one official document type and contains multiple interconnected nodes, and each node represents one of the record data, the keyword tags, and the topic tags;
  • a calculation module, configured to receive the search content input by the user from the official document recommendation library, and to output the target official documents in order of the similarity calculated by SimRank, where the similarity is the similarity between the search content and a node.
  • A computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor; the processor implements the following steps when executing the computer-readable instructions:
  • selecting, according to the text topic-keyword distribution probability matrix, the text topics whose selection probability is greater than or equal to the preset probability, and recording the selected text topics as the topic tags of the corresponding official documents;
  • the text topic-keyword distribution probability matrix includes a plurality of selection probabilities, where a selection probability is the probability that a keyword in the official document belongs to a text topic of the official document;
  • the official document recommendation library contains multiple graph structures; one graph structure corresponds to the official documents of at least one official document type and contains multiple interconnected nodes; each node represents one of the record data, the keyword tags, and the topic tags;
  • receiving the search content input by the user from the official document recommendation library, and outputting the target official documents in order of the similarity calculated by SimRank, where the similarity is the similarity between the search content and a node.
  • One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to execute the following steps:
  • selecting, according to the text topic-keyword distribution probability matrix, the text topics whose selection probability is greater than or equal to the preset probability, and recording the selected text topics as the topic tags of the corresponding official documents;
  • the text topic-keyword distribution probability matrix includes a plurality of selection probabilities, where a selection probability is the probability that a keyword in the official document belongs to a text topic of the official document;
  • the official document recommendation library contains multiple graph structures; one graph structure corresponds to the official documents of at least one official document type and contains multiple interconnected nodes; each node represents one of the record data, the keyword tags, and the topic tags;
  • receiving the search content input by the user from the official document recommendation library, and outputting the target official documents in order of the similarity calculated by SimRank, where the similarity is the similarity between the search content and a node.
  • Keyword tags given by TF-IDF are relatively objective. Because they are obtained by a statistical method, they are considered comprehensively and have a low error rate, and because the number of keyword tags is controllable, the keyword tags can be kept relatively rich. Topic tags given by the LDA topic model are also relatively objective: the text topic corresponding to each keyword is obtained by model calculation, so the resulting topic tags are likewise comprehensive and have a low error rate. SimRank calculates the similarity between the search content entered by the user and the nodes; because SimRank combines features from the text of a variety of official documents, more relevant target official documents can be recommended, which improves the accuracy and efficiency of the recommendation. The similarity between objects measured by SimRank is closer to human intuitive judgment, and determining the output order of the target official documents by similarity improves the user experience.
  • FIG. 1 is a schematic diagram of an application environment of an official document recommendation method based on a graph structure in an embodiment of the present application
  • FIG. 2 is a flowchart of a method for recommending official documents based on a graph structure in an embodiment of the present application
  • FIG. 3 is a schematic diagram of the structure of an official document recommendation device based on a graph structure in an embodiment of the present application
  • Fig. 4 is a schematic diagram of a computer device in an embodiment of the present application.
  • the method for recommending official documents based on the graph structure provided in this application can be applied to the application environment as shown in FIG. 1, in which the client communicates with the server through the network.
  • the client can be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented as an independent server or a server cluster composed of multiple servers.
  • a method for recommending an official document based on a graph structure is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
  • TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical method: TF is the frequency with which a characteristic word occurs in an official document, and IDF reflects how frequently the characteristic word occurs in other official documents.
  • Characteristic words generally refer to words that can represent the content of the official document. Words such as personal pronouns, modal auxiliary words, and conjunctions are generally not characteristic words, whereas specific executive verbs in official documents can be.
  • The word statistical characteristics can be set as needed to determine the characteristic words obtained from the official document; in this embodiment they include the characteristics of a variety of characteristic words, such as the verb characteristics corresponding to executive verbs. The more frequently a characteristic word occurs, the more representative and important it is in the official document.
  • The preset frequency can be set according to the application field, because some characteristic words are biased toward a particular field. In the official document field of this embodiment, the preset frequency can be set so that about 10 characteristic words are selected; the specific number can be determined as required.
  • In this embodiment, the keyword tags given by TF-IDF are relatively objective. Because they are obtained by a statistical method, they are considered comprehensively and have a low error rate, and because their number is controllable, the keyword tags can be kept relatively rich.
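The keyword-tag step can be sketched as follows. This is a minimal illustration, not the implementation described in the application: it assumes the documents are already tokenized, uses the common formulas tf = count / document length and idf = log(N / document frequency), and applies the preset threshold to the TF-IDF score. The names `tfidf_keyword_tags`, `preset_threshold`, and `max_tags` are hypothetical.

```python
import math
from collections import Counter


def tfidf_keyword_tags(docs, stop_words, preset_threshold, max_tags=10):
    """Return {doc_id: [keyword tags]} for pre-tokenized documents.

    docs: dict mapping a document id to a list of tokens.
    stop_words: words that are never characteristic words (assumption:
    pronouns, conjunctions, and similar words are filtered this way).
    """
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens) - stop_words)

    n_docs = len(docs)
    tags = {}
    for doc_id, tokens in docs.items():
        words = [t for t in tokens if t not in stop_words]
        tf = Counter(words)
        scores = {
            w: (count / len(words)) * math.log(n_docs / df[w])
            for w, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
        kept = [w for w, score in ranked if score >= preset_threshold]
        tags[doc_id] = kept[:max_tags]
    return tags
```

The `max_tags` cap mirrors the remark that about 10 characteristic words are kept per official document; in practice the threshold and the cap would both be tuned for the application field.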
  • The LDA topic model is a document topic generation model and a three-layer Bayesian probability model.
  • The model can extract the text topic-keyword distribution probability matrix from the official document. (The topic-keyword distribution matrix is calculated from the between-class scatter matrix S_B and the within-class scatter matrix S_W and is used as the feature matrix W; multiplying the word embedding feature of the official document by the matrix W yields the text topic-keyword distribution probability matrix of the official document. The word embedding feature is the text feature obtained after applying word embedding to the official document, and word embedding is a method of converting the keywords in the official document into numeric vectors.)
  • Specifically, the LDA topic model calculates the distribution probability that each keyword belongs to any one of the topics, and this distribution probability is used as the selection probability; one selection probability represents the probability that a keyword is associated with a text topic of the official document. The LDA topic model then compares each selection probability with the preset probability and outputs the text topics whose selection probability is greater than or equal to the preset probability. In the official document field of this embodiment, the preset probability can be set so that about 3 text topics are selected; the specific number is determined as required.
  • In this embodiment, the topic tags given by the LDA topic model are relatively objective, and the text topic corresponding to each keyword is obtained by model calculation, so the resulting topic tags are considered comprehensively and have a low error rate.
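The thresholding of selection probabilities can be sketched as follows. The sketch assumes the text topic-keyword distribution probability matrix has already been produced by some topic model and is given as a nested dictionary; it does not train an LDA model. Taking the maximum probability over a document's keywords as the topic's selection probability is an assumption made for illustration, and `select_topic_tags` is a hypothetical name.

```python
def select_topic_tags(topic_keyword_probs, doc_keywords,
                      preset_probability, max_topics=3):
    """Pick the topic tags for one official document.

    topic_keyword_probs: {topic: {keyword: probability}}, the text
    topic-keyword distribution probability matrix in dictionary form.
    doc_keywords: the keywords of the official document.
    """
    selected = []
    for topic, probs in topic_keyword_probs.items():
        # Assumption: a topic's selection probability is the highest
        # probability that any of the document's keywords belongs to it.
        best = max((probs.get(k, 0.0) for k in doc_keywords), default=0.0)
        if best >= preset_probability:
            selected.append((topic, best))
    # Most probable topic first; ties are broken alphabetically.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return [topic for topic, _ in selected[:max_topics]]
```

The `max_topics` default of 3 follows the remark that about 3 text topics are kept per official document.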
  • The official document attributes represent the key attributes of the official document and include the keyword tags, the topic tags, digital entities, the communication time of the official document, the communication unit, and so on.
  • The record data is the data corresponding to each official document that is formed for the database. One piece of record data can correspond to one type of official document, and the record data may include the overall content data of the official document.
  • The Neo4j framework is a high-performance NoSQL graph database that stores structured data in a network (a graph) rather than in tables; it can also be regarded as a high-performance graph engine. This embodiment therefore uses the Neo4j framework to establish a graph structure-based official document recommendation library from the record data and the official document attributes.
  • The graph structure-based official document recommendation library includes multiple graph structures. Each graph structure can contain multiple nodes and can correspond to the official documents of at least one official document type. For example, official document A is a node that is associated with the node for the keyword tag "personnel transfer to a certain department" and with the node for the topic tag "personnel change", forming the graph structure "personnel transfer to a certain department - official document A - personnel change".
  • In this embodiment, an official document recommendation library containing multiple graph structures works more efficiently than a traditional database or a traditional search engine, and its recommendation efficiency is not affected when a large number of official documents are stored in it.
  • S50: Receive the search content input by the user from the official document recommendation library, and output the target official documents in order of the similarity calculated by SimRank; the similarity is the similarity between the search content and a node.
  • SimRank is a model that measures the similarity between any two objects based on the topological structure of the graph; it can also be understood as a calculation method. The model or method can be embedded in the official document recommendation library.
  • Specifically, the search content input by the user is obtained from the input interface of the official document recommendation library. The search content can be multiple target keywords or official document names, which are used as search nodes.
  • SimRank then calculates the similarity between the search node and each node in the graph structure. For example, official document A and the search node have 6 nodes in total, of which 2 nodes are shared and 4 nodes are similar.
  • In this embodiment, SimRank calculates the similarity between the search content and the nodes. Because SimRank combines features from the text of a variety of official documents (the record data and official document attributes described above), target official documents with higher relevance can be recommended, which improves the accuracy and efficiency of the recommendation. The similarity between objects measured by SimRank is closer to human intuitive judgment, and determining the output order of the target official documents by similarity improves the user experience.
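The ranking in step S50 can be sketched with the standard iterative SimRank recurrence on an undirected graph: s(a, a) = 1, and for a ≠ b, s(a, b) = C / (|N(a)| |N(b)|) × Σ s(i, j) over neighbors i of a and j of b. This is an illustration under stated assumptions, not the application's implementation: the graph is a plain in-memory adjacency dictionary rather than a Neo4j store, the search content is attached as a temporary node named `__query__`, and the decay factor C = 0.8 and the example node names are hypothetical.

```python
from itertools import product


def simrank(neighbors, decay=0.8, iterations=10):
    """Iterative SimRank on an undirected graph {node: set(neighbors)}."""
    nodes = sorted(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
            elif not neighbors[a] or not neighbors[b]:
                new[(a, b)] = 0.0
            else:
                total = sum(sim[(i, j)]
                            for i in neighbors[a] for j in neighbors[b])
                new[(a, b)] = decay * total / (
                    len(neighbors[a]) * len(neighbors[b]))
        sim = new
    return sim


def rank_documents(neighbors, query_terms, documents,
                   decay=0.8, iterations=10):
    """Order documents by SimRank similarity to a temporary search node."""
    # Copy the graph so the stored structure is not modified.
    graph = {node: set(adj) for node, adj in neighbors.items()}
    graph["__query__"] = set()
    for term in query_terms:
        if term in graph:  # only terms that already exist as nodes
            graph["__query__"].add(term)
            graph[term].add("__query__")
    sim = simrank(graph, decay, iterations)
    return sorted(documents, key=lambda d: (-sim[("__query__", d)], d))


# Hypothetical example graph: three documents and their tag nodes.
example_graph = {
    "A": {"personnel transfer", "personnel change"},
    "B": {"project budget", "project investment"},
    "C": {"personnel change", "project budget"},
    "personnel transfer": {"A"},
    "personnel change": {"A", "C"},
    "project budget": {"B", "C"},
    "project investment": {"B"},
}
```

Calling `rank_documents(example_graph, ["personnel transfer", "personnel change"], ["A", "B", "C"])` places document A first, because it shares both tag nodes with the search node. The all-pairs computation is quadratic in the number of nodes, so a large library would need a more scalable variant.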
  • the method further includes:
  • the overall chapter structure refers to the constituent structures of the official document, and the analysis result is the result of judging the completeness and rationality of each constituent structure of the official document;
  • when the analysis result is that one of the constituent structures lacks completeness and/or rationality, the missing constituent structure and/or the unreasonable constituent structure is extracted from the official document and marked in the official document in highlighted form, and a preset data recipient is asked to modify the official document.
  • The BERT model is a language representation model that can be used to analyze the overall chapter structure and length of the official document.
  • The BERT model is trained as follows. First, each sentence corresponding to a constituent structure of the training official documents is labeled: the sentences of the corresponding paragraphs in the overall chapter structure of the training text are labeled 1-B, 2-B, 3-B, 1-I, 2-I, and 3-I, where 1 represents the beginning, 2 represents the discussion, and 3 represents the end; the beginning, the discussion, and the end are the constituent structures of the official document.
  • The BERT model is then built. Before training, the existing word vectors in the BERT model can be fine-tuned continually according to the successfully labeled sentences in the official documents so that the word vector distribution becomes more reasonable. (The pre-trained word vectors currently provided with the BERT model are trained on the Chinese corpus as a whole, so their distribution differs from the word vector distribution in the official document field; the BERT model therefore needs to be fine-tuned to suit that field.)
  • After all the word vectors are trained, so that the output of the BERT model captures the essence of the language, the output at the [CLS] position (which holds a high-level feature vector containing the semantic information of the entire sentence) can be used as the classification result for the constituent structure of the official document, where one category represents one constituent structure.
  • This embodiment further revises the classification results output by the BERT model to resolve jumping constituent structures, such as 1-B, 1-I, 3-B, 2-B, 2-I, 3-B; the revision mainly adjusts the position of each constituent structure.
  • The classification result is output as the probabilities corresponding to the constituent structure categories of the official document. After each probability is compared with its preset threshold (mainly to detect constituent structures missing from the official document), it can be determined whether the sentences corresponding to the constituent structure of that category are complete and/or reasonable.
  • In this embodiment, the BERT model identifies the constituent structures of the official document on its own. It is convenient to use, is not affected by the length of the official document, and can disassemble the official document structurally. It also generalizes well to different types of official documents. The analysis results it outputs are obtained by analyzing the constituent structures of the entire official document in multiple dimensions, and they can also be used to further analyze the number of idioms used in each constituent structure and the distribution of length.
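The completeness and "jumping structure" checks can be sketched on the label sequence alone. This sketch does not run a BERT model: it assumes each sentence already carries one predicted label in the 1-B / 1-I / 2-B / ... scheme described above, whereas the text describes probabilities compared with preset thresholds. The function name `analyze_structure` and the returned dictionary keys are hypothetical.

```python
# Label prefix -> constituent structure, as described in the text.
STRUCTURES = {1: "beginning", 2: "discussion", 3: "end"}


def analyze_structure(labels):
    """Report missing structures and out-of-order (jumping) positions.

    labels: per-sentence labels such as "1-B" or "2-I", in document order.
    """
    order = [int(label.split("-")[0]) for label in labels]
    present = set(order)
    missing = [STRUCTURES[k] for k in sorted(STRUCTURES) if k not in present]
    # A jump is a sentence whose structure number is lower than that of
    # the sentence before it, e.g. a "discussion" sentence after "end".
    jumps = [i for i in range(1, len(order)) if order[i] < order[i - 1]]
    return {"missing": missing, "jump_positions": jumps}
```

For the jumping example from the text, `analyze_structure(["1-B", "1-I", "3-B", "2-B", "2-I", "3-B"])` reports no missing structure and a jump at position 3, which is where a revision step would need to adjust the sentence's position.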
  • the official document attribute further includes a digital entity; before the establishment of the official document recommendation database based on the graph structure according to the record data of the official document and the official document attribute through the Neo4j framework, the method further includes:
  • The target entity expression in the preset rule template is used to locate the target position of the digital entity; it generally has an association relationship with the digital entity, for example "the estimated total investment amount of the project is".
  • The capture rule expression in the preset rule template is used to capture the digital entity, for example the digital entity 10,000 yuan.
  • In this embodiment, extracting digital entities with a preset rule template improves the efficiency and quality of the capture.
  • The same method also applies to capturing the communication time of the official document and the communication unit.
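A rule template of this kind can be sketched with a regular expression. The sketch assumes the target entity expression is a literal phrase and that the capture rule expression is a regular expression for an amount followed by a unit; the default capture rule, the function name `extract_digital_entity`, and the English example sentence are hypothetical.

```python
import re


def extract_digital_entity(
    text,
    target_expression,
    capture_rule=r"\s*(\d[\d,]*(?:\.\d+)?)\s*(yuan)",
):
    """Locate the target expression, then capture the amount after it.

    Returns (amount, unit), or None when nothing matches.
    """
    # The target entity expression is matched literally; the capture rule
    # then applies at the position immediately after it.
    pattern = re.compile(re.escape(target_expression) + capture_rule,
                         re.IGNORECASE)
    match = pattern.search(text)
    if match is None:
        return None
    amount = float(match.group(1).replace(",", ""))
    return amount, match.group(2)
```

For the sentence "The estimated total investment amount of the project is 10,000 yuan." and the target expression "the estimated total investment amount of the project is", the function returns `(10000.0, "yuan")`. A different capture rule (for a date or an organization name) would cover the communication time and communication unit in the same way.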
  • the attributes of the official document further include the time of the official document and the unit of the document; before the establishment of the official document recommendation database based on the graph structure based on the record data of the official document and the attribute of the official document through the Neo4j framework, the method further includes:
  • the generating of official document attributes according to the keyword tags and the topic tags includes:
  • the official document attribute is generated according to the time of the official document communication, the communication unit, the keyword tag, and the topic tag.
  • The NLP model is a natural language processing algorithm engine.
  • Through the NLP model, the required components can be identified, and then the content corresponding to each component can be identified. One component corresponds to one piece of content: for example, the time component corresponds to the communication time of the official document, and the unit component corresponds to the communication unit.
  • The establishing of the graph structure-based official document recommendation library through the Neo4j framework according to the record data of the official document and the official document attributes includes:
  • the path statements in the Neo4j framework are used to determine the paths of all the nodes according to the connection relationships, and the graph structure-based official document recommendation library is established.
  • A create-node statement is used to build a node, such as node A; a create-relationship statement is used to build a connection relationship between nodes, such as node A - node B; a path statement is used to determine all paths or the shortest path between two nodes, such as node A - node B and node A - node C.
  • This embodiment establishes the graph structure-based official document recommendation library mainly by using the execution statements of the Neo4j framework.
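The three kinds of statements can be sketched as parameterized Cypher text built in Python. The sketch only constructs the statements and their parameters; it does not connect to a database or call any driver. The node labels (`Document`, `KeywordTag`), the relationship type (`HAS_KEYWORD`), and the `name` property are hypothetical choices, not names taken from the application.

```python
def create_node_statement(label, name):
    """Cypher for a create-node statement, e.g. node A."""
    if not label.isidentifier():
        # Labels cannot be passed as parameters, so they are validated.
        raise ValueError(f"invalid label: {label!r}")
    return f"CREATE (n:{label} {{name: $name}})", {"name": name}


def create_relationship_statement(rel_type, source, target):
    """Cypher for a create-relationship statement, e.g. node A - node B."""
    if not rel_type.isidentifier():
        raise ValueError(f"invalid relationship type: {rel_type!r}")
    statement = (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"CREATE (a)-[:{rel_type}]->(b)"
    )
    return statement, {"source": source, "target": target}


def shortest_path_statement(source, target):
    """Cypher for a path statement returning the shortest path."""
    statement = (
        "MATCH (a {name: $source}), (b {name: $target}), "
        "p = shortestPath((a)-[*]-(b)) RETURN p"
    )
    return statement, {"source": source, "target": target}
```

Each function returns a `(statement, parameters)` pair, so the values are passed as query parameters rather than spliced into the query text; the pair would then be handed to whichever Neo4j client the deployment uses.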
  • the method further includes:
  • This embodiment compresses each target official document into a link point, where one link point can store the entire content of one target official document. This saves display resources for the target official documents and prevents the user experience from suffering when a user has to view a target official document with too much content.
  • In summary, the above provides a graph structure-based official document recommendation method. The keyword tags given by TF-IDF are relatively objective; because they are obtained by a statistical method, they are considered comprehensively and have a low error rate, and because their number is controllable, the keyword tags can be kept relatively rich. The topic tags given by the LDA topic model are also relatively objective; the text topic corresponding to each keyword is obtained by model calculation, so the resulting topic tags are likewise comprehensive and have a low error rate. SimRank calculates the similarity between the search content entered by the user and the nodes; because SimRank combines features from the text of a variety of official documents, more relevant target official documents can be recommended, which improves the accuracy and efficiency of the recommendation. The similarity between objects measured by SimRank is closer to human intuitive judgment, and determining the output order of the target official documents by similarity improves the user experience.
  • A graph structure-based official document recommendation apparatus includes a first recording module 11, a second recording module 12, a first generation module 13, an establishment module 14, and a calculation module 15.
  • The functional modules are described in detail as follows:
  • the first recording module 11 is configured to obtain multiple official documents of different official document types, determine the characteristic words in the obtained official documents through TF-IDF according to preset word statistical characteristics, filter out through TF-IDF the characteristic words whose occurrence frequency is greater than or equal to a preset frequency, and record the filtered characteristic words as the keyword tags of the corresponding official documents;
  • the second recording module 12 is configured to input the official document into a preset LDA topic model, calculate the text topic-keyword distribution probability matrix of the official document through the LDA topic model, obtain the text topics that the LDA topic model selects from that matrix because their selection probability is greater than or equal to the preset probability, and record the selected text topics as the topic tags of the corresponding official documents; the text topic-keyword distribution probability matrix contains a plurality of selection probabilities, where a selection probability is the probability that a keyword in the official document belongs to a text topic of the official document;
  • the first generation module 13 is configured to generate official document attributes according to the keyword tags and the topic tags;
  • the establishment module 14 is configured to obtain the record data of the official documents according to each official document type, and to establish a graph structure-based official document recommendation library through the Neo4j framework according to the record data and the official document attributes; the official document recommendation library contains multiple graph structures, one graph structure corresponds to the official documents of at least one official document type and contains multiple interconnected nodes, and each node represents one of the record data, the keyword tags, and the topic tags;
  • the calculation module 15 is configured to receive the search content input by the user from the official document recommendation library, and to output the target official documents in order of the similarity calculated by SimRank, where the similarity is the similarity between the search content and a node.
  • the apparatus for recommending official documents based on the graph structure further includes:
  • an analysis module, configured to analyze the overall chapter structure of the official document through the successfully trained BERT model to obtain an analysis result of the overall chapter structure; the overall chapter structure refers to the constituent structures of the official document, and the analysis result is the result of judging the completeness and rationality of each constituent structure;
  • a marking module, configured to, when the analysis result is that one of the constituent structures of the official document lacks completeness and/or rationality, extract the missing constituent structure and/or the unreasonable constituent structure from the official document, mark it in the official document in highlighted form, and ask a preset data recipient to modify the official document.
  • the apparatus for recommending official documents based on the graph structure further includes:
  • the grabbing module is used to search for a digital entity in the official document through the target entity expression in a preset rule template, locate the target position of the digital entity, and capture the digital entity from the target position through the capture rule expression in the preset rule template.
  • the apparatus for recommending official documents based on the graph structure further includes:
  • the identification module is used to obtain the official document content of the official document, and identify the official document communication time corresponding to the time component and the communication unit corresponding to the unit component from the official document content through the NLP model;
  • the generating of official document attributes according to the keyword tags and the topic tags includes:
  • the second generating module is configured to generate the attributes of the official document according to the time of the official document communication, the unit of the communication, the keyword tag, and the topic tag.
  • the establishment module includes:
  • the first construction sub-module is used to construct each node corresponding to the official document according to node attributes through the create-node statement in the Neo4j framework; the node attributes correspond to the record data and the official document attributes respectively;
  • the second construction sub-module is used to build a connection relationship between the nodes according to preset relationships through the create-relationship statement in the Neo4j framework; the preset relationships correspond to the record data and the official document attributes respectively;
  • the establishment sub-module is used to determine the paths of all the nodes according to the connection relationships through the path statement in the Neo4j framework, thereby completing the graph structure-based official document recommendation library.
  • the apparatus for recommending official documents based on the graph structure further includes:
  • the selection module is configured to compress the target official documents output in sequence according to the degree of similarity into link points, and when the user selects at least one of the link points, fully present the target document in a preset view.
  • the various modules in the above-mentioned apparatus for recommending official documents based on the graph structure can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 4.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
  • the database of the computer equipment is used to store the data involved in the method for recommending official documents based on the graph structure.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection. When the computer-readable instructions are executed by the processor, a method for recommending official documents based on a graph structure is realized.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage
  • one or more readable storage media storing computer readable instructions are provided.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage medium stores computer readable instructions, and when the computer readable instructions are executed by one or more processors, the one or more processors implement the steps of the method for recommending official documents based on the graph structure in the above-mentioned embodiment, for example, steps S10 to S50 shown in FIG. 2.
  • alternatively, when the processor executes the computer-readable instructions, the functions of the modules/units of the apparatus for recommending official documents based on the graph structure in the foregoing embodiment are implemented, for example, the functions of modules 11 to 15 shown in FIG. 3. To avoid repetition, details are not repeated here.
  • a computer-readable storage medium is provided, and computer-readable instructions are stored thereon.
  • the steps of the method for recommending official documents based on the graph structure in the foregoing embodiment are implemented, for example, Steps S10 to S50 shown in FIG. 2.
  • the functions of the modules/units of the apparatus for recommending official documents based on the graph structure in the foregoing embodiments are realized. To avoid repetition, details are not repeated here.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

The present application relates to the field of big data and discloses a graph structure-based official document recommendation method, apparatus, computer device, and medium. The method comprises: obtaining official documents of multiple types, filtering feature words according to TF-IDF, and recording the feature words as keyword tags of the corresponding official document; selecting, by means of a text topic-keyword distribution probability matrix of the official document, the text topics whose selection probability is greater than or equal to a preset probability, and recording the selected text topics as topic tags of the corresponding official document; generating official document attributes according to the keyword tags and topic tags; obtaining the record data of the official document, and establishing, by means of a Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes; and receiving search content entered by a user in the official document recommendation library, and outputting target official documents in descending order of the similarity calculated by SimRank. The present application can recommend to the user the target official document most relevant to the search content entered by the user.

Description

Method, apparatus, computer device, and medium for recommending official documents based on graph structure
This application claims priority to the Chinese patent application No. 202010475897.9, filed with the Chinese Patent Office on May 29, 2020 and entitled "Method, apparatus, computer device, and medium for recommending official documents based on graph structure", the entire content of which is incorporated herein by reference.
Technical field
This application relates to data analysis in the field of big data, and in particular to a method, apparatus, computer device, and medium for recommending official documents based on a graph structure.
Background
At present, most commonly used official document recommendation methods are based on traditional search engines, which usually recommend official documents according to document similarity so that documents highly relevant to the user can be recommended. However, the inventor realized that, in the prior art, document similarity is often determined by a single manually set standard. Because a manually set standard may be inaccurate, a traditional search engine considers too few factors when recommending official documents, cannot recommend the official document most relevant to the content input by the user, and thus degrades the user experience. Therefore, those skilled in the art urgently need a technical solution to the above problems.
Summary of the invention
In view of the above technical problems, it is necessary to provide a method, apparatus, computer device, and medium for recommending official documents based on a graph structure, which can recommend to the user the official document most relevant to the content input by the user and thereby improve the user experience.
A method for recommending official documents based on a graph structure, including:
Acquiring multiple official documents of different official document types, determining feature words in the acquired official documents according to TF-IDF based on preset word statistical features, filtering out, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency, and recording the filtered feature words as keyword tags of the corresponding official document;
Inputting the official document into a preset LDA topic model, calculating the text topic-keyword distribution probability matrix of the official document through the LDA topic model, then obtaining the text topics that the LDA topic model selects from that matrix because their selection probability is greater than or equal to a preset probability, and recording the selected text topics as topic tags of the corresponding official document; the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
Generating official document attributes according to the keyword tags and the topic tags;
Obtaining the record data of the official documents according to each official document type, and establishing, through the Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes; the official document recommendation library contains multiple graph structures, each graph structure corresponds to the official documents of at least one official document type, and each graph structure contains multiple interconnected nodes; each node represents one of the record data, a keyword tag, and a topic tag;
Receiving search content input by the user in the official document recommendation library, and outputting target official documents in descending order of the similarity calculated by SimRank; the similarity is the similarity between the search content and a node.
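The graph-building step above can be illustrated with a minimal sketch that only generates Cypher statements and their parameters, so it runs without a database. The node labels (`Document`, `Keyword`, `Topic`), the relationship types (`HAS_KEYWORD`, `HAS_TOPIC`), the property names, and the function `build_graph_statements` are all hypothetical names chosen for illustration; the application does not specify a schema. `MERGE` is used here instead of a plain create statement to avoid duplicate tag nodes, which is a design assumption rather than the application's stated approach.

```python
def build_graph_statements(doc_id, record, keyword_tags, topic_tags):
    """Return (cypher, params) pairs for one official document.

    Assumed schema: one Document node carrying the record data as
    properties, plus one node per keyword tag and per topic tag.
    """
    statements = [
        # Node for the document itself; record data become node properties.
        ("MERGE (d:Document {doc_id: $doc_id}) SET d += $record",
         {"doc_id": doc_id, "record": record}),
    ]
    for label, rel, names in (
        ("Keyword", "HAS_KEYWORD", keyword_tags),
        ("Topic", "HAS_TOPIC", topic_tags),
    ):
        for name in names:
            # Tag node plus the relationship linking it to the document.
            statements.append((
                f"MATCH (d:Document {{doc_id: $doc_id}}) "
                f"MERGE (t:{label} {{name: $name}}) "
                f"MERGE (d)-[:{rel}]->(t)",
                {"doc_id": doc_id, "name": name},
            ))
    return statements


stmts = build_graph_statements(
    "doc-1", {"type": "notice"}, ["report"], ["personnel"]
)
for cypher, params in stmts:
    print(cypher, params)
```

Each pair could then be passed to a Neo4j session for execution; keeping the values in parameters rather than in the query text avoids quoting problems with tag names.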
An apparatus for recommending official documents based on a graph structure, including:
The first recording module, configured to acquire multiple official documents of different official document types, determine feature words in the acquired official documents according to TF-IDF based on preset word statistical features, filter out, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency, and record the filtered feature words as keyword tags of the corresponding official document;
The second recording module, configured to input the official document into a preset LDA topic model, calculate the text topic-keyword distribution probability matrix of the official document through the LDA topic model, then obtain the text topics that the LDA topic model selects from that matrix because their selection probability is greater than or equal to a preset probability, and record the selected text topics as topic tags of the corresponding official document; the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
The first generation module, configured to generate official document attributes according to the keyword tags and the topic tags;
The establishment module, configured to obtain the record data of the official documents according to each official document type, and establish, through the Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes; the official document recommendation library contains multiple graph structures, each graph structure corresponds to the official documents of at least one official document type, and each graph structure contains multiple interconnected nodes; each node represents one of the record data, a keyword tag, and a topic tag;
The calculation module, configured to receive search content input by the user in the official document recommendation library, and output target official documents in descending order of the similarity calculated by SimRank; the similarity is the similarity between the search content and a node.
A computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
Acquiring multiple official documents of different official document types, determining feature words in the acquired official documents according to TF-IDF based on preset word statistical features, filtering out, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency, and recording the filtered feature words as keyword tags of the corresponding official document;
Inputting the official document into a preset LDA topic model, calculating the text topic-keyword distribution probability matrix of the official document through the LDA topic model, then obtaining the text topics that the LDA topic model selects from that matrix because their selection probability is greater than or equal to a preset probability, and recording the selected text topics as topic tags of the corresponding official document; the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
Generating official document attributes according to the keyword tags and the topic tags;
Obtaining the record data of the official documents according to each official document type, and establishing, through the Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes; the official document recommendation library contains multiple graph structures, each graph structure corresponds to the official documents of at least one official document type, and each graph structure contains multiple interconnected nodes; each node represents one of the record data, a keyword tag, and a topic tag;
Receiving search content input by the user in the official document recommendation library, and outputting target official documents in descending order of the similarity calculated by SimRank; the similarity is the similarity between the search content and a node.
One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
Acquiring multiple official documents of different official document types, determining feature words in the acquired official documents according to TF-IDF based on preset word statistical features, filtering out, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency, and recording the filtered feature words as keyword tags of the corresponding official document;
Inputting the official document into a preset LDA topic model, calculating the text topic-keyword distribution probability matrix of the official document through the LDA topic model, then obtaining the text topics that the LDA topic model selects from that matrix because their selection probability is greater than or equal to a preset probability, and recording the selected text topics as topic tags of the corresponding official document; the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
Generating official document attributes according to the keyword tags and the topic tags;
Obtaining the record data of the official documents according to each official document type, and establishing, through the Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes; the official document recommendation library contains multiple graph structures, each graph structure corresponds to the official documents of at least one official document type, and each graph structure contains multiple interconnected nodes; each node represents one of the record data, a keyword tag, and a topic tag;
Receiving search content input by the user in the official document recommendation library, and outputting target official documents in descending order of the similarity calculated by SimRank; the similarity is the similarity between the search content and a node.
In the above method, apparatus, computer device, and medium for recommending official documents based on a graph structure, the keyword tags given by TF-IDF are relatively objective: they are obtained by a statistical method, which helps ensure that they are comprehensive and have a low error rate, and the number of keyword tags is controllable, so a rich set of keyword tags can be guaranteed. The topic tags given by the LDA topic model are likewise relatively objective: the text topic corresponding to each keyword is obtained by model computation, which helps ensure that the topic tags are comprehensive and have a low error rate. SimRank calculates the similarity between the search content input by the user and the nodes. Because SimRank combines features from the text of multiple official documents, highly relevant target official documents can be recommended, improving the accuracy and efficiency of the recommendation. The similarity between objects measured by SimRank is closer to human intuitive judgment, and ordering the output target official documents by this similarity improves the user experience.
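As a rough illustration of the SimRank ranking mentioned above, the following sketch runs the standard iterative SimRank update on a small undirected document-tag graph. The graph contents, the decay factor 0.8, the iteration count, and the idea of treating the query as an existing node are assumptions for illustration only; the application does not state how the search content is mapped onto the graph.

```python
def simrank(neighbors, decay=0.8, iterations=10):
    """Iterative SimRank on an undirected graph.

    neighbors maps each node to the set of nodes adjacent to it.
    Returns a dict mapping (a, b) to the similarity score.
    """
    nodes = sorted(neighbors)
    # Initial condition: a node is fully similar to itself only.
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0
                    continue
                na, nb = neighbors[a], neighbors[b]
                if not na or not nb:
                    new_sim[(a, b)] = 0.0
                    continue
                # Average similarity over all neighbor pairs, then decay.
                total = sum(sim[(i, j)] for i in na for j in nb)
                new_sim[(a, b)] = decay * total / (len(na) * len(nb))
        sim = new_sim
    return sim


# Hypothetical graph: documents linked to the tags they carry.
graph = {
    "doc_a": {"kw_report", "kw_budget"},
    "doc_b": {"kw_report"},
    "doc_c": {"kw_travel"},
    "kw_report": {"doc_a", "doc_b"},
    "kw_budget": {"doc_a"},
    "kw_travel": {"doc_c"},
}
scores = simrank(graph)
ranked = sorted(
    (n for n in graph if n.startswith("doc_") and n != "doc_a"),
    key=lambda n: -scores[("doc_a", n)],
)
print(ranked)
```

Here `doc_b` shares a keyword tag with `doc_a` and therefore ranks above `doc_c`, which lies in a separate part of the graph and keeps a similarity of zero.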
The details of one or more embodiments of the present application are set forth in the following drawings and description. Other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of a method for recommending official documents based on a graph structure in an embodiment of the present application;
FIG. 2 is a flowchart of a method for recommending official documents based on a graph structure in an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an apparatus for recommending official documents based on a graph structure in an embodiment of the present application;
FIG. 4 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
The method for recommending official documents based on a graph structure provided in this application can be applied in the application environment shown in FIG. 1, in which a client communicates with a server through a network. The client may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device. The server can be implemented as an independent server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a method for recommending official documents based on a graph structure is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
S10: acquire multiple official documents of different official document types, determine feature words in the acquired official documents according to TF-IDF based on preset word statistical features, filter out, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency, and record the filtered feature words as keyword tags of the corresponding official document;
Understandably, the official documents cover at least 15 current official document types, including but not limited to orders, decisions, announcements, circulars, and notices. TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting technique commonly used in information retrieval and data mining and can serve as a keyword extraction method. TF is the frequency with which a feature word occurs in the official document, and IDF is derived inversely from how often the feature word occurs in other official documents. Feature words generally are words that can represent the content of the official document. Personal pronouns, modal particles, and conjunctions are generally not treated as feature words, whereas specific action verbs in an official document can be. The word statistical features can be set as required to determine which feature words are to be extracted from the official document, so the word statistical features in this embodiment cover the features of several kinds of feature words, such as the verb feature corresponding to action verbs. The more frequently a feature word occurs, the more representative and important it is in the official document. Optionally, the preset frequency can be set according to the application field, since some feature words are biased toward a particular field. In the official document field of this embodiment, the preset frequency can be set so that about 10 feature words are retained; the specific number can be determined as required. In this embodiment, the keyword tags given by TF-IDF are relatively objective: they are obtained by a statistical method, which helps ensure that they are comprehensive and have a low error rate, and the number of keyword tags is controllable, so a rich set of keyword tags can be guaranteed.
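Step S10 can be sketched as follows, assuming the official documents have already been tokenized. The function name `tfidf_keyword_tags`, the smoothed IDF formula, the stop-word set, and the use of a top-k cutoff in place of the preset frequency threshold are illustrative assumptions, not details taken from the application.

```python
import math
from collections import Counter


def tfidf_keyword_tags(docs, top_k=10, stop_words=frozenset()):
    """Return up to top_k highest-scoring words per document.

    docs maps a document id to its list of tokens. The score is
    TF * smoothed IDF, with IDF = log((1 + N) / (1 + df)) + 1.
    """
    # Document frequency: number of documents containing each word.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    n_docs = len(docs)

    tags = {}
    for doc_id, tokens in docs.items():
        # Drop words that should never become feature words.
        words = [t for t in tokens if t not in stop_words]
        counts = Counter(words)
        total = len(words) or 1
        scores = {
            w: (c / total) * (math.log((1 + n_docs) / (1 + df[w])) + 1)
            for w, c in counts.items()
        }
        # Highest score first; ties broken alphabetically for stability.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        tags[doc_id] = ranked[:top_k]
    return tags


docs = {
    "notice": ["submit", "report", "report", "deadline"],
    "order": ["appoint", "director", "report"],
}
tags = tfidf_keyword_tags(docs, top_k=2)
print(tags)
```

A top-k cutoff makes the number of keyword tags directly controllable, which matches the stated goal of keeping about 10 feature words per document; a score threshold would work equally well if a preset frequency is preferred.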
S20: input the official document into a preset LDA topic model, calculate the text topic-keyword distribution probability matrix of the official document through the LDA topic model, then obtain the text topics that the LDA topic model selects from that matrix because their selection probability is greater than or equal to a preset probability, and record the selected text topics as topic tags of the corresponding official document; the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
Understandably, the LDA topic model is a document topic generation model and also a three-layer Bayesian probability model, which can extract the text topic-keyword distribution probability matrix from the official document (the topic-keyword distribution matrix is calculated from the between-class scatter matrix S_B and the within-class scatter matrix S_W, and serves as the feature matrix
Figure PCTCN2020116744-appb-000001
When an official document is input into the preset LDA topic model, the word embedding feature of the official document can be multiplied by the matrix W in the model to obtain the text topic-keyword distribution probability matrix of the official document; the word embedding feature is the document-level feature obtained by applying word embedding to the official document, and word embedding is a method of converting the keywords in the official document into numeric vectors). Specifically, the LDA topic model calculates, for each keyword, the distribution probability that the keyword belongs to any one of the topics, and this distribution probability is used as the selection probability; one selection probability represents the probability that one keyword is associated with a text topic of the official document. The LDA topic model then compares each selection probability with the preset probability and outputs the text topics whose selection probability is greater than or equal to the preset probability. In the official document field of this embodiment, the preset probability can be set so that about 3 text topics are selected, with the specific number determined as required. In this embodiment, the topic tags given by the LDA topic model are relatively objective because the text topic corresponding to each keyword is obtained by model computation, which gives the resulting topic tags the advantages of comprehensive coverage and a low error rate.
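The comparison of selection probabilities against the preset probability can be sketched as follows. The example assumes that a topic model has already produced a keyword-by-topic probability matrix; the matrix values, the function name `select_topic_tags`, and the choice to score each topic by the highest probability any keyword assigns to it are illustrative assumptions, not details given in this application.

```python
def select_topic_tags(matrix, preset_probability=0.5, max_topics=3):
    """Pick topic tags from a keyword -> {topic: probability} matrix.

    A topic is kept when at least one keyword belongs to it with a
    probability greater than or equal to preset_probability (assumed rule).
    """
    best = {}
    for keyword, row in matrix.items():
        for topic, probability in row.items():
            best[topic] = max(best.get(topic, 0.0), probability)
    # Highest probability first; ties broken alphabetically.
    ranked = sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, p in ranked if p >= preset_probability][:max_topics]


# Hypothetical selection probabilities for three keywords and three topics.
matrix = {
    "transfer":   {"personnel change": 0.80, "budget": 0.15, "holiday": 0.05},
    "department": {"personnel change": 0.60, "budget": 0.30, "holiday": 0.10},
    "funding":    {"personnel change": 0.10, "budget": 0.85, "holiday": 0.05},
}
topic_tags = select_topic_tags(matrix)
```

Here `max_topics=3` reflects the "about 3 text topics" mentioned above; raising `preset_probability` yields fewer topic tags.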
S30. Generate official document attributes according to the keyword tags and the topic tags;
Understandably, the official document attributes represent the key attributes of the official document, and include the keyword tags, the topic tags, numeric entities, the receipt time of the official document, the sending unit, and so on.
S40. Obtain the record data of the official documents for each official document type, and use the Neo4j framework to build a graph structure-based official document recommendation library from the record data and the official document attributes of the official documents; the official document recommendation library contains multiple graph structures, each graph structure corresponds to the official documents of at least one official document type, and each graph structure contains multiple interconnected nodes; each node represents one of the record data, a keyword tag, and a topic tag;
Understandably, the record data is the data formed in the database for each official document; one piece of record data corresponds to an official document of one official document type and may include the full content data of the official document. The Neo4j framework is a high-performance NoSQL graph database that stores structured data as a network (a graph) rather than in tables, and it can also be regarded as a high-performance graph engine. This embodiment therefore uses the Neo4j framework to build a graph-structured official document recommendation library from the record data and the official document attributes. The library contains multiple graph structures, each graph structure may contain multiple nodes, and each graph structure may cover the official documents of at least one official document type. For example, official document A is taken as a node and is linked both to the node for the keyword tag "personnel transfer to a certain department" and to the node for the topic tag "personnel change", which yields the graph structure "personnel transfer to a certain department - official document A - personnel change". In this embodiment, an official document recommendation library containing multiple graph structures works more efficiently than a traditional database or a traditional search engine, and its recommendation efficiency is not affected even when a very large number of official documents is stored in it.
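The graph structure in the example above (keyword tag, official document A, topic tag) can be sketched in memory as follows. This is only an illustration of the node-and-edge layout using plain Python containers, not the Neo4j storage itself; the class name `DocumentGraph` and its methods are hypothetical.

```python
from collections import defaultdict


class DocumentGraph:
    """Tiny in-memory stand-in for one graph structure of the library."""

    def __init__(self):
        self.node_kind = {}            # node name -> "record" | "keyword" | "topic"
        self.edges = defaultdict(set)  # undirected adjacency sets

    def add_node(self, name, kind):
        self.node_kind[name] = kind

    def connect(self, a, b):
        # Connections are stored in both directions.
        self.edges[a].add(b)
        self.edges[b].add(a)

    def documents_for(self, tag):
        """Official documents directly linked to a keyword or topic node."""
        neighbours = self.edges.get(tag, set())
        return sorted(n for n in neighbours if self.node_kind[n] == "record")


graph = DocumentGraph()
graph.add_node("official document A", "record")
graph.add_node("personnel transfer to a certain department", "keyword")
graph.add_node("personnel change", "topic")
graph.connect("official document A", "personnel transfer to a certain department")
graph.connect("official document A", "personnel change")

linked = graph.documents_for("personnel change")
```

Looking up a tag node returns the official documents attached to it in one adjacency step, which is the kind of traversal a graph store is designed for.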
S50. Receive the search content input by the user into the official document recommendation library, and output the target official documents in descending order of the similarity calculated by SimRank; the similarity is the similarity between the search content and the nodes.
Understandably, SimRank is a model that measures the similarity between any two objects based on the topological structure of a graph; here it can also be understood as a calculation method, and the model or method can be embedded in the official document recommendation library. The calculation proceeds as follows. The search content that the user enters through the input interface of the official document recommendation library is obtained; it may be multiple target keywords, an official document name, and so on. The target keywords or official document names in the search content are then treated as search nodes, and SimRank calculates the similarity between the search nodes and each node in the graph structures. For example, if official document A and the search node have 6 nodes in total, of which 2 are shared and 4 are similar, the similarity is 4/6 ≈ 0.67. In this embodiment, the similarity between the search content and the nodes is calculated by SimRank. Because SimRank combines several features from the text of the official documents (the record data and the official document attributes mentioned above), highly relevant target official documents can be recommended, which improves the accuracy and efficiency of the recommendation. The similarity between objects measured by SimRank also agrees better with human intuition, and ordering the output target official documents by that similarity improves the user experience.
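For illustration, the iterative form of SimRank commonly given in the literature can be sketched in a few lines of Python: two nodes are similar to the extent that their neighbours are similar, and every node has similarity 1 with itself. The decay factor, the iteration count, the node names, and the small sample graph are assumptions for this sketch; the application does not specify them, and the simplified 4/6 example above is not reproduced here.

```python
from itertools import product


def simrank(neighbors, decay=0.8, iterations=10):
    """Iterative SimRank on an undirected graph given as adjacency lists."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, nodes):
            if a == b:
                new[(a, b)] = 1.0
                continue
            na, nb = neighbors[a], neighbors[b]
            if not na or not nb:
                new[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of neighbours, damped by decay.
            total = sum(sim[(x, y)] for x in na for y in nb)
            new[(a, b)] = decay * total / (len(na) * len(nb))
        sim = new
    return sim


# Hypothetical graph: the search node shares both tag nodes with docA, none with docB.
neighbors = {
    "query": ["kw:transfer", "topic:personnel"],
    "docA": ["kw:transfer", "topic:personnel"],
    "docB": ["kw:budget"],
    "kw:transfer": ["query", "docA"],
    "topic:personnel": ["query", "docA"],
    "kw:budget": ["docB"],
}
similarity = simrank(neighbors)
ranked = sorted(["docA", "docB"], key=lambda d: similarity[("query", d)], reverse=True)
```

Sorting the candidate documents by their similarity to the search node gives the output order of the target official documents; in this sample, `docA` is ranked ahead of `docB`.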
Further, before the obtaining of multiple official documents of different official document types, the method further includes:
analyzing the overall discourse structure of the official document with a trained BERT model to obtain an analysis result for the overall discourse structure of the official document; the overall discourse structure refers to the component structures of the official document, and the analysis result is the result of judging the completeness and reasonableness of each component structure of the official document;
when the analysis result indicates that one of the component structures of the official document lacks completeness and/or reasonableness, extracting from the official document the component structure that is missing and/or unreasonable, marking the missing and/or unreasonable component structure in the official document by highlighting, and having a preset data recipient revise the official document.
Understandably, the BERT model is a language representation model that can be used to analyze the overall discourse structure and the length of an official document. It is trained as follows. First, the sentences corresponding to the component structures of the training documents are labeled; for example, the sentences of the corresponding paragraphs in the overall discourse structure of a training text are labeled 1-B, 2-B, 3-B, 1-I, 2-I, and 3-I, where 1 denotes the beginning, 2 the discussion, and 3 the ending, all of which are component structures of an official document. The BERT model is then built. Before training, the existing word vectors of the BERT model may be given additional training on the labeled sentences so that the distribution of the word vector representations fits official documents more closely (this step can be skipped if there are too few labeled sentences). During training, the BERT model is fine-tuned repeatedly on top of bert-base to make the word vector distribution more reasonable: the pre-trained word vectors currently provided with BERT are trained on a general Chinese corpus, so their distribution differs from that of the official document field, and the model needs to be fine-tuned to suit that field. After all word vectors have been trained (so that the output of the BERT model captures the nature of the language), the output at the [CLS] position (which holds a high-level feature vector carrying the semantic information of the whole sentence) is taken as the classification result for the component structures of the official document, with one category representing one component structure. This embodiment also further corrects the classification result output by the BERT model in order to resolve jumps between component structures; an example of a jumping structure is 1-B, 1-I, 3-B, 2-B, 2-I, 3-B, that is, beginning - beginning - ending - discussion - discussion - ending, and the correction mainly consists of adjusting the positions of the component structures. The classification result is output as the probabilities of the component structure categories of the official document; comparing each probability with its preset threshold (mainly to detect component structures missing from the official document) determines whether the sentences of the component structure in that category are complete and/or reasonable. In this embodiment, the BERT model makes it possible to identify the component structures of an official document automatically. In doing so, the BERT model is convenient to use, is not affected by the length of the official document, and can break the official document down by structure. The BERT model also generalizes well and can handle official documents of different types. The analysis result it outputs is obtained after analyzing the component structures of the whole official document along multiple dimensions, and it can additionally be used to analyze the number of idioms used and the distribution of length within each component structure.
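The two checks applied to the classification result, detecting a jump between component structures and comparing category probabilities with preset thresholds, can be sketched as follows. The sketch starts from labels and probabilities that a classifier is assumed to have produced already; the function names, the threshold values, and the sample probabilities are hypothetical, and no BERT model is invoked.

```python
# Component numbering used in the labels: 1 = beginning, 2 = discussion, 3 = ending.
COMPONENTS = {1: "beginning", 2: "discussion", 3: "ending"}


def has_jump(labels):
    """True if the component number ever decreases, e.g. ending before discussion."""
    order = [int(label.split("-")[0]) for label in labels]
    return any(later < earlier for earlier, later in zip(order, order[1:]))


def missing_components(probabilities, thresholds):
    """Components whose predicted probability is below its preset threshold."""
    return [
        COMPONENTS[key]
        for key in sorted(COMPONENTS)
        if probabilities.get(key, 0.0) < thresholds[key]
    ]


# The jumping sequence quoted in the text above.
jumping = ["1-B", "1-I", "3-B", "2-B", "2-I", "3-B"]
ordered = ["1-B", "1-I", "2-B", "2-I", "3-B"]

# Hypothetical per-component probabilities and preset thresholds.
probabilities = {1: 0.9, 2: 0.8, 3: 0.2}
thresholds = {1: 0.5, 2: 0.5, 3: 0.5}
missing = missing_components(probabilities, thresholds)
```

A document flagged by either check would then be passed on for highlighting and revision as described earlier.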
Further, the official document attributes also include numeric entities; before the building, through the Neo4j framework, of the graph structure-based official document recommendation library from the record data and the official document attributes of the official documents, the method further includes:
searching the official document for numeric entities with the target entity expression in a preset rule template to locate the target position of a numeric entity, and capturing the numeric entity from the target position with the capture rule expression in the preset rule template.
Understandably, the target entity expression in the preset rule template is used to locate the target position of a numeric entity; it is generally an expression associated with the numeric entity, such as "the estimated total investment amount of the project is". The capture rule expression in the preset rule template is used to capture the numeric entity itself, for example 10,000 yuan. In this embodiment, extracting numeric entities with a preset rule template improves the efficiency and quality of the capture. The same approach also applies to capturing the receipt time of the official document and the sending unit.
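A minimal sketch of such a rule template, using Python regular expressions, is given below. One pattern locates the target position and a second pattern captures the numeric entity that follows it. The English wording of the patterns, the sample sentence, and the function name `extract_numeric_entity` are assumptions for illustration only.

```python
import re

# Target entity expression: locates where the numeric entity should appear.
ANCHOR = re.compile(r"estimated total investment amount of the project is")
# Capture rule expression: a number with optional thousands separators, then a unit.
CAPTURE = re.compile(r"\s*(\d[\d,]*(?:\.\d+)?)\s*(yuan)")


def extract_numeric_entity(text):
    """Return (amount, unit) found right after the anchor phrase, or None."""
    anchor = ANCHOR.search(text)
    if anchor is None:
        return None
    # Try to match the capture rule starting exactly at the located position.
    match = CAPTURE.match(text, anchor.end())
    if match is None:
        return None
    amount = float(match.group(1).replace(",", ""))
    return amount, match.group(2)


sample = "The estimated total investment amount of the project is 10,000 yuan."
entity = extract_numeric_entity(sample)
```

Keeping the locating pattern and the capturing pattern separate means a new kind of entity, such as a receipt time or a sending unit, needs only a new pair of expressions in the template.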
Further, the official document attributes also include the receipt time of the official document and the sending unit; before the building, through the Neo4j framework, of the graph structure-based official document recommendation library from the record data and the official document attributes of the official documents, the method further includes:
obtaining the content of the official document, and identifying from that content, through an NLP model, the receipt time of the official document corresponding to the time component and the sending unit corresponding to the unit component;
the generating of official document attributes according to the keyword tags and the topic tags includes:
generating the official document attributes according to the receipt time of the official document, the sending unit, the keyword tags, and the topic tags.
Understandably, the NLP model is a natural language processing engine. In this embodiment, the NLP model identifies each required component and then the content corresponding to that component, with one component corresponding to one kind of content; examples are the receipt time of the official document, which corresponds to the time component, and the sending unit, which corresponds to the unit component.
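The combination of these pieces into the official document attributes can be sketched as a simple record type. The field names, the function name `generate_attributes`, and the sample values are hypothetical; the values are assumed to have been produced by the earlier steps (keyword tags, topic tags, rule template, and NLP model).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentAttributes:
    """Key attributes of one official document (illustrative field names)."""
    keyword_tags: List[str]
    topic_tags: List[str]
    receipt_time: Optional[str] = None
    sending_unit: Optional[str] = None
    numeric_entities: List[str] = field(default_factory=list)


def generate_attributes(keyword_tags, topic_tags, receipt_time=None,
                        sending_unit=None, numeric_entities=None):
    """Bundle the separately extracted values into one attribute record."""
    return DocumentAttributes(
        keyword_tags=list(keyword_tags),
        topic_tags=list(topic_tags),
        receipt_time=receipt_time,
        sending_unit=sending_unit,
        numeric_entities=list(numeric_entities or []),
    )


# Hypothetical values for official document A.
attributes = generate_attributes(
    keyword_tags=["personnel transfer to a certain department"],
    topic_tags=["personnel change"],
    receipt_time="2020-09-22",
    sending_unit="Human Resources Department",
)
```

Optional fields default to empty so that the basic case, in which only keyword tags and topic tags are available, still produces a valid record.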
Further, the building, through the Neo4j framework, of the graph structure-based official document recommendation library from the record data and the official document attributes of the official documents includes:
building each node corresponding to the official document according to node attributes, through the create-node statement of the Neo4j framework; the node attributes correspond to the record data and the official document attributes respectively;
building the connection relationships between the nodes according to preset relationships, through the create-relationship statement of the Neo4j framework; the preset relationships correspond to the record data and the official document attributes respectively;
determining the paths of all the nodes according to the connection relationships, through the path statement of the Neo4j framework, thereby completing the graph structure-based official document recommendation library.
Understandably, the create-node statement is used to build a node, such as node A; the create-relationship statement is used to build the connection relationships between nodes, such as node A - node B; and the path statement is used to determine all paths or the shortest path between two nodes, such as node A - node B and node A - node C. This embodiment builds the graph structure-based official document recommendation library mainly by executing these statements in the Neo4j framework.
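The three kinds of statements can be illustrated by small Python helpers that produce Cypher query text with named parameters. The label names (`Document`, `Keyword`), the relationship type (`HAS_KEYWORD`), the `name` property, and the helper names are assumptions made for this sketch; the queries are only assembled as strings here and are not sent to a database.

```python
def create_node_statement(label):
    """Cypher text that creates one node; the label must come from a fixed set."""
    return f"CREATE (n:{label} {{name: $name}})"


def create_relationship_statement(rel_type):
    """Cypher text that links two existing nodes found by their name property."""
    return (
        "MATCH (a {name: $source}), (b {name: $target}) "
        f"CREATE (a)-[:{rel_type}]->(b)"
    )


def shortest_path_statement():
    """Cypher text that returns the shortest path between two named nodes."""
    return (
        "MATCH (a {name: $source}), (b {name: $target}), "
        "p = shortestPath((a)-[*]-(b)) RETURN p"
    )


# Statements and parameters for linking official document A to a keyword node.
statements = [
    (create_node_statement("Document"), {"name": "official document A"}),
    (create_node_statement("Keyword"),
     {"name": "personnel transfer to a certain department"}),
    (create_relationship_statement("HAS_KEYWORD"),
     {"source": "official document A",
      "target": "personnel transfer to a certain department"}),
    (shortest_path_statement(),
     {"source": "official document A",
      "target": "personnel transfer to a certain department"}),
]
```

Property values are passed as parameters rather than interpolated into the query text; only the label and relationship type are formatted in, which is why they are assumed to come from a fixed, trusted set.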
Further, after the outputting of the target official documents in descending order of the similarity calculated by SimRank, the method further includes:
compressing the target official documents, output in descending order of similarity, into link points, and, when the user selects at least one link point, presenting in full, in a preset view, the content of the target official document corresponding to that link point; each link point corresponds to one target official document.
Understandably, this embodiment compresses the target official documents into link points, where one link point holds the entire content of one target official document. This saves display resources for the target official documents and avoids degrading the user experience by showing the user target official documents with too much content at once.
In summary, the foregoing provides a graph structure-based official document recommendation method. The keyword tags given by TF-IDF are relatively objective because they are obtained by a statistical method, which gives them the advantages of comprehensive coverage and a low error rate, and the number of keyword tags is controllable, which keeps the keyword tags reasonably rich. The topic tags given by the LDA topic model are relatively objective because the text topic corresponding to each keyword is obtained by model computation, which gives the resulting topic tags the advantages of comprehensive coverage and a low error rate. SimRank calculates the similarity between the search content input by the user and the nodes; because SimRank combines several features from the text of the official documents, highly relevant target official documents can be recommended, which improves the accuracy and efficiency of the recommendation. The similarity between objects measured by SimRank also agrees better with human intuition, and ordering the output target official documents by that similarity improves the user experience.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the order in which the processes are executed is determined by their functions and internal logic, and the sequence numbers do not limit the implementation of the embodiments of this application in any way.
In one embodiment, a graph structure-based official document recommendation apparatus is provided, which corresponds one-to-one to the graph structure-based official document recommendation method in the foregoing embodiments. As shown in FIG. 3, the apparatus includes a first recording module 11, a second recording module 12, a first generating module 13, an establishing module 14, and a calculating module 15. The functional modules are described in detail as follows:
The first recording module 11 is configured to obtain multiple official documents of different official document types, determine the feature words in the obtained official documents by TF-IDF based on preset word statistical features, select by TF-IDF the feature words whose occurrence frequency is greater than or equal to a preset frequency, and record the selected feature words as the keyword tags of the corresponding official documents;
The second recording module 12 is configured to input the official document into a preset LDA topic model, calculate the text topic-keyword distribution probability matrix of the official document through the LDA topic model, then obtain the text topics that the LDA topic model selects according to that matrix as having a selection probability greater than or equal to a preset probability, and record the selected text topics as the topic tags of the corresponding official document; the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
The first generating module 13 is configured to generate official document attributes according to the keyword tags and the topic tags;
The establishing module 14 is configured to obtain the record data of the official documents for each official document type, and to use the Neo4j framework to build a graph structure-based official document recommendation library from the record data and the official document attributes of the official documents; the official document recommendation library contains multiple graph structures, each graph structure corresponds to the official documents of at least one official document type, and each graph structure contains multiple interconnected nodes; each node represents one of the record data, a keyword tag, and a topic tag;
The calculating module 15 is configured to receive the search content input by the user into the official document recommendation library, and to output the target official documents in descending order of the similarity calculated by SimRank; the similarity is the similarity between the search content and the nodes.
Further, the graph structure-based official document recommendation apparatus also includes:
an analysis module, configured to analyze the overall discourse structure of the official document with a trained BERT model to obtain an analysis result for the overall discourse structure of the official document; the overall discourse structure refers to the component structures of the official document, and the analysis result is the result of judging the completeness and reasonableness of each component structure of the official document;
a marking module, configured to, when the analysis result indicates that one of the component structures of the official document lacks completeness and/or reasonableness, extract from the official document the component structure that is missing and/or unreasonable, mark the missing and/or unreasonable component structure in the official document by highlighting, and have a preset data recipient revise the official document.
Further, the graph structure-based official document recommendation apparatus also includes:
a capturing module, configured to search the official document for numeric entities with the target entity expression in a preset rule template to locate the target position of a numeric entity, and to capture the numeric entity from the target position with the capture rule expression in the preset rule template.
Further, the graph structure-based official document recommendation apparatus also includes:
an identification module, configured to obtain the content of the official document, and to identify from that content, through an NLP model, the receipt time of the official document corresponding to the time component and the sending unit corresponding to the unit component;
the generating of official document attributes according to the keyword tags and the topic tags includes:
a second generating module, configured to generate the official document attributes according to the receipt time of the official document, the sending unit, the keyword tags, and the topic tags.
Further, the establishing module includes:
a first building submodule, configured to build each node corresponding to the official document according to node attributes, through the create-node statement of the Neo4j framework; the node attributes correspond to the record data and the official document attributes respectively;
a second building submodule, configured to build the connection relationships between the nodes according to preset relationships, through the create-relationship statement of the Neo4j framework; the preset relationships correspond to the record data and the official document attributes respectively;
an establishing submodule, configured to determine the paths of all the nodes according to the connection relationships, through the path statement of the Neo4j framework, thereby completing the graph structure-based official document recommendation library.
Further, the graph structure-based official document recommendation apparatus also includes:
a selection module, configured to compress the target official documents, output in descending order of similarity, into link points, and, when the user selects at least one link point, to present in full, in a preset view, the content of the target official document corresponding to that link point; each link point corresponds to one target official document.
关于基于图结构的公文推荐装置的具体限定可以参见上文中对于基于图结构的公文推荐方法的限定,在此不再赘述。上述基于图结构的公文推荐装置中的各个模块可全部或部分通过软件、硬件及其组合来实现。上述各模块可以硬件形式内嵌于或独立于计算机设备中的处理器中,也可以以软件形式存储于计算机设备中的存储器中,以便于处理器调用执行以上各个模块对应的操作。Regarding the specific definition of the device for recommending official documents based on the graph structure, please refer to the above definition of the method for recommending official documents based on the graph structure, which will not be repeated here. The various modules in the above-mentioned apparatus for recommending official documents based on the graph structure can be implemented in whole or in part by software, hardware, and a combination thereof. The above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 4. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the readable storage medium. The database stores the data involved in the graph structure-based official document recommendation method. The network interface communicates with external terminals over a network connection. When executed by the processor, the computer-readable instructions implement a graph structure-based official document recommendation method. The readable storage medium provided in this embodiment includes non-volatile and volatile readable storage media.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided; the readable storage media provided in this embodiment include non-volatile and volatile readable storage media. The computer-readable instructions stored on the readable storage media, when executed by one or more processors, cause the one or more processors to implement the steps of the graph structure-based official document recommendation method of the above embodiments, for example steps S10 to S50 shown in FIG. 2. Alternatively, when the processor executes the computer-readable instructions, the functions of the modules/units of the graph structure-based official document recommendation apparatus of the above embodiments are implemented, for example the functions of modules 11 to 15 shown in FIG. 3. To avoid repetition, the details are not described again here.
In one embodiment, a computer-readable storage medium is provided, on which computer-readable instructions are stored. When executed by a processor, the computer-readable instructions implement the steps of the graph structure-based official document recommendation method of the above embodiments, for example steps S10 to S50 shown in FIG. 2. Alternatively, when executed by the processor, the computer-readable instructions implement the functions of the modules/units of the graph structure-based official document recommendation apparatus of the above embodiments, for example the functions of modules 11 to 15 shown in FIG. 3. To avoid repetition, the details are not described again here.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be carried out by computer-readable instructions that direct the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, a database, or another medium in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the above division into functional units and modules is only an example. In practice, the functions may be assigned to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in those embodiments may still be modified, or some of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application, and shall all fall within the scope of protection of this application.

Claims (20)

  1. A graph structure-based official document recommendation method, comprising:
    acquiring multiple official documents of different official document types; determining, according to TF-IDF and based on preset word statistical features, the feature words in the acquired official documents; filtering, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency; and recording the filtered feature words as keyword tags of the corresponding official documents;
    inputting the official documents into a preset LDA topic model; computing, through the LDA topic model, a text topic-keyword distribution probability matrix of each official document; obtaining the text topics that the LDA topic model selects from the text topic-keyword distribution probability matrix of the official document as having a selection probability greater than or equal to a preset probability; and recording the selected text topics as topic tags of the corresponding official document; wherein the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
    generating official document attributes according to the keyword tags and the topic tags;
    acquiring record data of the official documents according to each official document type, and establishing, through the Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents; wherein the official document recommendation library contains multiple graph structures, one graph structure corresponds to the official documents of at least one official document type, one graph structure contains multiple interconnected nodes, and one node represents one of the record data, a keyword tag, and a topic tag;
    upon receiving search content entered by a user in the official document recommendation library, outputting target official documents in descending order of the similarity computed by SimRank; wherein the similarity is the similarity between the search content and a node.
  2. The graph structure-based official document recommendation method according to claim 1, wherein, before the acquiring of multiple official documents of different official document types, the method further comprises:
    analyzing the overall discourse structure of the official document with a trained BERT model to obtain an analysis result for that structure; wherein the overall discourse structure refers to the constituent structures of the official document, and the analysis result is a judgment of the completeness and reasonableness of each constituent structure of the official document;
    when the analysis result indicates that one of the constituent structures of the official document lacks completeness and/or reasonableness, extracting from the official document the missing and/or unreasonable constituent structure, marking it in highlighted form, and having a preset data recipient revise the official document.
  3. The graph structure-based official document recommendation method according to claim 1, wherein the official document attributes further include a numeric entity, and, before the establishing, through the Neo4j framework, of a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents, the method further comprises:
    locating the target position of the numeric entity by searching the official document for numeric entities with a target entity expression in a preset rule template, and capturing the numeric entity from the target position with a capture rule expression in the preset rule template.
  4. The graph structure-based official document recommendation method according to claim 1, wherein the official document attributes further include an official document communication time and a communication unit, and, before the establishing, through the Neo4j framework, of a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents, the method further comprises:
    acquiring the document content of the official document, and identifying from the document content, through an NLP model, the official document communication time corresponding to a time component and the communication unit corresponding to a unit component;
    wherein the generating of official document attributes according to the keyword tags and the topic tags comprises:
    generating the official document attributes according to the official document communication time, the communication unit, the keyword tags, and the topic tags.
  5. The graph structure-based official document recommendation method according to claim 1, wherein the establishing, through the Neo4j framework, of a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents comprises:
    building the nodes corresponding to the official document according to node attributes, using the node-creation statement of the Neo4j framework; wherein the node attributes correspond respectively to the record data and the official document attributes;
    building the connections between the nodes according to preset relationships, using the relationship-creation statement of the Neo4j framework; wherein the preset relationships correspond respectively to the record data and the official document attributes;
    determining the paths of all the nodes according to the connections, using the path statement of the Neo4j framework, thereby completing the graph structure-based official document recommendation library.
  6. The graph structure-based official document recommendation method according to claim 1, wherein, after the outputting of target official documents in descending order of the similarity computed by SimRank, the method further comprises:
    compressing the target official documents, output in descending order of similarity, into link points, and, when the user selects at least one link point, fully presenting in a preset view the document content of the target official document corresponding to that link point; wherein each link point corresponds to one target official document.
  7. A graph structure-based official document recommendation apparatus, comprising:
    a first recording module, configured to acquire multiple official documents of different official document types; determine, according to TF-IDF and based on preset word statistical features, the feature words in the acquired official documents; filter, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency; and record the filtered feature words as keyword tags of the corresponding official documents;
    a second recording module, configured to input the official documents into a preset LDA topic model; compute, through the LDA topic model, a text topic-keyword distribution probability matrix of each official document; obtain the text topics that the LDA topic model selects from the text topic-keyword distribution probability matrix of the official document as having a selection probability greater than or equal to a preset probability; and record the selected text topics as topic tags of the corresponding official document; wherein the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
    a first generation module, configured to generate official document attributes according to the keyword tags and the topic tags;
    an establishment module, configured to acquire record data of the official documents according to each official document type, and establish, through the Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents; wherein the official document recommendation library contains multiple graph structures, one graph structure corresponds to the official documents of at least one official document type, one graph structure contains multiple interconnected nodes, and one node represents one of the record data, a keyword tag, and a topic tag;
    a calculation module, configured to, upon receiving search content entered by a user in the official document recommendation library, output target official documents in descending order of the similarity computed by SimRank; wherein the similarity is the similarity between the search content and a node.
  8. The graph structure-based official document recommendation apparatus according to claim 7, further comprising:
    an analysis module, configured to analyze the overall discourse structure of the official document with a trained BERT model to obtain an analysis result for that structure; wherein the overall discourse structure refers to the constituent structures of the official document, and the analysis result is a judgment of the completeness and reasonableness of each constituent structure of the official document;
    a marking module, configured to, when the analysis result indicates that one of the constituent structures of the official document lacks completeness and/or reasonableness, extract from the official document the missing and/or unreasonable constituent structure, mark it in highlighted form, and have a preset data recipient revise the official document.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring multiple official documents of different official document types; determining, according to TF-IDF and based on preset word statistical features, the feature words in the acquired official documents; filtering, according to TF-IDF, the feature words whose occurrence frequency is greater than or equal to a preset frequency; and recording the filtered feature words as keyword tags of the corresponding official documents;
    inputting the official documents into a preset LDA topic model; computing, through the LDA topic model, a text topic-keyword distribution probability matrix of each official document; obtaining the text topics that the LDA topic model selects from the text topic-keyword distribution probability matrix of the official document as having a selection probability greater than or equal to a preset probability; and recording the selected text topics as topic tags of the corresponding official document; wherein the text topic-keyword distribution probability matrix contains multiple selection probabilities, and a selection probability is the probability that a keyword in the official document belongs to a text topic of that official document;
    generating official document attributes according to the keyword tags and the topic tags;
    acquiring record data of the official documents according to each official document type, and establishing, through the Neo4j framework, a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents; wherein the official document recommendation library contains multiple graph structures, one graph structure corresponds to the official documents of at least one official document type, one graph structure contains multiple interconnected nodes, and one node represents one of the record data, a keyword tag, and a topic tag;
    upon receiving search content entered by a user in the official document recommendation library, outputting target official documents in descending order of the similarity computed by SimRank; wherein the similarity is the similarity between the search content and a node.
  10. The computer device according to claim 9, wherein, before the acquiring of multiple official documents of different official document types, the processor, when executing the computer-readable instructions, further implements the following steps:
    analyzing the overall discourse structure of the official document with a trained BERT model to obtain an analysis result for that structure; wherein the overall discourse structure refers to the constituent structures of the official document, and the analysis result is a judgment of the completeness and reasonableness of each constituent structure of the official document;
    when the analysis result indicates that one of the constituent structures of the official document lacks completeness and/or reasonableness, extracting from the official document the missing and/or unreasonable constituent structure, marking it in highlighted form, and having a preset data recipient revise the official document.
  11. The computer device according to claim 9, wherein the official document attributes further include a numeric entity, and, before the establishing, through the Neo4j framework, of a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents, the processor, when executing the computer-readable instructions, further implements the following step:
    locating the target position of the numeric entity by searching the official document for numeric entities with a target entity expression in a preset rule template, and capturing the numeric entity from the target position with a capture rule expression in the preset rule template.
  12. The computer device according to claim 9, wherein the official document attributes further include an official document communication time and a communication unit, and, before the establishing, through the Neo4j framework, of a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents, the processor, when executing the computer-readable instructions, further implements the following steps:
    acquiring the document content of the official document, and identifying from the document content, through an NLP model, the official document communication time corresponding to a time component and the communication unit corresponding to a unit component;
    wherein the generating of official document attributes according to the keyword tags and the topic tags comprises:
    generating the official document attributes according to the official document communication time, the communication unit, the keyword tags, and the topic tags.
  13. The computer device according to claim 9, wherein the establishing, through the Neo4j framework, of a graph structure-based official document recommendation library according to the record data and the official document attributes of the official documents comprises:
    building the nodes corresponding to the official document according to node attributes, using the node-creation statement of the Neo4j framework; wherein the node attributes correspond respectively to the record data and the official document attributes;
    building the connections between the nodes according to preset relationships, using the relationship-creation statement of the Neo4j framework; wherein the preset relationships correspond respectively to the record data and the official document attributes;
    determining the paths of all the nodes according to the connections, using the path statement of the Neo4j framework, thereby completing the graph structure-based official document recommendation library.
  14. The computer device according to claim 9, wherein, after the outputting of target official documents in descending order of the similarity computed by SimRank, the processor, when executing the computer-readable instructions, further implements the following step:
    compressing the target official documents, output in descending order of similarity, into link points, and, when the user selects at least one link point, fully presenting in a preset view the document content of the target official document corresponding to that link point; wherein each link point corresponds to one target official document.
  15. 一个或多个存储有计算机可读指令的可读存储介质,其中,所述计算机可读指令被一个或多个处理器执行时,使得所述一个或多个处理器执行如下步骤:One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
    acquiring multiple official documents of different official document types, determining feature words in the acquired official documents according to TF-IDF based on preset word statistical features, filtering by TF-IDF the feature words whose frequency of occurrence is greater than or equal to a preset frequency, and recording the filtered feature words as keyword tags of the corresponding official document;
    inputting the official documents into a preset LDA topic model, computing through the LDA topic model a text topic-keyword distribution probability matrix for each official document, then obtaining the text topics that the LDA topic model selects, based on that matrix, as having a selection probability greater than or equal to a preset probability, and recording the selected text topics as topic tags of the corresponding official document; the text topic-keyword distribution probability matrix contains multiple selection probabilities, a selection probability being the probability that a keyword in the official document belongs to a text topic of that official document;
    generating official document attributes from the keyword tags and the topic tags;
    obtaining record data of the official documents for each official document type, and building, through the Neo4j framework, a graph-structure-based official document recommendation library from the record data and the official document attributes; the official document recommendation library contains multiple graph structures, each graph structure corresponds to the official documents of at least one official document type, and each graph structure contains multiple interconnected nodes; each node represents one of the record data, a keyword tag, or a topic tag;
    receiving retrieval content entered by a user in the official document recommendation library, and outputting target official documents in descending order of the similarity computed by SimRank; the similarity is the similarity between the retrieval content and a node.
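The keyword-tagging and similarity-ranking steps above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function names, the sample documents, the `tag:` node prefix, the decay factor of 0.8, and the reading of the "preset frequency" as a threshold on the TF-IDF score are all assumptions made for illustration, and the retrieval content is simply treated as an existing document node.

```python
import math
from collections import Counter
from itertools import product


def tfidf_keyword_tags(docs, min_score):
    """Return, per document, the words whose TF-IDF score is >= min_score."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each word occurs.
    df = Counter(word for words in docs.values() for word in set(words))
    tags = {}
    for doc_id, words in docs.items():
        tf = Counter(words)
        tags[doc_id] = {
            word
            for word, count in tf.items()
            if (count / len(words)) * math.log(n_docs / df[word]) >= min_score
        }
    return tags


def build_graph(tags):
    """Undirected bipartite graph linking each document to its keyword tags."""
    graph = {}
    for doc_id, words in tags.items():
        graph.setdefault(doc_id, set())
        for word in words:
            tag_node = "tag:" + word  # hypothetical naming for tag nodes
            graph[doc_id].add(tag_node)
            graph.setdefault(tag_node, set()).add(doc_id)
    return graph


def simrank(neighbors, decay=0.8, iterations=10):
    """Iterative SimRank over a graph given as {node: set of neighbors}."""
    nodes = list(neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new_sim = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new_sim[(a, b)] = 1.0
            elif not neighbors[a] or not neighbors[b]:
                new_sim[(a, b)] = 0.0
            else:
                # Average similarity over all pairs of neighbors, scaled by decay.
                total = sum(sim[(i, j)] for i in neighbors[a] for j in neighbors[b])
                new_sim[(a, b)] = decay * total / (len(neighbors[a]) * len(neighbors[b]))
        sim = new_sim
    return sim


# Hypothetical, already tokenized documents.
docs = {
    "doc1": ["budget", "approval", "notice", "budget"],
    "doc2": ["budget", "audit", "notice"],
    "doc3": ["holiday", "schedule", "notice"],
}
tags = tfidf_keyword_tags(docs, min_score=0.1)
sim = simrank(build_graph(tags))
# Rank the other documents by their SimRank similarity to "doc1".
ranked = sorted(
    (d for d in docs if d != "doc1"),
    key=lambda d: sim[("doc1", d)],
    reverse=True,
)
```

In this sketch "notice" occurs in every document, so its inverse document frequency is zero and it is never kept as a tag; `doc2` ranks above `doc3` because it shares the `budget` tag with `doc1`.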
  16. The readable storage medium of claim 15, wherein, before the acquiring of multiple official documents of different official document types, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    analyzing the overall discourse structure of the official document with a trained BERT model to obtain an analysis result for the overall discourse structure; the overall discourse structure refers to the constituent structures of the official document, and the analysis result is a judgment on the completeness and reasonableness of each constituent structure of the official document;
    when the analysis result indicates that one of the constituent structures of the official document lacks completeness and/or reasonableness, extracting from the official document the missing and/or unreasonable constituent structure, marking the missing and/or unreasonable constituent structure in highlighted form, and having a preset data recipient revise the official document.
  17. The readable storage medium of claim 15, wherein the official document attributes further include numeric entities; before the building, through the Neo4j framework, of the graph-structure-based official document recommendation library from the record data and the official document attributes, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following step:
    performing a numeric entity search on the official document using a target entity expression in a preset rule template to locate the target position of the numeric entity, and capturing the numeric entity from the target position using a capture rule expression in the preset rule template.
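The two-stage extraction in claim 17 can be sketched with Python's `re` module: one pattern locates the position where a numeric entity is expected, and a second pattern captures the number starting at that position. The two patterns, the trigger words, and the sample sentence are hypothetical; the source does not disclose the contents of the preset rule template.

```python
import re

# Hypothetical "target entity expression": cue words that precede a number.
TARGET_ENTITY = re.compile(r"(?:amount|total|quantity)\s*(?:of|:)?\s*", re.IGNORECASE)
# Hypothetical "capture rule expression": integers or decimals with optional
# thousands separators.
CAPTURE_RULE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def extract_numeric_entities(text):
    """Return (position, entity) pairs for numbers that follow a cue word."""
    entities = []
    for hit in TARGET_ENTITY.finditer(text):
        # Try to capture a number exactly where the cue expression ended.
        captured = CAPTURE_RULE.match(text, hit.end())
        if captured:
            entities.append((hit.end(), captured.group()))
    return entities


sample = "Approved amount: 1,250,000 yuan; quantity of 30 units."
found = extract_numeric_entities(sample)
```

Separating the locating pattern from the capturing pattern keeps each expression simple, and lets the same capture rule be reused behind different cue words.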
  18. The readable storage medium of claim 15, wherein the official document attributes further include a receipt time of the official document and a sending unit; before the building, through the Neo4j framework, of the graph-structure-based official document recommendation library from the record data and the official document attributes, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following steps:
    obtaining the content of the official document, and identifying from the content, through an NLP model, the receipt time corresponding to the time component and the sending unit corresponding to the unit component;
    the generating of official document attributes from the keyword tags and the topic tags comprises:
    generating the official document attributes from the receipt time, the sending unit, the keyword tags, and the topic tags.
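As a rough illustration of how the four inputs of claim 18 might be combined into one attribute record, the sketch below uses a Python dataclass. The class name, field names, sample values, and the default preset probability of 0.3 are assumptions; the topic probabilities stand in for the output of the LDA topic model, which is not run here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DocumentAttributes:
    """Hypothetical container for the attributes named in claim 18."""
    received_on: date
    sending_unit: str
    keyword_tags: frozenset
    topic_tags: frozenset


def select_topic_tags(topic_probabilities, preset_probability):
    """Keep the topics whose probability is >= the preset probability."""
    return frozenset(
        topic for topic, p in topic_probabilities.items() if p >= preset_probability
    )


def build_attributes(received_on, sending_unit, keyword_tags,
                     topic_probabilities, preset_probability=0.3):
    """Combine receipt time, sending unit, keyword tags, and topic tags."""
    return DocumentAttributes(
        received_on=received_on,
        sending_unit=sending_unit,
        keyword_tags=frozenset(keyword_tags),
        topic_tags=select_topic_tags(topic_probabilities, preset_probability),
    )


attrs = build_attributes(
    received_on=date(2020, 5, 29),
    sending_unit="Finance Office",  # hypothetical sending unit
    keyword_tags={"budget", "approval"},
    topic_probabilities={"finance": 0.62, "personnel": 0.08},
)
```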
  19. The readable storage medium of claim 15, wherein the building, through the Neo4j framework, of the graph-structure-based official document recommendation library from the record data and the official document attributes comprises:
    constructing, with the node-creation statement of the Neo4j framework and according to node attributes, the nodes corresponding to the official document; the node attributes correspond respectively to the record data and the official document attributes;
    constructing, with the relationship-creation statement of the Neo4j framework and according to preset relationships, the connections between the nodes; the preset relationships correspond respectively to the record data and the official document attributes;
    determining, with the path statement of the Neo4j framework and according to the connections, the paths of all the nodes, thereby completing the graph-structure-based official document recommendation library.
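The three statements in claim 19 map naturally onto Cypher's `CREATE` and `MATCH` clauses. The sketch below only builds the query strings in Python and does not connect to a database; the labels `Document` and `Keyword`, the relationship type `HAS_KEYWORD`, the `id` property, and the helper names are hypothetical, since the source does not specify the graph schema.

```python
def create_node(label, props):
    """Cypher statement that creates one node; values are passed as parameters."""
    assignments = ", ".join(f"{key}: ${key}" for key in props)
    return f"CREATE (n:{label} {{{assignments}}})", dict(props)


def create_relationship(from_label, to_label, rel_type):
    """Cypher statement that connects two existing nodes found by their id."""
    return (
        f"MATCH (a:{from_label} {{id: $from_id}}), (b:{to_label} {{id: $to_id}}) "
        f"CREATE (a)-[:{rel_type}]->(b)"
    )


def match_paths(label, max_hops):
    """Cypher statement that returns the paths starting from one node."""
    return f"MATCH p = (a:{label} {{id: $id}})-[*1..{max_hops}]-(b) RETURN p"


node_query, node_params = create_node(
    "Document", {"id": "doc1", "title": "Budget notice"}
)
rel_query = create_relationship("Document", "Keyword", "HAS_KEYWORD")
path_query = match_paths("Document", 2)
```

Labels and relationship types cannot be supplied as Cypher parameters, which is why they are interpolated into the query text here; in practice they would need to come from a fixed, trusted list such as the preset relationships, while property values go through parameters.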
  20. The readable storage medium of claim 15, wherein, after the outputting of target official documents in descending order of the similarity computed by SimRank, the computer-readable instructions, when executed by one or more processors, cause the one or more processors to further perform the following step:
    compressing the target official documents, output in descending order of similarity, into link points, and, when the user selects at least one link point, presenting in full, in a preset view, the content of the target official document corresponding to that link point; each link point corresponds to one target official document.
PCT/CN2020/116744 2020-05-29 2020-09-22 Graph structure-based official document recommendation method, apparatus, computer device, and medium WO2021114810A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010475897.9A CN111666401B (en) 2020-05-29 2020-05-29 Document recommendation method, device, computer equipment and medium based on graph structure
CN202010475897.9 2020-05-29

Publications (1)

Publication Number Publication Date
WO2021114810A1 (en) 2021-06-17

Family

ID=72385175

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116744 WO2021114810A1 (en) 2020-05-29 2020-09-22 Graph structure-based official document recommendation method, apparatus, computer device, and medium

Country Status (2)

Country Link
CN (1) CN111666401B (en)
WO (1) WO2021114810A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111666401B (en) * 2020-05-29 2023-06-30 平安科技(深圳)有限公司 Document recommendation method, device, computer equipment and medium based on graph structure
CN112131471B * 2020-09-23 2023-10-20 平安国际智慧城市科技股份有限公司 Method, device, equipment and medium for recommending relationship based on unweighted undirected graph
CN113553825B (en) * 2021-07-23 2023-03-21 安徽商信政通信息技术股份有限公司 Method and system for analyzing context relationship of electronic official document
CN115168567B (en) * 2022-09-07 2022-12-02 北京慧点科技有限公司 Knowledge graph-based object recommendation method
CN115238065B (en) * 2022-09-22 2022-12-20 太极计算机股份有限公司 Intelligent document recommendation method based on federal learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10255283B1 (en) * 2016-09-19 2019-04-09 Amazon Technologies, Inc. Document content analysis based on topic modeling
CN110532451A (en) * 2019-06-26 2019-12-03 平安科技(深圳)有限公司 Search method and device for policy text, storage medium, electronic device
CN110889045A (en) * 2019-10-12 2020-03-17 平安科技(深圳)有限公司 Label analysis method, device and computer readable storage medium
CN111666401A (en) * 2020-05-29 2020-09-15 平安科技(深圳)有限公司 Official document recommendation method and device based on graph structure, computer equipment and medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103425799B * 2013-09-04 2016-06-15 北京邮电大学 Topic-based personalized research direction recommendation system and recommendation method
CN105677769B * 2015-12-29 2018-01-05 广州神马移动信息科技有限公司 Keyword recommendation method and system based on a latent Dirichlet allocation (LDA) model
CN108399228B (en) * 2018-02-12 2020-11-13 平安科技(深圳)有限公司 Article classification method and device, computer equipment and storage medium
CN108491529B (en) * 2018-03-28 2021-11-16 百度在线网络技术(北京)有限公司 Information recommendation method and device
CN111061957A (en) * 2019-12-26 2020-04-24 广东电网有限责任公司 Article similarity recommendation method and device

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722582A (en) * 2021-07-29 2021-11-30 黑龙江先进信息技术有限公司 Recommendation method, system, program product and medium based on pet feature tag
CN114115878A (en) * 2021-11-29 2022-03-01 杭州数梦工场科技有限公司 Workflow node recommendation method and device
CN115994261A (en) * 2022-11-11 2023-04-21 广州宏天软件股份有限公司 Numerical value recommendation method in form linkage change
CN115994261B (en) * 2022-11-11 2023-07-07 广州宏天软件股份有限公司 Numerical value recommendation method in form linkage change
CN116541377A (en) * 2023-04-27 2023-08-04 阿里巴巴(中国)有限公司 Processing method and system of materialized view of task and electronic equipment

Also Published As

Publication number Publication date
CN111666401B (en) 2023-06-30
CN111666401A (en) 2020-09-15

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20899526

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20899526

Country of ref document: EP

Kind code of ref document: A1