CN112487160B - Technical document tracing method and device, computer equipment and computer storage medium - Google Patents


Info

Publication number
CN112487160B
Authority
CN
China
Prior art keywords
document
technical
feature vector
technical document
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011337966.6A
Other languages
Chinese (zh)
Other versions
CN112487160A (en)
Inventor
殷达
谭咏霖
丁铭
唐杰
刘德兵
仇瑜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zhipu Huazhang Technology Co ltd
Original Assignee
Beijing Zhipu Huazhang Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhipu Huazhang Technology Co Ltd
Priority to CN202011337966.6A
Publication of CN112487160A
Application granted
Publication of CN112487160B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Abstract

The invention provides a technical document tracing method and device, computer equipment and a computer storage medium. The technical document tracing method may comprise the following steps: searching for a plurality of reference technical documents that have an association relationship with a target technical document; creating a feature vector for each technical document, the feature vector characterizing the text features of the technical document and the association features between different technical documents; clustering the reference technical documents based on the feature vectors to form a plurality of document sets; for each document set, arranging the reference technical documents in chronological order to form a tracing route; and generating a tree diagram that represents the tracing result of the target technical document. The method performs traceability analysis on technical documents, mines the influence relationships and useful information among them, quickly generates a traceability tree representing the tracing result of the target technical document, vividly depicts the evolution of a technology or idea, and helps researchers and other users locate the content that actually merits in-depth reading.

Description

Technical document tracing method and device, computer equipment and computer storage medium
Technical Field
The invention relates to the technical field of technical document processing, in particular to a technical document tracing method and device, computer equipment and a computer storage medium.
Background
With the rapid growth of academic research, academic papers are being produced at an ever faster rate. Researchers, students or enthusiasts entering a new field often need to retrieve and read a large amount of material on the basic knowledge points mentioned in the paper they are reading. Because they are unfamiliar with the field, retrieval takes a great deal of time, they often read irrelevant material after heading in the wrong direction, and they acquire useful knowledge inefficiently. Leading-edge researchers sometimes need to summarize how a technology has developed before innovating further. This requires manually analyzing a large number of technical documents such as academic papers, which consumes much valuable time, and the manual analysis depends heavily on factors such as the researcher's experience and subjective focus.
Therefore, how to effectively help users such as researchers and leading-edge technologists acquire useful knowledge more efficiently, and shorten retrieval and manual analysis time as far as possible, has become a key technical problem for those skilled in the art to solve and a focus of ongoing research.
Disclosure of Invention
To solve the problems in the prior art of inefficient acquisition of useful knowledge and lengthy manual analysis of technical documents, the invention provides a technical document tracing method and device, computer equipment and a computer storage medium, which aim to help users acquire knowledge more efficiently and shorten retrieval and analysis time.
To achieve the above technical objects, one or more embodiments of the present invention can provide a technical document tracing method, which may include, but is not limited to, at least one of the following steps.
Searching for and obtaining, based on a given target technical document, a plurality of reference technical documents that have an association relationship with the target technical document.
Creating a feature vector of each technical document, wherein the feature vector represents the text features of the technical document and the association features between different technical documents; the technical documents include the target technical document and the reference technical documents.
Clustering the reference technical documents based on the feature vectors to form a plurality of document sets.
For each document set, arranging the reference technical documents in chronological order to form a tracing route.
Taking the target technical document as the root node and the reference technical documents as leaf nodes, and connecting the root node and the leaf nodes according to the tracing routes to generate a tree diagram that represents the tracing result of the target technical document.
Further, the technical document tracing method further comprises the following steps:
generating a label for each document set according to the keyword information of the reference technical documents in that document set;
and setting the corresponding label on each tracing route in the tree diagram.
Further, the technical document tracing method further comprises the following steps:
calculating the influence value of each document set on the target technical document according to the feature vector of each technical document;
and marking each tracing route in the tree diagram according to its influence value.
Further, creating the feature vector of each technical document includes:
extracting the text data in each technical document and creating a text feature vector from the text data, the text feature vector representing the text features of the technical document;
creating a graph feature vector based on the text feature vector, a first association relationship and a second association relationship, the graph feature vector representing the association features among different technical documents, where the first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents;
and creating the feature vector of each technical document according to the text feature vector and the graph feature vector.
Further, creating the text feature vector from the text data comprises:
extracting a first vector from the text data by term frequency-inverse document frequency (TF-IDF);
extracting a second vector from the text data by Sentence-BERT (S-BERT), a sentence-level variant of Bidirectional Encoder Representations from Transformers;
and creating the text feature vector from the first vector and the second vector.
Further, clustering the reference technical documents based on the feature vectors comprises:
clustering the reference technical documents according to their feature vectors, the first association relationship and the second association relationship.
Further, the target technical document is a paper, the reference technical documents are papers cited directly and/or indirectly by that paper, and the association relationship is a citation relationship.
To achieve the above technical objects, one or more embodiments of the present invention may further provide a technical document tracing apparatus, which may include, but is not limited to, a document searching module, a vector creating module, a clustering module, a tracing route generating module, and a tree diagram generating module.
The document searching module is used for searching for and obtaining, based on a given target technical document, a plurality of reference technical documents that have an association relationship with the target technical document.
The vector creating module is used for creating a feature vector of each technical document, wherein the feature vector represents the text features of the technical document and the association features between different technical documents; the technical documents include the target technical document and the reference technical documents.
The clustering module is used for clustering the reference technical documents based on the feature vectors to form a plurality of document sets.
The tracing route generating module is used for arranging, for each document set, the reference technical documents in chronological order to form a tracing route.
The tree diagram generating module is used for taking the target technical document as the root node and the reference technical documents as leaf nodes, and connecting the root node and the leaf nodes according to the tracing routes to generate a tree diagram that represents the tracing result of the target technical document.
To achieve the above technical object, the present invention can also provide a computer device, which includes a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the technical document tracing method in any embodiment of the present invention.
To achieve the above technical objects, the present invention may also provide a computer storage medium having computer readable instructions stored thereon, which when executed by a processor implement the steps of the technical document tracing method according to any embodiment of the present invention.
The beneficial effects of the invention are as follows. The invention performs traceability analysis on a given technical document, mines the influence relationships and useful information among technical documents, and quickly generates a traceability tree representing the tracing result of the target technical document. The evolution of a technology or idea is thereby depicted vividly as a paper map, so that users such as researchers can accurately, quickly and intuitively locate the content that actually merits in-depth reading.
For researchers, students or enthusiasts entering a new field, the invention can provide the papers related to the knowledge points involved in the academic paper being studied, which greatly reduces retrieval time and avoids time spent reading documents of low relevance. For leading-edge researchers, the invention can provide the traceability analysis and evolution of one or more technologies, help them summarize technical development and evolution accurately and quickly, and help inspire them to identify possible next points of technical innovation. Compared with the conventional technology, the invention addresses the problems of inefficient acquisition of useful knowledge and lengthy manual analysis of technical documents, and is of great help to users such as leading-edge researchers, students and enthusiasts.
Drawings
FIG. 1 is a flow diagram illustrating a method for tracing a technical document according to one or more embodiments of the invention.
FIG. 2 illustrates a flow diagram for forming a complete technical document traceability tree in one or more embodiments of the present invention.
FIG. 3 is a diagram illustrating a complete paper traceability tree generated in one embodiment of the present invention for characterizing a target paper traceability result.
FIG. 4 is a diagram illustrating the components of a technical document tracing apparatus in one or more embodiments of the invention.
FIG. 5 shows a schematic diagram of the internal structure of a computer device in one or more embodiments of the invention.
Detailed Description
The technical document tracing method and apparatus, the computer device, and the computer storage medium provided by the present invention are explained in detail below with reference to the drawings of the specification.
As shown in fig. 1, in conjunction with fig. 2, one or more embodiments of the present invention can provide a technical document tracing method that can form a complete technical document tracing tree for characterizing a target technical document tracing result, and the technical document tracing method can include, but is not limited to, at least one of the following steps.
Step 100: based on a given target technical document, search for a plurality of reference technical documents that have an association relationship with it. The target technical document may be a paper, in which case the reference technical documents are papers cited directly and/or indirectly by that paper, and the association relationship is a citation relationship. The method can thus be used to analyze automatically the origins from which an academic paper and its ideas evolved. Of course, the target technical document may also be another kind of technical document, such as a patent document or a periodical.
Taking a paper as an example, the invention can find the papers cited by the target paper in a paper library. The search can proceed breadth-first over the paper citation network, starting from the target paper. After the search at a given depth finishes, the search stops if the number of papers obtained exceeds the set number of cited papers; otherwise it goes one layer deeper, and so on until the set number of cited papers is reached. In this embodiment, if the number of papers obtained exceeds the set number, the method further includes a step of screening the papers cited by the target paper. The screening step comprises: rebuilding the candidate papers into a citation network and calculating a score for each paper, where the score may be computed by, but is not limited to, the PageRank algorithm, or may be the number of times the paper has been cited; then sorting the candidate papers by citation order and score, for example with lower-order papers first and, among papers of the same order, higher-scoring papers first; and finally selecting the set number of candidate papers from the front of the sorted list.
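The search and screening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the dictionary-based citation graph and the toy paper ids are hypothetical, the score is the citation count inside the candidate network (the text equally allows PageRank), and ties are broken by paper id only to keep the result deterministic.

```python
def collect_candidates(citations, target, limit):
    """Breadth-first search over a citation graph, one depth level at a time.

    citations: dict mapping a paper id to the list of paper ids it cites.
    Returns {paper_id: order}, where order 1 means cited directly by the target.
    A level is always finished before the stop condition is checked.
    """
    order_of = {}
    frontier = [target]
    order = 0
    while frontier and len(order_of) < limit:
        order += 1
        next_frontier = []
        for paper in frontier:
            for cited in citations.get(paper, []):
                if cited != target and cited not in order_of:
                    order_of[cited] = order
                    next_frontier.append(cited)
        frontier = next_frontier
    return order_of


def screen_candidates(citations, order_of, limit):
    """Keep `limit` papers: lower citation order first, higher score first within an order."""
    # Assumed score: how often a candidate is cited by other candidates.
    score = {paper: 0 for paper in order_of}
    for paper in order_of:
        for cited in citations.get(paper, []):
            if cited in score:
                score[cited] += 1
    ranked = sorted(order_of, key=lambda p: (order_of[p], -score[p], p))
    return ranked[:limit]


# Toy graph with made-up ids: q cites a and b, a cites c and d, and so on.
toy = {"q": ["a", "b"], "a": ["c", "d"], "b": ["c"], "c": ["e"]}
found = collect_candidates(toy, "q", 3)
print(screen_candidates(toy, found, 3))
```

In the toy graph the second level yields four candidates, one more than the limit, so the screening step drops the second-order paper with the lowest score.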
In the invention, the papers cited directly by the target paper form a set R1 and the papers cited indirectly form a set R2; indirect citations include second-order and higher-order citations. It is understood that the papers in set R2 are cited directly or indirectly by papers in set R1. Some embodiments of the invention focus on the important papers most closely related to the target paper, so the number of cited papers in the final paper traceability tree can be set to 100 by default.
As shown in fig. 2, the papers with an important influence on the target paper include papers it cites directly, such as GPT [Radford, 2018], ELMo [Peters, 2018] and GloVe [Pennington, 2014], and papers it cites indirectly, such as Seq2Seq [Sutskever, 2014].
Step 200: create a feature vector of each technical document, wherein the feature vector represents the text features of the technical document and the association features between different technical documents. The technical documents include the target technical document and the reference technical documents. The method converts the tracing of the target technical document into the calculation of feature vectors, and obtains the paper traceability tree from those feature vectors. Taking a paper q as an example, the task can be abstracted as computing a paper traceability tree (V, E, C, W) for the paper q, where V denotes the nodes of the traceability tree, each node being a related cited paper; E denotes the edges of the traceability tree, each edge representing a potential route of evolution between papers; C denotes a number of tracing routes, each route comprising some cited papers and a label representing their content; and W denotes the influence values (or scores), each cited paper carrying a value (or score) for its influence on the target paper.
The step 200 of the present invention may include steps 201 to 203, that is, creating the feature vector of each technical document includes the following steps 201 to 203.
Step 201: extract the text data in each technical document and create a text feature vector from the text data, the text feature vector representing the text features of the technical document. Creating the text feature vector from the text data comprises the following. A first vector is extracted from the text data by term frequency-inverse document frequency (TF-IDF); the invention can extract a sparse vector in this way and use it as the first vector. Stemming can be applied while processing the text data, so that different forms of a word with the same meaning are reduced to the same form, for example doing, done and did are unified into do. An n-gram method can also be used to extract all words and phrases of a set length (e.g., 1 to 5 words), and the calculation can be restricted to a set number (e.g., 2000) of the most frequent terms. A second vector is extracted from the text data by Sentence-BERT (S-BERT, a sentence-level variant of Bidirectional Encoder Representations from Transformers); the second vector can be obtained by encoding the text sequence, and a fixed number (e.g., 512) of words of the text data (e.g., the abstract) may be taken directly as the input to the encoder.
The text feature vector is created from the first vector and the second vector. It can be obtained by combining the two vectors directly, or by combining them after processing such as denoising. The text feature vector accurately expresses the content of a technical document at the language level; the invention thus takes full account of the content of the technical document and mines the technical content in depth.
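As a rough illustration of step 201, the sketch below builds TF-IDF vectors over the most frequent n-grams and concatenates each with a dense vector. It is a simplified stand-in under stated assumptions: the function names are hypothetical, stemming is omitted, the IDF uses the plain log(N / df) form, and the dense vector passed to `text_feature` is a placeholder for the output of a sentence encoder such as S-BERT, which is not run here.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """All word n-grams of length 1..max_n, joined by spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(texts, max_n=2, vocab_size=2000):
    """TF-IDF vectors restricted to the `vocab_size` most frequent n-grams."""
    docs = [Counter(ngrams(text.lower().split(), max_n)) for text in texts]
    total = Counter()
    for doc in docs:
        total.update(doc)
    # Most frequent first; alphabetical tie-break keeps the vocabulary order stable.
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [gram for gram, _ in ranked[:vocab_size]]
    n_docs = len(docs)
    idf = {g: math.log(n_docs / sum(1 for d in docs if g in d)) for g in vocab}
    return vocab, [[doc[g] * idf[g] for g in vocab] for doc in docs]


def text_feature(first_vector, second_vector):
    """Direct combination (concatenation) of the first and second vectors."""
    return list(first_vector) + list(second_vector)


vocab, sparse = tfidf_vectors(["machine reading", "machine translation"], max_n=1)
dense_placeholder = [0.5, 0.25]  # stands in for an encoder output; values are made up
print(vocab, text_feature(sparse[0], dense_placeholder))
```

A term that occurs in every document gets an IDF of zero, which is why "machine" contributes nothing to either vector in the usage example.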
Taking the paper as an example, the invention can extract text data from the title, abstract and other contents of the paper, and certainly is not limited to the title and abstract.
Step 202: create a graph feature vector based on the text feature vector, the first association relationship and the second association relationship. The graph feature vector represents the association features among different technical documents and expresses the content of a technical document at the level of the structure that links the reference technical documents to one another. The first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents. In particular, some embodiments of the invention can use a fast graph embedding method (ProNE) to obtain the graph feature vectors. The first step performs a fast singular value decomposition (SVD) of the adjacency matrix of the graph, which represents the first and second association relationships, to obtain an initial vector for each node in the graph. The second step filters the adjacency matrix of the graph in the spectral domain and then performs feature propagation over the filtered adjacency matrix, starting from the initial vectors obtained in the first step; the content propagated can specifically be the content of the text feature vectors. The graph feature vectors are obtained in this way.
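The idea of propagating text features along citation edges can be illustrated with the deliberately simplified sketch below. It is not ProNE: the SVD initialization and the spectral filter are left out, and the propagation rule (average a node's vector with those of its neighbours) is an assumption chosen for brevity. The function and parameter names are hypothetical.

```python
def propagate(adjacency, features, steps=2):
    """Smooth node features over a graph.

    adjacency: square 0/1 matrix, adjacency[i][j] = 1 if paper i cites paper j.
    features:  one feature vector (list of floats) per node.
    Each step replaces a node's vector by the mean of its own vector and the
    vectors of all nodes linked to it in either direction.
    """
    n = len(adjacency)
    current = [list(vector) for vector in features]
    for _ in range(steps):
        updated = []
        for i in range(n):
            group = [i] + [j for j in range(n)
                           if j != i and (adjacency[i][j] or adjacency[j][i])]
            dim = len(current[i])
            updated.append([sum(current[j][k] for j in group) / len(group)
                            for k in range(dim)])
        current = updated
    return current


# Nodes 0 and 1 are linked by a citation; node 2 is isolated.
links = [[0, 1, 0],
         [0, 0, 0],
         [0, 0, 0]]
vectors = [[1.0, 0.0], [0.0, 1.0], [4.0, 4.0]]
print(propagate(links, vectors, steps=1))
```

After one step the two linked papers share the same vector while the isolated paper keeps its own, which is the qualitative effect the graph feature vector is meant to capture: papers connected by citations move closer in feature space.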
Step 203: create the feature vector of each technical document according to the text feature vector and the graph feature vector. The feature vectors of the technical documents can be obtained through iterative computation, which can be understood as a form of feature propagation, for example implemented with a Propagate function (propagation function).
[Formula image BDA0002797800610000091 is not reproduced in the source text.]
Here x_o denotes the feature vector of a technical document, and the symbols shown in images BDA0002797800610000092 and BDA0002797800610000093 denote the text feature vector and the graph feature vector of the technical document, respectively.
Step 300: cluster the reference technical documents based on the feature vectors to form a plurality of document sets. The clustering may be based on the feature vectors of the reference technical documents together with the first and second association relationships. Specifically, the invention can cluster the reference technical documents with the kernel k-means algorithm, using Euclidean distance as the clustering criterion. The Euclidean distance of each point x_i from the center of class C_t (symbol image BDA0002797800610000101) may be expressed as:
[Formula images BDA0002797800610000102 and BDA0002797800610000103 are not reproduced in the source text.]
where the quantity defined in formula image BDA0002797800610000104 (also not reproduced) comprises three terms.
The first term aims to group points that are close in Euclidean distance into the same class. The second term is αA_ij (default α = 1.0), where A is the adjacency matrix and A_ij indicates whether a citation relationship exists between paper p_i and paper p_j; this is the term emphasized by the spectral clustering adopted in this embodiment, and it aims to group points with citation relationships into the same class as far as possible. The third term β_ij (default β = 1.0) is an additional constraint term that can be set as the case requires.
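Because the formulas survive only as images, the sketch below relies on the standard kernel k-means distance rather than on the patent's exact expression: the squared distance of point i from the center of class C is K_ii - (2/|C|) Σ_{j∈C} K_ij + (1/|C|²) Σ_{j,l∈C} K_jl. The kernel is assumed to be a dot product of feature vectors plus α times the citation indicator plus an optional β_ij; the dot-product first term, the function names and the fixed initial assignment are all assumptions made for illustration.

```python
def build_kernel(vectors, adjacency, alpha=1.0, beta=None):
    """K[i][j] = x_i . x_j + alpha * A_ij + beta_ij (first term is an assumed linear kernel)."""
    n = len(vectors)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dot = sum(a * b for a, b in zip(vectors[i], vectors[j]))
            link = 1.0 if (adjacency[i][j] or adjacency[j][i]) else 0.0
            extra = beta[i][j] if beta is not None else 0.0
            kernel[i][j] = dot + alpha * link + extra
    return kernel


def kernel_kmeans(kernel, initial_labels, max_iter=100):
    """Reassign every point to the class whose implicit center is nearest."""
    labels = list(initial_labels)
    for _ in range(max_iter):
        clusters = {}
        for index, label in enumerate(labels):
            clusters.setdefault(label, []).append(index)
        new_labels = []
        for i in range(len(kernel)):
            best_label, best_dist = None, None
            for label, members in sorted(clusters.items()):
                size = len(members)
                cross = sum(kernel[i][j] for j in members) / size
                within = sum(kernel[j][l] for j in members for l in members) / size ** 2
                dist = kernel[i][i] - 2.0 * cross + within
                if best_dist is None or dist < best_dist:
                    best_label, best_dist = label, dist
            new_labels.append(best_label)
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels


# Four made-up papers: 0 cites 1 and 2 cites 3; their vectors form two loose groups.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
a = [[0, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0]]
print(kernel_kmeans(build_kernel(x, a), initial_labels=[0, 1, 1, 1]))
```

Two design points are worth noting. Adding the same constant β to every kernel entry leaves all distances unchanged, so only a β_ij that varies between pairs has any effect. And because the citation term can make the kernel indefinite, the iteration is not guaranteed to converge from every starting assignment, which is why `max_iter` caps it.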
Taking fig. 2 as an example, QANet [Yu, 2018] and SQuAD [Rajpurkar, 2016] are, respectively, a model paper and a dataset paper in the machine reading field. Both abstracts mention words related to machine reading comprehension, such as "reading", "question answering" and "SQuAD", and the two papers share many common neighbor nodes in the citation network, such as TriviaQA [Joshi, 2017] and U-Net [Sun, 2018]. QANet [Yu, 2018] and SQuAD [Rajpurkar, 2016] are therefore close in the feature space and are grouped into the same cluster. Similarly, machine translation papers such as Attention, GoogleNMT and Seq2Seq are also grouped into one category.
Step 400: for each document set, arrange the reference technical documents in chronological order to form a tracing route. Taking papers as an example, sorting is performed according to citation order and publication time. As shown in figs. 2 and 3, papers in the same category can be linked into two timelines: a main timeline composed of direct citations and a secondary timeline composed of indirect citations. The most recently published paper in the secondary timeline is linked to the node on the main timeline from the same period, which connects all the papers of the category. The most recently published main-timeline paper of each category is then connected to the target paper, so that all cited papers and the target paper form a paper traceability tree with the target paper as root node and the cited papers as leaf nodes.
As shown in fig. 2, Seq2Seq, being an indirectly cited paper, is not placed on the main timeline where Attention and GoogleNMT are located; it appears on the secondary timeline and is linked to the nearest main-timeline node, GoogleNMT. The most recently published papers of three different categories, QANet, GPT and Attention, are all connected to the target paper (here "BERT"), forming the complete framework of the traceability tree.
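The linking of one category into a main and a secondary timeline can be sketched as follows. This is an illustrative reading of the description, with several assumptions: papers are given as hypothetical `(paper_id, year, is_direct)` tuples, "the same period" is taken to mean the main-timeline node closest in publication year, and a category with no direct citations hangs its secondary timeline from the target. The ids and years in the usage example are made up.

```python
def build_route_edges(target, papers):
    """Return (parent, child) edges for one document set.

    papers: list of (paper_id, year, is_direct) tuples, where is_direct marks
    papers cited directly by the target.
    """
    main = sorted((p for p in papers if p[2]), key=lambda p: p[1], reverse=True)
    secondary = sorted((p for p in papers if not p[2]), key=lambda p: p[1], reverse=True)
    edges = []
    if main:
        # The newest direct citation hangs off the target; older ones follow in time order.
        edges.append((target, main[0][0]))
        edges += [(newer[0], older[0]) for newer, older in zip(main, main[1:])]
    if secondary:
        newest = secondary[0]
        if main:
            # Assumed rule: attach to the main-timeline node closest in publication year.
            anchor = min(main, key=lambda p: abs(p[1] - newest[1]))
            edges.append((anchor[0], newest[0]))
        else:
            # Assumed fallback when the category has no direct citation.
            edges.append((target, newest[0]))
        edges += [(newer[0], older[0]) for newer, older in zip(secondary, secondary[1:])]
    return edges


category = [("m1", 2017, True), ("m2", 2016, True), ("s1", 2014, False)]
print(build_route_edges("q", category))
```

Running the function once per document set and taking the union of the returned edges gives the skeleton of the tree, with the target as the only node that has no parent.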
The invention can generate a label for each document set according to the keyword information of the reference technical documents in that set. The label may be, for example, information about the category of the cluster, such as a word that occurs frequently across the reference technical documents of the set. In some embodiments of the present invention, the labels of a document set may be selected by word-distribution co-occurrence: in the label selection process, all the text of the reference technical documents in a category is regarded as a first word distribution, and each candidate label forms a second word distribution according to its co-occurrence with other words.
As shown in FIG. 2, the label of the category containing QANet and SQuAD is "reading comprehension", the label of the category containing Attention, GoogleNMT and Seq2Seq is "machine translation", and the label of the category containing GPT, ELMo and GloVe is "language models".
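A much simpler stand-in for the label selection is sketched below: it picks the keywords shared by the most documents in a set. This implements only the "frequently occurring word" variant mentioned above, not the word-distribution co-occurrence method, and the function name and the keyword lists in the example are hypothetical.

```python
from collections import Counter


def cluster_label(keyword_lists, top_k=1):
    """Return the `top_k` keywords that appear in the most documents of one set."""
    doc_freq = Counter()
    for keywords in keyword_lists:
        doc_freq.update(set(keywords))  # count each keyword once per document
    # Highest document frequency first; alphabetical tie-break for determinism.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


docs = [["machine translation", "attention"],
        ["machine translation", "rnn"],
        ["machine translation", "attention"]]
print(cluster_label(docs))
```

Counting documents rather than raw occurrences keeps one long paper from dominating the label of its category.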
The invention can also calculate the influence value of each document set on the target technical document according to the feature vector of each technical document.
Specifically, the invention can use K_ij from the clustering process as the influence of reference technical document p_i on the target technical document q. The influence value is given by:
[Formula image BDA0002797800610000121 is not reproduced in the source text.]
where i_q denotes the index of the target technical document q. It can be understood that the invention can also calculate the influence value of each category from those of its reference technical documents:
[Formula image BDA0002797800610000122 is not reproduced in the source text.]
The influence of each category describes how strongly the corresponding tracing route influences the target technical document.
As shown in fig. 2, some embodiments of the present invention may represent the magnitude of the influence value by the depth of color and by the thickness of the tracing route. The tracing route of QANet and SQuAD under the label "reading comprehension" is thicker, indicating that this route has a greater influence on the target technical document. Attention and GoogleNMT under the label "machine translation" are marked in a darker color, indicating that these technical documents have a greater influence on the target technical document.
Step 500: take the target technical document as the root node and the reference technical documents as leaf nodes, and connect the root node and the leaf nodes according to the tracing routes to generate a tree diagram that represents the tracing result of the target technical document. The tree diagram can be understood as the skeleton of the complete traceability tree. The invention may further comprise: setting the corresponding label on each tracing route in the tree diagram, and marking each tracing route in the tree diagram according to its influence value.
FIG. 3 shows a traceability tree generated in one embodiment of the present invention to represent the tracing result of a target technical document. It illustrates the traceability tree of the target paper "BERT": the paper at the top is "BERT", the cited papers form the "tree", and the papers cited by "BERT" are arranged from top to bottom in chronological order. Papers of different categories are placed on different tracing routes, and the influence of each paper and each route on the target paper can be obtained by calculation.
The invention can automatically find the pioneering papers that have had an important influence on the target paper, perform traceability analysis on a given academic paper, organize the results into a clearly arranged traceability tree, and trace origins in each related field direction.
As shown in fig. 4, the present invention can also provide a technical document tracing apparatus, which may include, but is not limited to, a document searching module, a vector creating module, a clustering module, a tracing route generating module, and a tree diagram generating module.
The document searching module is used for searching for, based on a given target technical document, a plurality of reference technical documents that have an association relationship with the target technical document. It is understood that the target technical document in the present invention may include, but is not limited to, a paper; in that case the reference technical documents are papers directly referenced by the paper and/or papers indirectly referenced by the paper, and the association relationship is a reference relationship. Of course, the target technical document may also be another kind of technical document, such as a patent document or a periodical.
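Collecting directly and indirectly referenced papers can be sketched as a bounded breadth-first traversal of a citation graph. The function `collect_references`, the dictionary-based graph, and the `max_depth` limit of two hops are illustrative assumptions, not details given in the description.

```python
from collections import deque


def collect_references(target, cites, max_depth=2):
    """Return documents reachable from target within max_depth reference hops.

    cites maps a document id to the list of documents it references.
    Depth 1 gives direct references; depth 2 adds indirect ones.
    """
    seen = {target}
    found = []
    queue = deque([(target, 0)])
    while queue:
        doc, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the hop limit
        for ref in cites.get(doc, []):
            if ref not in seen:
                seen.add(ref)
                found.append(ref)
                queue.append((ref, depth + 1))
    return found


# Hypothetical citation graph.
cites = {"T": ["A", "B"], "A": ["C"], "C": ["D"]}
print(collect_references("T", cites))  # ['A', 'B', 'C']
```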
The vector creating module is used for creating a feature vector of each technical document, and the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents include a target technical document and a reference technical document.
The vector creating module specifically comprises a first creating submodule, a second creating submodule and a third creating submodule.
The first creating submodule is used for extracting the text data in each technical document and creating a text feature vector from the text data, the text feature vector representing the text features of the technical document. Specifically, the first creating submodule extracts a first vector from the text data by term frequency-inverse document frequency (TF-IDF), extracts a second vector from the text data by Sentence-BERT (sentence-level Bidirectional Encoder Representations from Transformers), and creates the text feature vector from the first vector and the second vector. The second creating submodule is used for creating a graph feature vector based on the text feature vector, the first association relationship and the second association relationship, the graph feature vector representing the association features between different technical documents.
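The text feature vector can be sketched as follows. This is a minimal illustration using a hand-written TF-IDF; the sentence embedding is taken as an input (in practice it might come from a Sentence-BERT model, which is not run here), and joining the two vectors by concatenation is an assumption, since the description only says the text feature vector is created from both. All function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute a plain TF-IDF vector per document over a shared, sorted vocabulary."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            [(tf[t] / len(toks)) * math.log(n / df[t]) if toks else 0.0 for t in vocab]
        )
    return vocab, vectors


def text_feature_vector(tfidf_vec, sentence_vec):
    """Assumed combination: concatenate the first (TF-IDF) and second (sentence) vectors."""
    return list(tfidf_vec) + list(sentence_vec)


vocab, vecs = tfidf_vectors(["graph neural network", "neural machine translation"])
# The sentence vector below is a made-up placeholder, not a real embedding.
print(text_feature_vector(vecs[0], [0.1, 0.2]))
```

Terms that occur in every document (here "neural") receive a weight of zero, which is the usual effect of the inverse document frequency factor.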
The first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents. The third creating submodule is used for creating the feature vector of each technical document from the text feature vector and the graph feature vector.
The clustering processing module is used for clustering the reference technical documents based on the feature vectors to form a plurality of document sets. Specifically, the clustering processing module clusters the reference technical documents according to the feature vectors of the reference technical documents, the first association relationship and the second association relationship.
The tracing route generating module is used for arranging, for each document set, the reference technical documents in chronological order to form a tracing route.
The technical document tracing apparatus may further comprise a label generation module and an influence calculation module. The label generation module can be used for generating a label for each document set according to the keyword information of the reference technical documents in that document set. The influence calculation module is used for calculating the influence value of each document set on the target technical document according to the feature vector of each technical document.
The tree diagram generating module is used for taking the target technical document as the root node and the reference technical documents as leaf nodes, and connecting the root node and the leaf nodes according to the tracing routes to generate the tree diagram representing the tracing result of the target technical document. The tree diagram generating module can also be used for setting the corresponding label for each tracing route in the tree diagram, and for marking each tracing route in the tree diagram according to its influence value.
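The description does not say how the graph feature vector is computed. The sketch below substitutes a deliberately simple stand-in, averaging the text feature vectors of a document's neighbors in the association graph, and then concatenates text and graph vectors to form the final feature vector. The neighbor-mean aggregation, the concatenation, and the names `graph_feature_vectors` and `combine` are all assumptions for illustration.

```python
def graph_feature_vectors(text_vecs, edges):
    """Average the text vectors of each document's neighbours (an assumed aggregation).

    text_vecs maps a document id to its text feature vector.
    edges lists (doc_a, doc_b) pairs covering both kinds of association:
    target-to-reference and reference-to-reference.
    """
    neighbours = {doc: set() for doc in text_vecs}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    dim = len(next(iter(text_vecs.values())))
    graph_vecs = {}
    for doc, nbrs in neighbours.items():
        if not nbrs:
            graph_vecs[doc] = [0.0] * dim  # isolated document: no neighbour signal
            continue
        graph_vecs[doc] = [
            sum(text_vecs[n][i] for n in nbrs) / len(nbrs) for i in range(dim)
        ]
    return graph_vecs


def combine(text_vecs, graph_vecs):
    """Assumed final feature vector: text vector followed by graph vector."""
    return {doc: list(text_vecs[doc]) + graph_vecs[doc] for doc in text_vecs}


text = {"T": [1.0, 0.0], "A": [0.0, 1.0], "B": [1.0, 1.0]}
edges = [("T", "A"), ("T", "B"), ("A", "B")]
print(combine(text, graph_feature_vectors(text, edges))["T"])  # [1.0, 0.0, 0.5, 1.0]
```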
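The clustering algorithm is not named in the description. As a sketch only, the following uses plain k-means on the feature vectors, with the first k points as initial centroids so that the result is deterministic; it does not use the association relationships separately, on the assumption that they are already encoded in the graph part of each feature vector. The function name `kmeans` and its parameters are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Assign each point to one of k clusters using plain k-means.

    Initial centroids are the first k points (a simplifying assumption).
    Returns a list with the cluster index of each point.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        assignment = [
            min(range(k), key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


print(kmeans([[0, 0], [10, 10], [0, 1], [10, 11]], k=2))  # [0, 1, 0, 1]
```

Each resulting cluster index corresponds to one document set.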
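Forming a tracing route per document set amounts to grouping documents by cluster and sorting each group by time. The sketch below assumes each document is a dictionary with a `year` field and that ascending order is wanted; both are illustrative assumptions, as are the function name and the sample data.

```python
def trace_routes(documents, assignment):
    """Group documents by cluster index and sort each group by publication year."""
    routes = {}
    for doc, cluster in zip(documents, assignment):
        routes.setdefault(cluster, []).append(doc)
    for docs in routes.values():
        docs.sort(key=lambda d: d["year"])  # oldest first (assumed direction)
    return routes


# Hypothetical documents and cluster assignment.
docs = [
    {"title": "doc B", "year": 2018},
    {"title": "doc A", "year": 2015},
    {"title": "doc C", "year": 2016},
]
routes = trace_routes(docs, [0, 0, 1])
print([d["title"] for d in routes[0]])  # ['doc A', 'doc B']
```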
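Neither the labeling rule nor the influence formula is given in the description. The sketch below makes two assumptions: the label of a document set is its most frequent keyword, and the influence value is the cosine similarity between the target document's feature vector and the mean feature vector of the set. The names `set_label` and `influence` are hypothetical.

```python
import math
from collections import Counter


def set_label(keyword_lists):
    """Pick the most frequent keyword across the documents of one set (assumed rule)."""
    counts = Counter(k for kws in keyword_lists for k in kws)
    return counts.most_common(1)[0][0]


def influence(target_vec, member_vecs):
    """Cosine similarity between the target vector and the set centroid (assumed measure)."""
    centroid = [sum(col) / len(member_vecs) for col in zip(*member_vecs)]
    dot = sum(a * b for a, b in zip(target_vec, centroid))
    norm = math.sqrt(sum(a * a for a in target_vec)) * math.sqrt(sum(b * b for b in centroid))
    return dot / norm if norm else 0.0


print(set_label([["qa", "bert"], ["qa"]]))                # qa
print(influence([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]]))    # 1.0
```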
As shown in Fig. 5, the present invention may provide a computer device including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the technical document tracing method in any embodiment of the present invention. The technical document tracing method may include, but is not limited to, at least one of the following steps.
Step 100: based on a given target technical document, find a plurality of reference technical documents having an association relationship with the target technical document. The target technical document can be a paper, the reference technical documents are papers directly referenced by the paper and/or papers indirectly referenced by the paper, and the association relationship is a reference relationship.
Step 200: create a feature vector of each technical document, the feature vector representing the text features of the technical documents and the association features between different technical documents; the technical documents include the target technical document and the reference technical documents. Step 200 may include steps 201 to 203.
Step 201: extract the text data in each technical document and create a text feature vector from the text data, the text feature vector representing the text features of the technical document. Creating the text feature vector from the text data comprises: extracting a first vector from the text data by term frequency-inverse document frequency (TF-IDF), extracting a second vector from the text data by Sentence-BERT (sentence-level Bidirectional Encoder Representations from Transformers), and creating the text feature vector from the first vector and the second vector.
Step 202: create a graph feature vector based on the text feature vector, the first association relationship and the second association relationship, the graph feature vector representing the association features between different technical documents. The first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents.
Step 203: create the feature vector of each technical document from the text feature vector and the graph feature vector.
Step 300: cluster the reference technical documents based on the feature vectors to form a plurality of document sets. This clustering can comprise clustering the reference technical documents according to the feature vectors of the reference technical documents, the first association relationship and the second association relationship.
Step 400: for each document set, arrange the reference technical documents in chronological order to form a tracing route. A label can be generated for each document set according to the keyword information of the reference technical documents in that document set, and the influence value of each document set on the target technical document can be calculated according to the feature vector of each technical document.
Step 500: taking the target technical document as a root node and the reference technical documents as leaf nodes, connect the root node and the leaf nodes according to the tracing routes to generate a tree diagram representing the tracing result of the target technical document.
The technical document tracing method can further comprise: setting the corresponding label for each tracing route in the tree diagram, and marking each tracing route in the tree diagram according to its influence value.
The present invention can also provide a computer storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the technical document tracing method in any embodiment of the present invention. The technical document tracing method may include, but is not limited to, at least one of the following steps.
Step 100: based on a given target technical document, find a plurality of reference technical documents having an association relationship with the target technical document. The target technical document can be a paper, the reference technical documents are papers directly referenced by the paper and/or papers indirectly referenced by the paper, and the association relationship is a reference relationship.
Step 200: create a feature vector of each technical document, the feature vector representing the text features of the technical documents and the association features between different technical documents; the technical documents include the target technical document and the reference technical documents. Step 200 may include steps 201 to 203.
Step 201: extract the text data in each technical document and create a text feature vector from the text data, the text feature vector representing the text features of the technical document. Creating the text feature vector from the text data comprises: extracting a first vector from the text data by term frequency-inverse document frequency (TF-IDF), extracting a second vector from the text data by Sentence-BERT (sentence-level Bidirectional Encoder Representations from Transformers), and creating the text feature vector from the first vector and the second vector.
Step 202: create a graph feature vector based on the text feature vector, the first association relationship and the second association relationship, the graph feature vector representing the association features between different technical documents. The first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents.
Step 203: create the feature vector of each technical document from the text feature vector and the graph feature vector.
Step 300: cluster the reference technical documents based on the feature vectors to form a plurality of document sets. This clustering can comprise clustering the reference technical documents according to the feature vectors of the reference technical documents, the first association relationship and the second association relationship.
Step 400: for each document set, arrange the reference technical documents in chronological order to form a tracing route. A label can be generated for each document set according to the keyword information of the reference technical documents in that document set, and the influence value of each document set on the target technical document can be calculated according to the feature vector of each technical document.
Step 500: taking the target technical document as a root node and the reference technical documents as leaf nodes, connect the root node and the leaf nodes according to the tracing routes to generate a tree diagram representing the tracing result of the target technical document.
The technical document tracing method can further comprise: setting the corresponding label for each tracing route in the tree diagram, and marking each tracing route in the tree diagram according to its influence value.
The logic and/or steps represented in the flowcharts or otherwise described herein, such as an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). Additionally, the computer-readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit having appropriate combinational logic gate circuits, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the terms "the present embodiment," "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples, and features of different embodiments or examples, described in this specification can be combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and simplifications made in the spirit of the present invention are intended to be included in the scope of the present invention.

Claims (9)

1. A technical document tracing method is characterized by comprising the following steps:
searching for and obtaining, based on a given target technical document, a plurality of reference technical documents which have an association relationship with the target technical document;
creating a feature vector of each technical document, wherein the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents comprise the target technical document and the reference technical documents; the creating of the feature vector of each technical document comprises: extracting text data in each technical document, and creating a text feature vector by using the text data, wherein the text feature vector is used for representing text features of the technical document; creating a graph feature vector based on the text feature vector, a first association relationship and a second association relationship, wherein the graph feature vector is used for representing association features between different technical documents; wherein the first association relationship is an association relationship between the target technical document and each reference technical document, and the second association relationship is an association relationship between different reference technical documents; creating a feature vector of each technical document according to the text feature vector and the graph feature vector;
clustering the reference technical documents based on the feature vectors to form a plurality of document sets;
for each document set, arranging the reference technical documents according to a time relation to form a tracing route;
and taking the target technical document as a root node, taking the reference technical document as a leaf node, and connecting the root node and the leaf node according to the tracing route to generate a tree diagram for representing the tracing result of the target technical document.
2. The technical document tracing method according to claim 1, further comprising:
respectively generating labels of all document sets according to the keyword information of the reference technical documents in the document sets;
and correspondingly setting labels for all the tracing routes in the tree diagram.
3. The technical document tracing method according to claim 1, further comprising:
calculating the influence value of each document set on the target technical document according to the feature vector of each technical document;
and marking each tracing route in the tree diagram according to the influence value.
4. The method of claim 1, wherein the creating text feature vectors using the text data comprises:
extracting a first vector from the text data by term frequency-inverse document frequency (TF-IDF);
extracting a second vector from the text data by Sentence-BERT (sentence-level Bidirectional Encoder Representations from Transformers);
creating the text feature vector from the first vector and the second vector.
5. The method of claim 4, wherein the clustering the reference technical documents based on the feature vectors comprises:
and clustering the reference technical documents according to the feature vectors of the reference technical documents, the first association relationship and the second association relationship.
6. The method according to claim 5, wherein the target technical document is a paper, the reference technical document is a paper directly referenced by the paper and/or a paper indirectly referenced by the paper, and the association relationship is a reference relationship.
7. A technical document tracing apparatus, comprising:
the document searching module is used for searching for and obtaining, based on a given target technical document, a plurality of reference technical documents which have an association relationship with the target technical document;
the vector creating module is used for creating a feature vector of each technical document, and the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents comprise the target technical document and the reference technical documents;
the vector creating module specifically comprises a first creating submodule, a second creating submodule and a third creating submodule;
the first creating submodule is used for extracting text data in each technical document and creating a text feature vector by using the text data, and the text feature vector is used for representing text features of the technical document;
the second creating submodule is used for creating a graph feature vector based on the text feature vector, the first association relationship and the second association relationship, and the graph feature vector is used for representing association features between different technical documents;
the first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents;
the third creating submodule is used for creating the feature vector of each technical document according to the text feature vector and the graph feature vector;
the clustering processing module is used for clustering the reference technical documents based on the feature vectors to form a plurality of document sets;
the tracing route generating module is used for arranging the reference technical documents according to the time relationship for each document set to form a tracing route;
and the tree diagram generating module is used for connecting the root node and the leaf nodes according to the tracing route by taking the target technical document as the root node and the reference technical document as the leaf nodes to generate a tree diagram for representing the tracing result of the target technical document.
8. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the technical document tracing method according to any one of claims 1 to 6.
9. A computer storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the steps of the technical document tracing method according to any one of claims 1 to 6.
CN202011337966.6A 2020-11-25 2020-11-25 Technical document tracing method and device, computer equipment and computer storage medium Active CN112487160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011337966.6A CN112487160B (en) 2020-11-25 2020-11-25 Technical document tracing method and device, computer equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011337966.6A CN112487160B (en) 2020-11-25 2020-11-25 Technical document tracing method and device, computer equipment and computer storage medium

Publications (2)

Publication Number Publication Date
CN112487160A CN112487160A (en) 2021-03-12
CN112487160B true CN112487160B (en) 2022-01-04

Family

ID=74934600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011337966.6A Active CN112487160B (en) 2020-11-25 2020-11-25 Technical document tracing method and device, computer equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN112487160B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399488A (en) * 2019-07-05 2019-11-01 深圳和而泰家居在线网络科技有限公司 File classification method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108090178B (en) * 2017-12-15 2020-08-25 北京锐安科技有限公司 Text data analysis method, text data analysis device, server and storage medium
CN111274145A (en) * 2020-01-20 2020-06-12 深圳壹账通智能科技有限公司 Relationship structure chart generation method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112487160A (en) 2021-03-12

Similar Documents

Publication Publication Date Title
CN110399457B (en) Intelligent question answering method and system
US11048882B2 (en) Automatic semantic rating and abstraction of literature
Velardi et al. Ontolearn reloaded: A graph-based algorithm for taxonomy induction
US8484245B2 (en) Large scale unsupervised hierarchical document categorization using ontological guidance
US10460162B2 (en) Method, device, and system, for identifying data elements in data structures
JP2005526317A (en) Method and system for automatically searching a concept hierarchy from a document corpus
US20150006528A1 (en) Hierarchical data structure of documents
US20120078969A1 (en) System and method to extract models from semi-structured documents
CN107463548A (en) Short phrase picking method and device
US11551151B2 (en) Automatically generating a pipeline of a new machine learning project from pipelines of existing machine learning projects stored in a corpus
CN108304382A (en) Mass analysis method based on manufacturing process text data digging and system
CN109508448A (en) Short information method, medium, device are generated based on long article and calculate equipment
EP3968244A1 (en) Automatically curating existing machine learning projects into a corpus adaptable for use in new machine learning projects
CN114239588A (en) Article processing method and device, electronic equipment and medium
US20090234852A1 (en) Sub-linear approximate string match
CN116186381A (en) Intelligent retrieval recommendation method and system
Bowker et al. Information science, terminology and translation studies
CN112487160B (en) Technical document tracing method and device, computer equipment and computer storage medium
CN114207598A (en) Electronic form conversion
JP5679400B2 (en) Category theme phrase extracting device, hierarchical tagging device and method, program, and computer-readable recording medium
US20220067576A1 (en) Automatically labeling functional blocks in pipelines of existing machine learning projects in a corpus adaptable for use in new machine learning projects
CN114328895A (en) News abstract generation method and device and computer equipment
Lamba et al. Tools and techniques for text mining and visualization
Wang et al. A hybrid approach for tag hierarchy construction
Azeroual A text and data analytics approach to enrich the quality of unstructured research information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210326

Address after: 100084 b201c-1, 3rd floor, building 8, yard 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: Beijing innovation Zhiyuan Technology Co.,Ltd.

Address before: B201d-1, 3rd floor, building 8, yard 1, Zhongguancun East Road, Haidian District, Beijing 100083

Applicant before: Beijing Zhiyuan Artificial Intelligence Research Institute

TA01 Transfer of patent application right

Effective date of registration: 20210621

Address after: 100084 603a, 6th floor, building 6, yard 1, Zhongguancun East Road, Haidian District, Beijing

Applicant after: Beijing Zhipu Huazhang Technology Co.,Ltd.

Address before: 100084 b201c-1, 3rd floor, building 8, yard 1, Zhongguancun East Road, Haidian District, Beijing

Applicant before: Beijing innovation Zhiyuan Technology Co.,Ltd.

GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Yin Da

Inventor after: Tan Yonglin

Inventor after: Ding Ming

Inventor after: Liu Debing

Inventor after: Qiu Yu

Inventor before: Yin Da

Inventor before: Tan Yonglin

Inventor before: Ding Ming

Inventor before: Tang Jie

Inventor before: Liu Debing

Inventor before: Qiu Yu