CN112487160B

CN112487160B - Technical document tracing method and device, computer equipment and computer storage medium

Info

Publication number: CN112487160B
Application number: CN202011337966.6A
Authority: CN
Inventors: 殷达; 谭咏霖; 丁铭; 唐杰; 刘德兵; 仇瑜
Original assignee: Beijing Zhipu Huazhang Technology Co Ltd
Current assignee: Beijing Zhipu Huazhang Technology Co Ltd
Priority date: 2020-11-25
Filing date: 2020-11-25
Publication date: 2022-01-04
Anticipated expiration: 2040-11-25
Also published as: CN112487160A

Abstract

The invention provides a technical document tracing method and device, computer equipment and a computer storage medium. The technical document tracing method may comprise the following steps: searching for a plurality of reference technical documents that have an association relation with a target technical document; creating a feature vector for each technical document, the feature vector characterizing the text features of the technical document and the association features between different technical documents; clustering the reference technical documents based on the feature vectors to form a plurality of document sets; for each document set, arranging the reference technical documents according to their time relationship to form a tracing route; and generating a tree diagram representing the tracing result of the target technical document. The method performs traceability analysis on technical documents, mines the influence relations and useful information among them, quickly generates a tracing tree representing the tracing result of the target technical document, clearly depicts the evolution of a technology or idea, and helps researchers and other users locate the content that really needs in-depth reading.

Description

Technical document tracing method and device, computer equipment and computer storage medium

Technical Field

The invention relates to the technical field of technical document processing, in particular to a technical document tracing method and device, computer equipment and a computer storage medium.

Background

With the growth of academic research, academic papers are being produced in ever greater numbers and at an ever faster rate. Researchers, students or enthusiasts entering a new field often need to retrieve and read a large amount of material on the basic knowledge points mentioned in a paper they are reading. Because they are unfamiliar with the field, retrieval takes a long time, they often read irrelevant material after heading in the wrong direction, and their efficiency in acquiring useful knowledge is low. Leading-edge researchers sometimes need to summarize the development of a technology in order to innovate further. This requires manually analyzing a large number of technical documents such as academic papers, which takes up a great deal of valuable time, and the manual analysis depends heavily on factors such as personal experience and subjective focus.

Therefore, how to effectively help users such as researchers and leading-edge technologists acquire useful knowledge more efficiently, and shorten the time spent on retrieval and manual analysis as far as possible, has become a key technical problem to be solved and a focus of ongoing research for those skilled in the art.

Disclosure of Invention

In order to solve the problems of low useful knowledge acquisition efficiency, long time for manually analyzing technical documents and the like in the prior art, the invention provides a technical document tracing method and device, computer equipment and a computer storage medium, and aims to assist users in improving knowledge acquisition efficiency, shortening retrieval and analysis time and the like.

To achieve the above technical objects, one or more embodiments of the present invention can provide a technical document tracing method, which may include, but is not limited to, at least one of the following steps.

Searching for and obtaining, based on a given target technical document, a plurality of reference technical documents that have an association relation with the target technical document.

Creating a feature vector of each technical document, wherein the feature vector is used for representing the text features of the technical document and the association features between different technical documents; the technical documents include the target technical document and the reference technical documents.

Clustering the reference technical documents based on the feature vectors to form a plurality of document sets.

For each document set, arranging the reference technical documents according to their time relationship to form a tracing route.

Taking the target technical document as a root node and the reference technical documents as leaf nodes, and connecting the root node and the leaf nodes according to the tracing routes to generate a tree diagram representing the tracing result of the target technical document.

Further, the technical document tracing method further comprises the following steps:

Generating a label for each document set according to the keyword information of the reference technical documents in that document set.

Setting the corresponding label for each tracing route in the tree diagram.

Calculating the influence value of each document set on the target technical document according to the feature vector of each technical document.

Marking each tracing route in the tree diagram according to its influence value.

Further, the creating of the feature vector of each technical document includes:

Extracting text data from each technical document and creating a text feature vector from the text data, wherein the text feature vector is used for representing the text features of the technical document.

Creating a graph feature vector based on the text feature vector, the first association relation and the second association relation, wherein the graph feature vector is used for representing the association features between different technical documents; the first association relation is the association relation between the target technical document and each reference technical document, and the second association relation is the association relation between different reference technical documents.

Creating the feature vector of each technical document according to the text feature vector and the graph feature vector.

Further, the creating a text feature vector using the text data comprises:

Extracting a first vector from the text data using term frequency-inverse document frequency (TF-IDF).

Extracting a second vector from the text data using Sentence-BERT (S-BERT), a sentence-level Bidirectional Encoder Representations from Transformers model.

Creating the text feature vector from the first vector and the second vector.

Further, the clustering of the reference technical documents based on the feature vectors comprises:

Clustering the reference technical documents according to the feature vectors of the reference technical documents, the first association relation and the second association relation.

Further, the target technical document is a paper, the reference technical documents are papers directly cited by the paper and/or papers indirectly cited by the paper, and the association relation is a citation relation.

To achieve the above technical objects, one or more embodiments of the present invention may further provide a technical document tracing apparatus, which may include, but is not limited to, a document searching module, a vector creating module, a clustering module, a tracing route generating module, and a tree diagram generating module.

The document searching module is used for searching for and obtaining, based on a given target technical document, a plurality of reference technical documents that have an association relation with the target technical document.

The vector creating module is used for creating a feature vector of each technical document, wherein the feature vector is used for representing the text features of the technical document and the association features between different technical documents; the technical documents include the target technical document and the reference technical documents.

The clustering processing module is used for clustering the reference technical documents based on the feature vectors to form a plurality of document sets.

The tracing route generating module is used for arranging, for each document set, the reference technical documents according to their time relationship to form a tracing route.

The tree diagram generating module is used for taking the target technical document as a root node and the reference technical documents as leaf nodes, and connecting the root node and the leaf nodes according to the tracing routes to generate a tree diagram representing the tracing result of the target technical document.

To achieve the above technical object, the present invention can also provide a computer device, which includes a memory and a processor, wherein the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the technical document tracing method in any embodiment of the present invention.

To achieve the above technical objects, the present invention may also provide a computer storage medium having computer readable instructions stored thereon, which when executed by a processor implement the steps of the technical document tracing method according to any embodiment of the present invention.

The beneficial effects of the invention are as follows: the method performs traceability analysis on a given technical document, mines the influence relations and useful information among technical documents, and quickly generates a tracing tree representing the tracing result of the target technical document. The evolution of a technology or idea is thus depicted in the form of a paper map, and users such as researchers can accurately, quickly and intuitively locate the content that really needs in-depth reading.

For researchers, students or enthusiasts entering a new field, the invention can provide the papers related to the knowledge points involved in the academic paper to be studied, greatly reducing the time spent on retrieval and on reading documents of low relevance. For leading-edge researchers, the invention can provide the traceability analysis and evolution process of one or more technologies, help them accurately and quickly summarize technology development and evolution, and help inspire them to discover the next potential point of technical innovation. Compared with the conventional technology, the invention addresses the problems of low efficiency in acquiring useful knowledge and overly long manual analysis of technical documents, and provides substantial help for users such as leading-edge researchers, students and enthusiasts.

Drawings

FIG. 1 is a flow diagram illustrating a method for tracing a technical document according to one or more embodiments of the invention.

FIG. 2 illustrates a flow diagram for forming a complete technical document traceability tree in one or more embodiments of the present invention.

FIG. 3 is a diagram illustrating a complete paper traceability tree generated in one embodiment of the present invention for characterizing a target paper traceability result.

FIG. 4 is a diagram illustrating the components of a technical document tracing apparatus in one or more embodiments of the invention.

FIG. 5 shows a schematic diagram of the internal structure of a computer device in one or more embodiments of the invention.

Detailed Description

The technical document tracing method and apparatus, the computer device, and the computer storage medium provided by the present invention are explained in detail below with reference to the drawings of the specification.

As shown in fig. 1, in conjunction with fig. 2, one or more embodiments of the present invention can provide a technical document tracing method that can form a complete technical document tracing tree for characterizing a target technical document tracing result, and the technical document tracing method can include, but is not limited to, at least one of the following steps.

Step 100, searching for and obtaining, based on a given target technical document, a plurality of reference technical documents that have an association relation with the target technical document. The target technical document can be a paper, the reference technical documents are papers directly cited by the paper and/or papers indirectly cited by the paper, and the association relation is a citation relation. The method can therefore be used to automatically analyze the origins and evolution of the ideas in an academic paper. Of course, the target technical document may also be another kind of technical document, such as a patent document or a periodical article.

Taking a paper as an example, the invention can find the papers cited by the target paper in a paper library. The search can proceed over the paper citation network as a breadth-first search starting from the target paper: after the search at a given depth is finished, if the number of papers obtained has reached the set number of cited papers, the search stops; otherwise the search goes one layer deeper, until the set number of cited papers is reached. In this embodiment, if the number of papers obtained exceeds the set number of cited papers, the method further includes a step of screening the papers cited by the target paper. The screening step comprises: rebuilding the candidate papers into a citation network and calculating a score for each paper, where the score can be computed by, but is not limited to, the web page ranking (PageRank) algorithm, or can be the number of times the paper is cited; then sorting the candidate papers by citation order and score, for example with papers of lower order first, papers of higher order later, and, among papers of the same order, papers with higher scores first; and finally selecting the set number of candidate papers from front to back according to the sorting result.
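The search and screening described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function names `collect_references` and `screen_candidates`, the dictionary-based citation graph, and the use of the in-candidate citation count as the score (one of the two scoring options mentioned above, chosen here instead of PageRank) are all hypothetical.

```python
def collect_references(citations, target, limit):
    """Breadth-first search over a citation graph, one depth level at a time.

    citations maps a paper id to the list of paper ids it cites.
    Returns {paper_id: order}, where order 1 means directly cited.
    """
    order = {target: 0}
    frontier = [target]
    depth = 0
    # Go one layer deeper only while fewer than `limit` papers were found.
    while frontier and len(order) - 1 < limit:
        depth += 1
        next_frontier = []
        for paper in frontier:
            for cited in citations.get(paper, []):
                if cited not in order:
                    order[cited] = depth
                    next_frontier.append(cited)
        frontier = next_frontier
    del order[target]
    return order


def screen_candidates(citations, order, limit):
    """Keep `limit` candidates: lower citation order first, then higher score."""
    candidates = set(order)
    # Assumed score: number of citations received from other candidates.
    score = {paper: 0 for paper in candidates}
    for paper in candidates:
        for cited in citations.get(paper, []):
            if cited in candidates and cited != paper:
                score[cited] += 1
    ranked = sorted(candidates, key=lambda p: (order[p], -score[p], p))
    return ranked[:limit]
```

With a hypothetical graph `{"q": ["a", "b"], "a": ["c", "b"], "b": ["c", "d"], "c": ["e"]}` and a limit of 3, the search stops after the second layer, and the screening keeps the two directly cited papers before the best-scored second-order paper.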

In the invention, the paper directly referenced by the paper forms a set R1, the paper indirectly referenced by the paper forms a set R2, and the paper indirectly referenced by the paper comprises second-order reference and higher-order reference. It is understood that the paper in set R2 is a paper referenced directly or indirectly by the paper in set R1. Some embodiments of the present invention can focus on an important paper having a closer relationship with a target paper, so that the number of cited papers in the finally formed paper tracing tree can be set to be 100 in a default configuration.

As shown in FIG. 2, the papers having an important influence on the target paper include papers directly cited by the target paper, such as GPT [Radford, 2018], ELMo [Peters, 2018] and GloVe [Pennington, 2014], and papers indirectly cited by the target paper, such as Seq2Seq [Sutskever, 2014].

Step 200, creating a feature vector of each technical document, wherein the feature vector is used for representing the text features of the technical document and the association features between different technical documents. The technical documents include the target technical document and the reference technical documents. The method converts the tracing process of the target technical document into a computation on feature vectors and obtains the paper tracing tree from these feature vectors. Taking a paper q as an example, the task can be abstracted as computing a paper tracing tree (V, E, C, W) for the paper q, where V denotes the nodes of the tracing tree, each node being a related cited paper; E denotes the edges of the tracing tree, each edge representing a potential route of the paper's evolution; C denotes several tracing routes, each route comprising some cited papers and a label representing their content; and W denotes the influence values (or scores), each cited paper having a value (or score) for its influence on the target paper.
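One possible in-memory representation of the tuple (V, E, C, W) is sketched below. The class name `TracingTree`, its field names, and the influence value used in the usage note are illustrative assumptions, not part of the described method.

```python
from dataclasses import dataclass, field


@dataclass
class TracingTree:
    """Container for the four parts (V, E, C, W) of a tracing tree."""
    root: str                                       # target paper q
    nodes: set = field(default_factory=set)         # V: cited papers
    edges: list = field(default_factory=list)       # E: (parent, child) pairs
    routes: dict = field(default_factory=dict)      # C: route label -> paper ids
    influence: dict = field(default_factory=dict)   # W: paper id -> influence value

    def add_paper(self, parent, paper, label, weight):
        """Attach a cited paper below `parent` on the route named `label`."""
        self.nodes.add(paper)
        self.edges.append((parent, paper))
        self.routes.setdefault(label, []).append(paper)
        self.influence[paper] = weight
```

For example, `TracingTree("BERT").add_paper("BERT", "GPT", "language models", 0.9)` records one node, one edge, one route entry and one (made-up) influence value.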

The step 200 of the present invention may include steps 201 to 203, that is, creating the feature vector of each technical document includes the following steps 201 to 203.

Step 201, extracting text data from each technical document and creating a text feature vector from the text data, wherein the text feature vector is used for representing the text features of the technical document. Creating the text feature vector from the text data comprises the following. A first vector is extracted from the text data using term frequency-inverse document frequency (TF-IDF): the invention can extract a sparse vector with TF-IDF and use it as the first vector. Stemming can be applied during text processing so that different forms of a word with the same meaning are reduced to one form, for example doing, done and did are unified into do. All words and phrases of a set length (e.g., 1 to 5 words) can also be extracted with an n-gram method, and only a set number (e.g., 2000) of the most frequent terms are used in the calculation. A second vector is extracted from the text data using Sentence-BERT (S-BERT, a sentence-level Bidirectional Encoder Representations from Transformers model): the text sequence is encoded to obtain the second vector, and during encoding a fixed number (e.g., 512) of words of the text data (e.g., the abstract) can be taken directly as input.

The text feature vector is created from the first vector and the second vector. It can be obtained by directly concatenating the first vector and the second vector, or the two vectors can be combined after processing such as denoising. The text feature vector expresses the content of the technical document at the language level; the invention thus takes the content of the technical document fully into account and mines the technical content in depth.
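The sparse part of this step can be sketched with the standard library alone. The sketch below is an assumption-laden illustration: the function names are hypothetical, whitespace tokenization replaces real preprocessing, stemming is omitted, the plain `log(N / df)` form of the inverse document frequency is used, and the dense S-BERT vector is taken as an input produced elsewhere rather than computed here.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """All word n-grams of length 1..max_n, joined with spaces."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, max_n=2, vocab_size=2000):
    """Return (vocab, vectors) using the `vocab_size` most frequent n-grams."""
    term_lists = [ngrams(doc.lower().split(), max_n) for doc in docs]
    total = Counter(t for terms in term_lists for t in terms)
    ranked = sorted(total.items(), key=lambda kv: (-kv[1], kv[0]))
    vocab = [term for term, _ in ranked[:vocab_size]]
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for terms in term_lists for t in set(terms))
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        length = max(len(terms), 1)
        vectors.append([(tf[t] / length) * math.log(len(docs) / df[t])
                        for t in vocab])
    return vocab, vectors


def text_feature(sparse_vec, dense_vec):
    """Direct concatenation of the first (sparse) and second (dense) vector."""
    return list(sparse_vec) + list(dense_vec)
```

A term that occurs in every document receives weight zero under this variant, so very common words do not dominate the sparse vector.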

Taking the paper as an example, the invention can extract text data from the title, abstract and other contents of the paper, and certainly is not limited to the title and abstract.

Step 202, creating a graph feature vector based on the text feature vector, the first association relation and the second association relation. The graph feature vector is used for representing the association features between different technical documents, and expresses the content of a technical document at the level of the mutual association structure among the reference technical documents. The first association relation is the association relation between the target technical document and each reference technical document, and the second association relation is the association relation between different reference technical documents. In particular, some embodiments of the invention can use a fast graph embedding method (ProNE) to obtain the graph feature vectors. In the first step, a fast singular value decomposition (SVD) is performed on the adjacency matrix of the graph, which represents the first association relation and the second association relation, to obtain an initial vector for each node of the graph. In the second step, the adjacency matrix of the graph is filtered in the spectral domain, and feature propagation is then performed over the filtered adjacency matrix starting from the initial vectors obtained in the first step; the propagated content can specifically be the content of the text feature vectors, which yields the graph feature vectors.

Step 203, creating the feature vector of each technical document according to the text feature vector and the graph feature vector. The feature vector of a technical document can be obtained through iterative computation, which can be understood as a form of feature propagation, for example implemented with a Propagate function (propagation function): the feature vector $x_o$ of a technical document is obtained by applying the propagation function to the text feature vector and the graph feature vector of that document.
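To give a feel for feature propagation over a citation graph, the sketch below averages each node's features with those of its neighbors for a few steps. It is a deliberately simplified stand-in and not ProNE: there is no SVD initialization and no spectral filtering, and the function name `propagate` and the dense list-of-lists adjacency matrix are assumptions made for illustration.

```python
def propagate(adjacency, features, steps=2):
    """Neighbor-averaging feature propagation.

    adjacency: square 0/1 matrix (list of lists) of the citation graph.
    features:  one feature list per node, e.g. its text feature vector.
    """
    n = len(adjacency)
    dim = len(features[0])
    # A self-loop lets every node keep part of its own feature.
    neighbors = [[j for j in range(n) if adjacency[i][j] or i == j]
                 for i in range(n)]
    current = [list(row) for row in features]
    for _ in range(steps):
        current = [
            [sum(current[j][d] for j in neighbors[i]) / len(neighbors[i])
             for d in range(dim)]
            for i in range(n)
        ]
    return current
```

After propagation, two papers that cite each other end up with similar vectors even if their text differs, while an isolated paper keeps its own features unchanged.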

Step 300, clustering the reference technical documents based on the feature vectors to form a plurality of document sets. The clustering of the reference technical documents based on the feature vectors may comprise: clustering the reference technical documents on the basis of their feature vectors, the first association relation and the second association relation. Specifically, the invention can adopt the kernel K-means clustering (Kernel K-means) algorithm to cluster the reference technical documents, with the Euclidean distance as the basis of clustering. The Euclidean distance of each point $x_i$ from the center $\mu_t$ of class $C_t$ can be expressed as:

$$\lVert \phi(x_i) - \mu_t \rVert^2 = K_{ii} - \frac{2}{|C_t|}\sum_{j \in C_t} K_{ij} + \frac{1}{|C_t|^2}\sum_{j,l \in C_t} K_{jl}$$

where the kernel value $K_{ij}$ is the sum of three terms. The first term is computed from the feature vectors $x_i$ and $x_j$; its goal is to have points that are close in Euclidean distance grouped into the same class. The second term is $\alpha A_{ij}$ (default $\alpha = 1.0$), where $A$ is the adjacency matrix and $A_{ij}$ indicates whether a citation relation exists between paper $p_i$ and paper $p_j$; this term reflects the spectral clustering idea adopted in this embodiment, and its goal is to have points with citation relations grouped into the same class as far as possible. The third term is $\beta_{ij}$ (default $\beta = 1.0$), an additional constraint term that can be set as the case requires.
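A small kernel K-means loop over such a combined kernel might look like the sketch below. Several points are assumptions rather than statements about the embodiment: the first kernel term is taken to be the inner product of the feature vectors, the constraint term is an optional matrix `beta` added entry by entry, and the function names and the caller-supplied initial labels are hypothetical.

```python
def build_kernel(X, A, alpha=1.0, beta=None):
    """K[i][j] = <x_i, x_j> + alpha * A[i][j] (+ beta[i][j] if given)."""
    n = len(X)
    K = [[sum(a * b for a, b in zip(X[i], X[j])) + alpha * A[i][j]
          for j in range(n)] for i in range(n)]
    if beta is not None:
        K = [[K[i][j] + beta[i][j] for j in range(n)] for i in range(n)]
    return K


def kernel_distance(K, i, members):
    """Squared feature-space distance from point i to the center of `members`."""
    m = len(members)
    cross = sum(K[i][j] for j in members) / m
    within = sum(K[j][l] for j in members for l in members) / (m * m)
    return K[i][i] - 2 * cross + within


def kernel_kmeans(K, labels, n_clusters, max_iter=100):
    """Reassign every point to its nearest center until labels stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        clusters = [[i for i, t in enumerate(labels) if t == c]
                    for c in range(n_clusters)]
        new_labels = []
        for i in range(len(K)):
            # An empty cluster can never be the nearest one.
            dists = [kernel_distance(K, i, c) if c else float("inf")
                     for c in clusters]
            new_labels.append(dists.index(min(dists)))
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

Because the adjacency term can make the kernel indefinite, individual distance values may come out negative; only their relative order is used for the assignment. For the same reason convergence is not guaranteed for every initialization, which is why the loop is capped by `max_iter`.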

Taking FIG. 2 as an example, QANet [Yu, 2018] and SQuAD [Rajpurkar, 2016] are respectively a model paper and a dataset paper in the field of machine reading. Both abstracts mention words related to machine reading comprehension, such as "reading", "question answering" and "SQuAD", and the two papers share many common neighbor nodes in the citation network, such as TriviaQA [Joshi, 2017] and U-Net [Sun, 2018]. QANet [Yu, 2018] and SQuAD [Rajpurkar, 2016] are therefore close in the feature space and are grouped into the same cluster. Similarly, papers in the field of machine translation, such as Attention, GoogleNMT and Seq2Seq, are also grouped into one category.

Step 400, for each document set, arranging the reference technical documents according to their time relationship to form a tracing route. Taking papers as an example, the papers are sorted by citation order and by time. As shown in FIG. 2 and FIG. 3, the papers of one category can be linked into two timelines: a main timeline composed of directly cited papers and a secondary timeline composed of indirectly cited papers. The most recently published paper on the secondary timeline is linked to the node of the same period on the main timeline, which connects all the papers of the category. The most recently published main-timeline paper of each category is then connected to the target paper, so that all cited papers and the target paper are connected into a paper tracing tree with the target paper as the root node and the cited papers as leaf nodes.

As shown in FIG. 2, Seq2Seq, being an indirectly cited paper, is not placed on the main timeline where Attention and GoogleNMT are located; it appears on the secondary timeline and is linked to the nearest main-timeline node, GoogleNMT. The most recently published papers of the three categories, QANet, GPT and Attention, are all connected to the target paper (taking the target paper "BERT" as an example), forming a complete tracing tree skeleton.
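The linking of one category into a main and a secondary timeline can be sketched as below. The function name `build_route`, the dictionary fields, and the rule that "same period" means the main-timeline paper with the closest publication year are assumptions; the publication years in the usage note are illustrative values chosen to reproduce the FIG. 2 example.

```python
def build_route(target, papers):
    """Return (parent, child) edges for one document set.

    papers: list of dicts with keys "id", "year" and "order"
            (order 1 = directly cited, larger = indirectly cited).
    """
    newest_first = sorted(papers, key=lambda p: p["year"], reverse=True)
    main = [p for p in newest_first if p["order"] == 1]
    secondary = [p for p in newest_first if p["order"] > 1]
    edges = []

    def chain(parent, timeline):
        # Link each paper to the next older one on the same timeline.
        for paper in timeline:
            edges.append((parent, paper["id"]))
            parent = paper["id"]

    # The newest main-timeline paper hangs directly below the target.
    chain(target, main)
    if secondary:
        if main:
            newest = secondary[0]
            # Assumed rule: attach to the main paper closest in time.
            anchor = min(main, key=lambda p: (abs(p["year"] - newest["year"]),
                                              p["year"]))["id"]
        else:
            anchor = target
        chain(anchor, secondary)
    return edges
```

Called with the target "BERT" and the papers Attention (2017, order 1), GoogleNMT (2016, order 1) and Seq2Seq (2014, order 2), the sketch connects BERT to Attention, Attention to GoogleNMT, and GoogleNMT to Seq2Seq.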

The invention can generate a label for each document set according to the keyword information of the reference technical documents in the set. The label may be, for example, category-related information of the cluster, such as a word that occurs with high frequency across the reference technical documents. In some embodiments of the invention, the label of a document set can be determined by selection based on word distribution and co-occurrence: during label selection, all the text of the reference technical documents in a category is regarded as a first word distribution, and each candidate label forms a second word distribution according to its co-occurrence with other words.

As shown in FIG. 2, the label of the category containing QANet and SQuAD is "reading comprehension", the label of the category containing Attention, GoogleNMT and Seq2Seq is "machine translation", and the label of the category containing GPT, ELMo and GloVe is "language models".
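The simpler labeling option mentioned above, picking a term that occurs in many documents of the set, can be sketched as follows. This does not implement the word-distribution co-occurrence selection; the function name `cluster_label`, the small stop-word list and the tie-breaking rule that prefers longer phrases are assumptions.

```python
from collections import Counter

STOPWORDS = frozenset({"a", "an", "the", "of", "with", "by", "for", "and", "in", "on"})


def cluster_label(texts, max_n=2):
    """Pick the n-gram that appears in the most documents of the set."""
    df = Counter()
    for text in texts:
        # Stop words are dropped before n-grams are formed (a simplification).
        tokens = [t for t in text.lower().split() if t not in STOPWORDS]
        terms = {" ".join(tokens[i:i + n])
                 for n in range(1, max_n + 1)
                 for i in range(len(tokens) - n + 1)}
        df.update(terms)
    # Highest document frequency first, then longer phrases, then alphabetical.
    return min(df, key=lambda t: (-df[t], -len(t.split()), t))
```

Preferring longer phrases on ties lets a phrase such as "machine translation" win over its component words when all three occur in every document.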

The invention can also calculate the influence value of each document set on the target technical document according to the feature vector of each technical document.

Specifically, the invention can use the kernel value $K_{ij}$ from the clustering process as the influence of a reference technical document $p_i$ on the target technical document $q$: the influence value of $p_i$ is $K_{i\,i_q}$, where $i_q$ denotes the subscript of the target technical document $q$. It can be understood that the invention can also calculate the influence value of each category on the basis of the influence values of the reference technical documents in that category. The influence value of each category describes the degree of influence of the corresponding tracing route on the target technical document.
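Reading the influence values off the kernel matrix can be sketched as below. The per-paper value follows the description above; averaging the paper values to obtain a category value is an assumption, since the aggregation rule is not spelled out here, and the function names and the numbers in the usage note are hypothetical.

```python
def paper_influence(K, target_index):
    """Influence of each reference document i on the target: K[i][i_q]."""
    return {i: K[i][target_index]
            for i in range(len(K)) if i != target_index}


def category_influence(influence, categories):
    """Assumed aggregation: mean influence of the papers in each category."""
    return {label: sum(influence[i] for i in members) / len(members)
            for label, members in categories.items()}
```

For a three-document kernel with the target at index 0, `paper_influence` returns the entries of the target's column for documents 1 and 2, and `category_influence` averages them per label; such values could then drive the color shade and line thickness of the routes.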

As shown in FIG. 2, some embodiments of the invention can represent the magnitude of the influence value by the shade of the color and the thickness of the tracing route. The tracing route of QANet and SQuAD under the "reading comprehension" label is thicker, indicating that this route has a greater influence on the target technical document. Attention and GoogleNMT under the "machine translation" label are marked in a darker color, indicating that these two technical documents have a greater influence on the target technical document.

Step 500, taking the target technical document as a root node and the reference technical documents as leaf nodes, and connecting the root node and the leaf nodes according to the tracing routes to generate a tree diagram representing the tracing result of the target technical document; the tree diagram can be understood as the skeleton of a complete tracing tree. The invention may further comprise: setting the corresponding label for each tracing route in the tree diagram, and marking each tracing route in the tree diagram according to its influence value.

FIG. 3 shows a tracing tree generated in one embodiment of the invention to represent the tracing result of a target technical document. FIG. 3 illustrates the tracing tree of the target paper "BERT": the paper at the top is "BERT", the cited papers form the "tree", and the papers cited by "BERT" are arranged from top to bottom in time order. Papers of different categories are divided into different tracing routes, and the influence of each paper and each route on the target paper can be obtained by calculation.

The invention can automatically find the pioneering papers that have an important influence on the target paper, perform traceability analysis of a given academic paper, organize the results into a clearly arranged tracing tree, and trace the sources in each related field direction.

As shown in fig. 4, the present invention can also provide a technical document tracing apparatus, which may include, but is not limited to, a document searching module, a vector creating module, a clustering module, a tracing route generating module, and a tree diagram generating module.

The document searching module is used for searching for and obtaining, based on a given target technical document, a plurality of reference technical documents that have an association relation with the target technical document. It is understood that the target technical document in the present invention may be, but is not limited to, a paper; the reference technical documents are then papers directly cited by the paper and/or papers indirectly cited by the paper, and the association relation is a citation relation. Of course, the target technical document may also be another kind of technical document, such as a patent document or a periodical article.

The vector creating module is used for creating a feature vector of each technical document, and the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents include a target technical document and a reference technical document.

The vector creating module specifically comprises a first creating submodule, a second creating submodule and a third creating submodule.

The first creating submodule is used for extracting text data from each technical document and creating a text feature vector from the text data, the text feature vector being used for representing the text features of the technical document. The first creating submodule is specifically used for extracting a first vector from the text data using term frequency-inverse document frequency (TF-IDF), extracting a second vector from the text data using Sentence-BERT (S-BERT), and creating the text feature vector from the first vector and the second vector. The second creating submodule is used for creating a graph feature vector based on the text feature vector, the first association relation and the second association relation, the graph feature vector being used for representing the association features between different technical documents.

The first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents. And the third creating sub-module is used for creating the feature vector of each technical document according to the text feature vector and the graph feature vector.

The clustering processing module is configured to cluster the reference technical documents based on the feature vectors to form a plurality of document sets. Specifically, the clustering processing module clusters the reference technical documents according to the feature vectors of the reference technical documents, the first association relationship and the second association relationship.
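The clustering step can be illustrated with a small sketch that groups reference documents whose feature vectors are similar, and gives an extra score to pairs that are linked by a citation. The concrete algorithm (thresholded cosine similarity plus connected components), the `threshold` and `link_bonus` values, and the function names are assumptions of this example; the first association relationship is assumed to be already reflected in the feature vectors, so only reference-to-reference links are passed in explicitly.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_references(vectors, ref_links, threshold=0.9, link_bonus=0.2):
    """Merge two reference documents into one set when their cosine similarity,
    plus a bonus if they cite one another, reaches `threshold`."""
    docs = sorted(vectors)
    linked = {frozenset(pair) for pair in ref_links}
    parent = {d: d for d in docs}

    def find(d):
        # Union-find lookup with path halving.
        while parent[d] != d:
            parent[d] = parent[parent[d]]
            d = parent[d]
        return d

    for i, a in enumerate(docs):
        for b in docs[i + 1:]:
            score = cosine(vectors[a], vectors[b])
            if frozenset((a, b)) in linked:
                score += link_bonus
            if score >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for d in docs:
        groups.setdefault(find(d), []).append(d)
    return sorted(groups.values())

vectors = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [0.6, 0.8]}
document_sets = cluster_references(vectors, ref_links=[("C", "D")])
print(document_sets)
```

Here A and B are merged on similarity alone, while C and D only reach the threshold because of the citation link between them.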

The technical document tracing device may further comprise a label generation module and an influence calculation module. The label generation module may be configured to generate a label for each document set according to keyword information of the reference technical documents in that document set. The influence calculation module is configured to calculate the influence value of each document set on the target technical document according to the feature vector of each technical document.
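The two modules above can be sketched as follows. Both rules are assumptions made for illustration: the label is taken to be the most frequent keywords in the set, and the influence value is taken to be the mean cosine similarity between the feature vectors of the set members and that of the target document. The names `set_label` and `influence_value` are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def set_label(doc_set, keywords, top_n=2):
    """Label a document set with its `top_n` most frequent keywords
    (ties broken alphabetically so the result is deterministic)."""
    counts = Counter(kw for doc in doc_set for kw in keywords[doc])
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ", ".join(kw for kw, _ in ranked[:top_n])

def influence_value(doc_set, vectors, target):
    """Assumed definition: mean cosine similarity to the target's vector."""
    sims = [cosine(vectors[doc], vectors[target]) for doc in doc_set]
    return sum(sims) / len(sims)

keywords = {
    "A": ["attention", "translation"],
    "B": ["attention", "transformer"],
}
vectors = {"T": [1.0, 0.0], "A": [1.0, 0.0], "B": [0.0, 1.0]}
label = set_label(["A", "B"], keywords)
influence = influence_value(["A", "B"], vectors, target="T")
print(label, influence)
```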

The tree diagram generating module is configured to take the target technical document as the root node and the reference technical documents as leaf nodes, connect the root node and the leaf nodes according to the tracing routes, and generate the tree diagram representing the tracing result of the target technical document. The tree diagram generating module may also be configured to set a corresponding label for each tracing route in the tree diagram, and to mark each tracing route in the tree diagram according to its influence value.
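A minimal sketch of the tree diagram follows, assuming each tracing route is supplied as a record with a label, an influence value and a list of documents ordered from the one nearest the root to the oldest. The nested-dictionary layout, the text rendering and the function names are assumptions of this example; a real implementation might instead vary line width or colour to mark influence.

```python
def build_tree(target, routes):
    """Root = target document; each route becomes one branch whose documents
    are chained in the given order, so the last document of a route is a leaf."""
    root = {"name": target, "children": []}
    for route in routes:
        parent = root
        for position, doc in enumerate(route["docs"]):
            node = {"name": doc, "children": []}
            if position == 0:
                # Attach the route's label and influence value to the branch head.
                node["tag"] = f"{route['label']}, influence={route['influence']:.2f}"
            parent["children"].append(node)
            parent = node
    return root

def render(node, depth=0):
    """Return the tree as indented text lines, one node per line."""
    tag = f" [{node['tag']}]" if "tag" in node else ""
    lines = ["  " * depth + node["name"] + tag]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines

routes = [
    {"label": "attention", "influence": 0.75, "docs": ["A2", "A1"]},
    {"label": "graphs", "influence": 0.4, "docs": ["G1"]},
]
tree = build_tree("T", routes)
lines = render(tree)
print("\n".join(lines))
```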

As shown in fig. 5, the present invention may provide a computer device including a memory and a processor. The memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the technical document tracing method in any embodiment of the present invention. The technical document tracing method may include, but is not limited to, at least one of the following steps.

Step 100, searching for and obtaining, based on a given target technical document, a plurality of reference technical documents having an association relationship with the target technical document. The target technical document may be a paper, the reference technical documents are papers directly cited by the paper and/or papers indirectly cited by the paper, and the association relationship is a citation relationship.

Step 200, creating a feature vector of each technical document, wherein the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents include the target technical document and the reference technical documents. Step 200 of the present invention may include steps 201 to 203, that is, creating the feature vector of each technical document includes the following.

Step 201, extracting text data from each technical document, and creating a text feature vector from the text data, wherein the text feature vector is used for representing text features of the technical document. Creating the text feature vector from the text data comprises: extracting a first vector from the text data using term frequency-inverse document frequency (TF-IDF), extracting a second vector from the text data using Sentence-BERT (sentence embeddings based on Bidirectional Encoder Representations from Transformers), and creating the text feature vector from the first vector and the second vector.

Step 202, creating a graph feature vector based on the text feature vector, the first association relationship and the second association relationship, wherein the graph feature vector is used for representing association features between different technical documents; the first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents.

Step 203, creating the feature vector of each technical document according to the text feature vector and the graph feature vector.

Step 300, clustering the reference technical documents based on the feature vectors to form a plurality of document sets. Clustering the reference technical documents based on the feature vectors may comprise: clustering the reference technical documents according to the feature vectors of the reference technical documents, the first association relationship and the second association relationship.

Step 400, for each document set, arranging the reference technical documents according to the time relationship to form a tracing route. A label may also be generated for each document set according to keyword information of the reference technical documents in that document set, and the influence value of each document set on the target technical document may be calculated according to the feature vector of each technical document.

Step 500, taking the target technical document as the root node and the reference technical documents as leaf nodes, connecting the root node and the leaf nodes according to the tracing routes, and generating a tree diagram representing the tracing result of the target technical document.

The technical document tracing method may further comprise: setting a corresponding label for each tracing route in the tree diagram, and marking each tracing route in the tree diagram according to the influence value.
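Among the steps above, step 400 amounts to ordering the documents of each set in time. The following minimal sketch assumes that the time relationship is given by publication dates and that routes run from the oldest to the newest document; the `published` mapping and the function name `tracing_route` are hypothetical.

```python
from datetime import date

def tracing_route(doc_set, published):
    """Order one document set chronologically (oldest first, assumed);
    the document id is used as a tie-breaker for equal dates."""
    return sorted(doc_set, key=lambda doc: (published[doc], doc))

published = {
    "A1": date(2015, 1, 1),
    "A2": date(2017, 6, 12),
    "A3": date(2014, 9, 1),
}
route = tracing_route(["A1", "A2", "A3"], published)
print(route)
```

Reversing the returned list gives the newest-first order, which may be more convenient when the route is attached below the target document in a tree.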

The present invention can also provide a computer storage medium having computer-readable instructions stored thereon which, when executed by a processor, implement the steps of the technical document tracing method in any embodiment of the present invention. The technical document tracing method may include, but is not limited to, at least one of the following steps.

Step 100, searching for and obtaining, based on a given target technical document, a plurality of reference technical documents having an association relationship with the target technical document. The target technical document may be a paper, the reference technical documents are papers directly cited by the paper and/or papers indirectly cited by the paper, and the association relationship is a citation relationship.

Step 200, creating a feature vector of each technical document, wherein the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents include the target technical document and the reference technical documents. Step 200 of the present invention may include steps 201 to 203, that is, creating the feature vector of each technical document includes the following.

Step 201, extracting text data from each technical document, and creating a text feature vector from the text data, wherein the text feature vector is used for representing text features of the technical document. Creating the text feature vector from the text data comprises: extracting a first vector from the text data using term frequency-inverse document frequency (TF-IDF), extracting a second vector from the text data using Sentence-BERT (sentence embeddings based on Bidirectional Encoder Representations from Transformers), and creating the text feature vector from the first vector and the second vector.

Step 202, creating a graph feature vector based on the text feature vector, the first association relationship and the second association relationship, wherein the graph feature vector is used for representing association features between different technical documents; the first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents.

Step 203, creating the feature vector of each technical document according to the text feature vector and the graph feature vector.

Step 300, clustering the reference technical documents based on the feature vectors to form a plurality of document sets. Clustering the reference technical documents based on the feature vectors may comprise: clustering the reference technical documents according to the feature vectors of the reference technical documents, the first association relationship and the second association relationship.

Step 400, for each document set, arranging the reference technical documents according to the time relationship to form a tracing route. A label may also be generated for each document set according to keyword information of the reference technical documents in that document set, and the influence value of each document set on the target technical document may be calculated according to the feature vector of each technical document.

Step 500, taking the target technical document as the root node and the reference technical documents as leaf nodes, connecting the root node and the leaf nodes according to the tracing routes, and generating a tree diagram representing the tracing result of the target technical document.

The technical document tracing method may further comprise: setting a corresponding label for each tracing route in the tree diagram, and marking each tracing route in the tree diagram according to the influence value.

The logic and/or steps represented in the flowcharts or otherwise described herein, which may be regarded, for example, as an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer cartridge (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). Additionally, the computer-readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. Alternatively, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having logic gate circuits for implementing a logic function on a data signal, an application-specific integrated circuit having appropriate combinational logic gate circuits, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

In the description herein, reference to the terms "the present embodiment," "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples described in this specification, and features of different embodiments or examples, can be combined by one skilled in the art provided that they do not contradict one another.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and simplifications made in the spirit of the present invention are intended to be included in the scope of the present invention.

Claims

1. A technical document tracing method is characterized by comprising the following steps:

searching for and obtaining, based on a given target technical document, a plurality of reference technical documents which have an association relationship with the target technical document;

creating a feature vector of each technical document, wherein the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents comprise the target technical document and the reference technical documents; the creating of the feature vector of each technical document comprises: extracting text data from each technical document, and creating a text feature vector by using the text data, wherein the text feature vector is used for representing text features of the technical document; creating a graph feature vector based on the text feature vector, a first association relationship and a second association relationship, wherein the graph feature vector is used for representing association features between different technical documents; wherein the first association relationship is an association relationship between the target technical document and each reference technical document, and the second association relationship is an association relationship between different reference technical documents; creating the feature vector of each technical document according to the text feature vector and the graph feature vector;

clustering the reference technical documents based on the feature vectors to form a plurality of document sets;

for each document set, arranging the reference technical documents according to a time relation to form a tracing route;

2. The technical document tracing method according to claim 1, further comprising:

respectively generating labels of all document sets according to the keyword information of the reference technical documents in the document sets;

3. The technical document tracing method according to claim 1, further comprising:

calculating the influence value of each document set on the target technical document according to the feature vector of each technical document;

4. The method of claim 1, wherein the creating text feature vectors using the text data comprises:

extracting a first vector from the text data using term frequency-inverse document frequency (TF-IDF);

extracting a second vector from the text data using Sentence-BERT (sentence embeddings based on Bidirectional Encoder Representations from Transformers);

creating the text feature vector from the first vector and the second vector.

5. The method of claim 4, wherein the clustering the reference technical documents based on the feature vectors comprises:

6. The method according to claim 5, wherein the target technical document is a paper, the reference technical document is a paper directly referenced by the paper and/or a paper indirectly referenced by the paper, and the association relationship is a reference relationship.

7. A technical document tracing apparatus, comprising:

a document searching module, configured to search for and obtain, based on a given target technical document, a plurality of reference technical documents which have an association relationship with the target technical document;

a vector creating module, configured to create a feature vector of each technical document, wherein the feature vector is used for representing text features of the technical documents and association features between different technical documents; the technical documents comprise the target technical document and the reference technical documents;

the vector creating module specifically comprises a first creating submodule, a second creating submodule and a third creating submodule;

the first creating submodule is used for extracting text data in each technical document and creating a text feature vector by using the text data, and the text feature vector is used for representing text features of the technical document;

the second creating submodule is used for creating a graph feature vector based on the text feature vector, a first association relationship and a second association relationship, and the graph feature vector is used for representing association features between different technical documents;

the first association relationship is the association relationship between the target technical document and each reference technical document, and the second association relationship is the association relationship between different reference technical documents;

the third creating submodule is used for creating the feature vector of each technical document according to the text feature vector and the graph feature vector;

the clustering processing module is used for clustering the reference technical documents based on the feature vectors to form a plurality of document sets;

the tracing route generating module is used for arranging the reference technical documents according to the time relationship for each document set to form a tracing route;

8. A computer device comprising a memory and a processor, the memory having stored therein computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the technical document tracing method according to any one of claims 1 to 6.

9. A computer storage medium having computer readable instructions stored thereon, which when executed by a processor, implement the steps of the technical document tracing method according to any one of claims 1 to 6.