CN112989790B - Document characterization method and device based on deep learning, equipment and storage medium - Google Patents


Info

Publication number
CN112989790B
CN112989790B (application CN202110287711.1A)
Authority
CN
China
Prior art keywords
document
text
feature vector
feature
inputting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110287711.1A
Other languages
Chinese (zh)
Other versions
CN112989790A (en)
Inventor
Zhanglin Cheng (程章林)
Zhiguang Yang (杨之光)
Oliver Martin Deussen (奥利夫·马丁·多伊森)
Guangfan Pan (潘光凡)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202110287711.1A priority Critical patent/CN112989790B/en
Publication of CN112989790A publication Critical patent/CN112989790A/en
Application granted granted Critical
Publication of CN112989790B publication Critical patent/CN112989790B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
        • G06F40/00 Handling natural language data
            • G06F40/20 Natural language analysis
                • G06F40/205 Parsing
                • G06F40/279 Recognition of textual entities
                    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
            • G06F40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
                • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/08 Learning methods

Abstract

The invention provides a deep-learning-based document characterization method, device, equipment, and storage medium. The method comprises the following steps: parsing the document to be characterized to obtain its keywords, author list, and several pieces of text information; inputting each piece of text information, together with the keywords, into a network model incorporating a keyword attention mechanism to obtain a first feature vector for each piece of text information; inputting the author list and each piece of text information in turn into a first feature extraction model to obtain second feature vectors for the author list and for each piece of text information; and inputting the first and second feature vectors into a fusion network model for fusion to obtain the characterization vector of the document to be characterized. The method makes full use of the keyword information, considers several kinds of text data of the document at once, and applies a different feature extraction method to each kind, thereby effectively improving the precision of the document's vectorized characterization.

Description

Document characterization method and device based on deep learning, equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a deep-learning-based document characterization method, device, equipment, and storage medium.
Background
The rapid growth in the number of documents poses a great challenge to researchers: how to quickly screen high-quality literature and how to quickly understand and analyze it are problems that urgently need to be solved. These problems are typically addressed by classifying, retrieving, and recommending documents and by automatically generating summaries, and document characterization (Paper Representation) is an indispensable first step in all of these document processing tasks. In short, document characterization generates a mathematical, vectorized representation for each document, converting the unstructured document data into a structured vector, so that each downstream document processing task can measure the similarity between documents through their vectors. How to represent documents with better vectors is therefore an important direction for improving the effectiveness of all kinds of document processing tasks.
The main existing vector characterization methods for documents are Author2Vec and Cite2Vec. Author2Vec constructs author vector representations from an author collaboration network and article abstracts, and the resulting paper representations can be used for paper classification, paper recommendation, and so on. Cite2Vec, based on Word2Vec, constructs vector representations of cited documents from the abstracts of the citing papers, and can be used for semantic representation and semantic retrieval of documents. However, neither method considers the document's other text data; the conversion from author vectors to document vectors in Author2Vec is crude and loses a large amount of information, while Cite2Vec extracts abstract information with Word2Vec and therefore cannot handle polysemous words.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a deep-learning-based document characterization method, device, equipment, and storage medium that consider several kinds of text data of a document at once and apply a different feature extraction method to each kind, which can effectively improve the precision of document vectorized characterization.
The specific technical scheme provided by the invention is as follows: a deep-learning-based document characterization method, comprising:
parsing the document to be characterized to obtain its keywords, author list, and a plurality of pieces of text information;
inputting each piece of text information, together with the keywords, into a network model incorporating a keyword attention mechanism to obtain a first feature vector corresponding to each piece of text information;
inputting the author list and each piece of text information in turn into a first feature extraction model to obtain second feature vectors corresponding to the author list and to each piece of text information;
and inputting the first and second feature vectors corresponding to each piece of text information, together with the second feature vector corresponding to the author list, into a fusion network model for fusion to obtain the characterization vector of the document to be characterized.
Further, the network model incorporating the keyword attention mechanism includes a second feature extraction model, a first pooling layer, a keyword feature extraction layer, and a second pooling layer, and inputting each piece of text information, together with the keywords, into the network model incorporating the keyword attention mechanism to obtain a first feature vector corresponding to each piece of text information includes:
inputting each piece of text information into the second feature extraction model to obtain a plurality of feature vectors corresponding to each piece of text information;
inputting the plurality of feature vectors corresponding to each piece of text information into the first pooling layer to obtain a pooled feature vector corresponding to each piece of text information;
inputting the plurality of feature vectors corresponding to each piece of text information, together with the keywords, into the keyword feature extraction layer to obtain, for each piece of text information, a plurality of feature vectors incorporating the keyword attention mechanism;
and inputting the pooled feature vector and the plurality of feature vectors incorporating the keyword attention mechanism corresponding to each piece of text information into the second pooling layer to obtain the first feature vector corresponding to each piece of text information.
Further, if the text information is the document body, the network model incorporating the keyword attention mechanism further includes a third pooling layer, and before each piece of text information is input to the second feature extraction model to obtain the corresponding plurality of feature vectors, the document characterization method further includes:
inputting the body into the third pooling layer to obtain the pooled body;
correspondingly, inputting each piece of text information into the second feature extraction model to obtain the corresponding plurality of feature vectors includes:
inputting the pooled body into the second feature extraction model to obtain the feature vectors corresponding to the body.
Further, inputting the author list and each piece of text information in turn into the first feature extraction model to obtain second feature vectors corresponding to the author list and to each piece of text information includes:
inputting the author list into the first feature extraction model to obtain the second feature vector corresponding to the author list;
segmenting each piece of text information to obtain a plurality of words corresponding to each piece of text information;
and inputting the plurality of words corresponding to each piece of text information into the first feature extraction model to obtain the second feature vector corresponding to each piece of text information.
Further, the first feature extraction model is a BM25 model, and the second feature extraction model is a BERT model.
Further, the fusion network model includes a first fusion layer, a splicing layer, a deep learning model, a second fusion layer, a third feature extraction model, and a fully connected layer, and inputting the first and second feature vectors corresponding to each piece of text information, together with the second feature vector corresponding to the author list, into the fusion network model for fusion to obtain the characterization vector of the document to be characterized includes:
inputting the second feature vectors corresponding to the pieces of text information into the first fusion layer for fusion to obtain a fused feature vector;
inputting the fused feature vector, the second feature vector corresponding to the author list, and the first feature vector corresponding to each piece of text information into the splicing layer for splicing to obtain multi-channel feature parameters;
inputting the multi-channel feature parameters into the deep learning model to obtain a third feature vector;
inputting the title and the abstract from the text information into the second fusion layer for synthesis to obtain a synthetic document;
inputting the synthetic document into the third feature extraction model to obtain a fourth feature vector;
and inputting the third feature vector and the fourth feature vector into the fully connected layer to obtain the characterization vector of the document to be characterized.
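The fusion steps above can be traced with a runnable toy example. In the sketch below every layer is a trivial stand-in (mean fusion for the first fusion layer, plain concatenation for the splicing layer, and a single made-up fully connected layer with a 0.5 weight followed by ReLU for the deep learning model), so only the order of operations, not the actual learned models, matches the patent; the second fusion layer, third feature extraction model, and final fully connected layer are elided.

```python
def mean_fuse(vectors):
    """First fusion layer stand-in: element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def concat(*vectors):
    """Splicing layer stand-in: concatenate the vectors into multi-channel parameters."""
    return [x for v in vectors for x in v]

def relu_mlp(vec):
    """Deep learning model stand-in: one toy fully connected layer (scale by 0.5) + ReLU."""
    return [max(0.0, x * 0.5) for x in vec]

# made-up second feature vectors (title, abstract, authors) and first feature vectors
second_title, second_abstract, second_authors = [1.0, 3.0], [3.0, 1.0], [2.0, 2.0]
first_title, first_abstract = [0.5, 0.5], [1.5, -1.5]

fused = mean_fuse([second_title, second_abstract])
multi_channel = concat(fused, second_authors, first_title, first_abstract)
third = relu_mlp(multi_channel)

assert fused == [2.0, 2.0]
assert third == [1.0, 1.0, 1.0, 1.0, 0.25, 0.25, 0.75, 0.0]
```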
Further, the deep learning model comprises a plurality of fully connected layers and a plurality of activation functions, cascaded alternately in sequence.
The invention also provides a deep-learning-based document characterization device, comprising:
a parsing module for parsing the document to be characterized to obtain its keywords, author list, and a plurality of pieces of text information;
a first feature extraction module for inputting each piece of text information, together with the keywords, into a network model incorporating a keyword attention mechanism to obtain a first feature vector corresponding to each piece of text information;
a second feature extraction module for inputting the author list and each piece of text information in turn into a first feature extraction model to obtain second feature vectors corresponding to the author list and to each piece of text information;
and a fusion module for inputting the first and second feature vectors corresponding to each piece of text information, together with the second feature vector corresponding to the author list, into a fusion network model for fusion to obtain the characterization vector of the document to be characterized.
The invention also provides an apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing any of the document characterization methods above when executing the computer program.
The invention also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, implement any of the document characterization methods above.
The document characterization method provided by the invention inputs each piece of text information, together with the keywords, into a network model incorporating a keyword attention mechanism to obtain first feature vectors that incorporate keyword attention; it then inputs the author list and each piece of text information in turn into a first feature extraction model to obtain second feature vectors; finally, it fuses the first and second feature vectors. The method thus makes full use of the keyword information, considers several kinds of text data of the document at once, and applies a different feature extraction method to each kind, thereby effectively improving the precision of document vectorized characterization.
Drawings
The technical solution and other advantages of the present invention will become apparent from the following detailed description of specific embodiments, read in conjunction with the accompanying drawings.
FIG. 1 is a schematic illustration of a document characterization method in an embodiment of the present application;
FIG. 2 is a schematic diagram of a network model incorporating a keyword attention mechanism in an embodiment of the present application;
FIG. 3 is another diagram of a network model incorporating keyword attention mechanism in an embodiment of the present application;
FIG. 4 is a schematic diagram of a converged network model in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a document characterization device in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus in an embodiment of the present application.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. This invention may, however, be embodied in many different forms and should not be construed as limited to the specific embodiments set forth herein. Rather, these embodiments are provided to explain the principles of the invention and its practical application to thereby enable others skilled in the art to understand the invention for various embodiments and with various modifications as are suited to the particular use contemplated. In the drawings, like numbering will be used to refer to like elements throughout.
A document is itself a collection of various kinds of data, which can be roughly divided into text data and image data. The text data include the title, author list, keywords, abstract, body, and references, while the image data mainly consist of the paper's figures. These diverse forms of data make processing difficult: text and images are unstructured information from which structured semantic representations must be extracted, and the precision of this step is a major difficulty faced by document vector representation. Author2Vec and Cite2Vec are the commonly used document vector characterization methods at present, but neither considers the diversity of text data, which causes information loss in the document's vector characterization and reduces the accuracy of document vectorized characterization.
To address these problems, a deep-learning-based document characterization method is provided that considers several kinds of text data of a document at once and applies a different feature extraction method to each kind, which can effectively improve the precision of document vectorized characterization. Specifically, the method first parses the document to be characterized to obtain its keywords, author list, and a plurality of pieces of text information; it then inputs each piece of text information, together with the keywords, into a network model incorporating a keyword attention mechanism to obtain a first feature vector corresponding to each piece of text information; it inputs the author list and each piece of text information in turn into a first feature extraction model to obtain second feature vectors corresponding to the author list and to each piece of text information; finally, it inputs the first and second feature vectors corresponding to each piece of text information, together with the second feature vector corresponding to the author list, into a fusion network model for fusion to obtain the characterization vector of the document to be characterized.
The document characterization method provided by the present application makes full use of the keyword information, considers several kinds of text data of the document at once, and applies a different feature extraction method to each kind, thereby effectively improving the precision of document vectorized characterization.
In the following, text data consisting of an author list, keywords, a title, an abstract, and a body are taken as an example, and the deep-learning-based document characterization method, device, equipment, and storage medium of the present application are described in detail through specific embodiments with reference to the accompanying drawings. It should be noted that this choice of text data is only an example and does not limit the document characterization method of the present application, which can also be applied to other text data.
Referring to fig. 1, the document characterization method provided in this embodiment includes the following steps:
S1, parsing the document to be characterized to obtain its keywords, author list, and a plurality of pieces of text information;
S2, inputting each piece of text information, together with the keywords, into a network model incorporating a keyword attention mechanism to obtain a first feature vector corresponding to each piece of text information;
S3, inputting the author list and each piece of text information in turn into a first feature extraction model to obtain second feature vectors corresponding to the author list and to each piece of text information;
and S4, inputting the first and second feature vectors corresponding to each piece of text information, together with the second feature vector corresponding to the author list, into a fusion network model for fusion to obtain the characterization vector of the document to be characterized.
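The four steps can be sketched end to end as a runnable toy. In the sketch below, the BERT-based keyword-attention network, the BM25 bag-of-words model, and the learned fusion network are all replaced by a trivial hash-based embedding and plain concatenation; `toy_vector`, `parse_document`, and `characterize` are inventions of this sketch and only the S1-S4 data flow matches the patent.

```python
def parse_document(doc):
    """S1 stand-in: split an already-parsed dict into keywords, authors, text fields."""
    return doc["keywords"], doc["authors"], {
        k: doc[k] for k in ("title", "abstract", "body")
    }

def toy_vector(text, dim=4):
    """Stand-in for every feature extractor: a deterministic (per-run) toy embedding."""
    v = [0.0] * dim
    for i, w in enumerate(text.split()):
        v[i % dim] += (hash(w) % 100) / 100.0
    return v

def characterize(doc):
    keywords, authors, texts = parse_document(doc)
    # S2: one "first" feature vector per text field (keyword-aware in the patent)
    first = {k: toy_vector(t + " " + " ".join(keywords)) for k, t in texts.items()}
    # S3: one "second" feature vector for the author list and each text field
    second = {"authors": toy_vector(" ".join(authors))}
    second.update({k: toy_vector(t) for k, t in texts.items()})
    # S4: fuse by plain concatenation (a learned fusion network in the patent)
    vecs = list(first.values()) + list(second.values())
    return [x for v in vecs for x in v]

doc = {"keywords": ["pooling"], "authors": ["A", "B"],
       "title": "a title", "abstract": "an abstract", "body": "the body text"}
# 3 first vectors + 4 second vectors, each of length 4
assert len(characterize(doc)) == 7 * 4
```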
Before parsing the document to be characterized, the document characterization method in this embodiment first loads it. Loading mainly performs format conversion, that is, converts the document into a predetermined data format: the file path of the document is obtained first, and the document is then read according to that path and the predetermined data format, the file reading being implemented with a standard file stream in Python.
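A minimal sketch of this loading step follows. The patent specifies only that a standard Python file stream is used; the binary read mode is an assumption appropriate for PDF data, and `load_document` is an illustrative name.

```python
from pathlib import Path

def load_document(file_path):
    """Resolve the file path and read the document with a standard file stream."""
    path = Path(file_path)
    with open(path, "rb") as f:  # binary mode: PDF data is not plain text
        return f.read()
```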
After the document to be characterized is loaded, the method proceeds to step S1: the loaded document is parsed to obtain its text data and image data. The text data comprise the keywords, the author list, and a plurality of pieces of text information, the latter exemplified here by the title, the abstract, and the body. Specifically, the PDFMiner library in Python is used to parse the different types of data according to their characteristics, yielding the keywords, author list, title, abstract, and body of the document to be characterized; the parsed text data and image data are then stored in corresponding formats for subsequent use.
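The patent parses PDFs with Python's PDFMiner library; the sketch below skips the PDF extraction itself and only illustrates the downstream splitting of already-extracted text into fields. The `Authors:`/`Keywords:` markers are purely illustrative assumptions about the paper layout, not something the patent specifies.

```python
import re

def split_fields(raw_text):
    """Pull illustrative author and keyword fields out of extracted plain text."""
    fields = {}
    m = re.search(r"Keywords:\s*(.+)", raw_text)
    fields["keywords"] = [k.strip() for k in m.group(1).split(",")] if m else []
    m = re.search(r"Authors:\s*(.+)", raw_text)
    fields["authors"] = [a.strip() for a in m.group(1).split(",")] if m else []
    return fields

raw = "Authors: Alice, Bob\nKeywords: pooling, attention"
f = split_fields(raw)
assert f["keywords"] == ["pooling", "attention"]
assert f["authors"] == ["Alice", "Bob"]
```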
Document data differ from other text data in that they include keywords, which are the basic units for expressing users' information needs and for retrieving topical content; keywords therefore play a very important role in natural language processing. In step S2, the document characterization method of this embodiment combines the keywords with the other text information when extracting the feature vector of each piece of text information, so as to obtain, for each piece of text information, a first feature vector that incorporates a keyword attention mechanism.
Referring to fig. 2, the network model incorporating the keyword attention mechanism in this embodiment includes a second feature extraction model 11, a first pooling layer 12, a keyword feature extraction layer 13, and a second pooling layer 14, and step S2 specifically includes:
S21, inputting each piece of text information into the second feature extraction model to obtain a plurality of feature vectors corresponding to each piece of text information;
S22, inputting the plurality of feature vectors corresponding to each piece of text information into the first pooling layer to obtain a pooled feature vector corresponding to each piece of text information;
S23, inputting the plurality of feature vectors corresponding to each piece of text information, together with the keywords, into the keyword feature extraction layer to obtain, for each piece of text information, a plurality of feature vectors incorporating the keyword attention mechanism;
and S24, inputting the pooled feature vector and the plurality of feature vectors incorporating the keyword attention mechanism corresponding to each piece of text information into the second pooling layer to obtain the first feature vector corresponding to each piece of text information.
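Steps S22 to S24 can be traced with a runnable toy. Below, the per-word vectors stand in for the BERT outputs of S21 (the values are made up), keyword membership stands in for the keyword feature extraction layer, and both pooling layers use the element-wise mean the patent calls average pooling.

```python
def mean_pool(vectors):
    """Element-wise average of equal-length vectors (average pooling)."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def keyword_attention_feature(words, word_vecs, keywords):
    pooled = mean_pool(word_vecs)                                      # S22
    kw_vecs = [v for w, v in zip(words, word_vecs) if w in keywords]   # S23: m < N vectors
    return mean_pool([pooled] + kw_vecs)                               # S24

words = ["deep", "pooling", "layer"]
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for the S21 outputs
out = keyword_attention_feature(words, vecs, {"pooling"})
# pooled is [2/3, 2/3]; averaging it with the keyword vector [0, 1] gives [1/3, 5/6]
assert abs(out[0] - 1 / 3) < 1e-12 and abs(out[1] - 5 / 6) < 1e-12
```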
In other embodiments, the pooled feature vector corresponding to each piece of text information may be obtained before the feature vectors incorporating the keyword attention mechanism, or after them, or both may be obtained at the same time; that is, steps S22 and S23 are not ordered with respect to each other.
In this embodiment, the second feature extraction model 11 is a context-based semantic feature model, preferably a BERT (Bidirectional Encoder Representations from Transformers) model, through which feature vectors representing contextual semantic information can be obtained.
Specifically, before step S21, the second feature extraction model 11 needs to be pre-trained: training data are assembled from the documents in a scientific and technical literature library, the second feature extraction model 11 is pre-trained on these data and on the contextual corpus of the library to obtain new model parameters, and the initial second feature extraction model 11 is updated with the new parameters to obtain the trained second feature extraction model 11.
After the trained second feature extraction model 11 is obtained, each piece of text information is input to it, yielding a plurality of feature vectors per piece of text information. Each piece of text information consists of a number of words, and inputting those words into the second feature extraction model 11 produces one feature vector per word. Taking the abstract as an example, if the abstract contains N words, inputting it into the second feature extraction model 11 produces N feature vectors of length L, in one-to-one correspondence with the N words.
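The shape produced in step S21 can be illustrated with a toy embedder. The function below is a stand-in for the BERT-based model, only showing that a text of N words maps to N vectors of length L; `EMB_LEN` is an illustrative choice (L would be 768 for BERT-base).

```python
EMB_LEN = 4  # embedding length L; an illustrative value (768 for BERT-base)

def embed(words):
    """Stand-in for the second feature extraction model: one length-L vector per word."""
    return [[float(j == len(w) % EMB_LEN) for j in range(EMB_LEN)] for w in words]

abstract_words = "document characterization via deep learning".split()
vectors = embed(abstract_words)
assert len(vectors) == len(abstract_words)      # N vectors for N words...
assert all(len(v) == EMB_LEN for v in vectors)  # ...each of length L
```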
In step S22, to reduce the dimensionality of the data and improve processing efficiency, this embodiment places a first pooling layer 12 after the second feature extraction model 11 and pools the N feature vectors it outputs. To avoid losing too much semantic information, the first pooling layer 12 uses average pooling: the N length-L feature vectors are averaged element-wise into a single length-L feature vector, giving the pooled feature vector corresponding to each piece of text information.
In step S23, to obtain feature vectors that incorporate the keyword attention mechanism, this embodiment places a keyword feature extraction layer 13 after the second feature extraction model 11. The feature vectors obtained in step S21 for each piece of text information are input into the keyword feature extraction layer 13 together with the keywords, and the layer screens out the m feature vectors (m < N) that correspond to keywords and outputs them as that piece of text information's feature vectors incorporating the keyword attention mechanism.
Similarly, in step S24, to reduce dimensionality and improve processing efficiency, a second pooling layer 14 is placed after the first pooling layer 12 and the keyword feature extraction layer 13. It pools the pooled feature vector output by the first pooling layer 12 together with the m keyword-attention feature vectors output by the keyword feature extraction layer 13; to avoid losing too much semantic information, the second pooling layer 14 also uses average pooling. The result is, for each piece of text information, the first feature vector incorporating the keyword attention mechanism.
Referring to fig. 3, since the body is much longer than the other text information, and to prevent its length from exceeding the processing capacity of the second feature extraction model, the network model incorporating the keyword attention mechanism in this embodiment further includes a third pooling layer 15. The body is input into the third pooling layer 15, which processes it with a sliding window and pools it to obtain the pooled body; to avoid excessive loss of semantic information, the third pooling layer 15 also uses average pooling. The pooled body is then input to the second feature extraction model 11, and the method proceeds to step S21.
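One plausible realization of the window-sliding step is sketched below: the long word sequence is chunked into overlapping fixed-size windows so that each piece fits the downstream model's input limit. The window and stride values are illustrative; the patent does not specify them, nor exactly how the windows are subsequently pooled.

```python
def sliding_windows(words, window=4, stride=2):
    """Chunk a long word sequence into overlapping fixed-size windows."""
    return [words[i:i + window]
            for i in range(0, max(1, len(words) - window + 1), stride)]

body_words = "one two three four five six".split()
chunks = sliding_windows(body_words)
assert chunks == [["one", "two", "three", "four"],
                  ["three", "four", "five", "six"]]
```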
To avoid the loss of information, and the resulting loss of characterization precision, that comes from vectorizing different text information with the same vector characterization algorithm, step S3 characterizes the author list and the plurality of pieces of text information with a first feature extraction model, obtaining second feature vectors corresponding to the author list and to each piece of text information.
Specifically, step S3 includes:
s31, inputting the author list into the first feature extraction model to obtain a second feature vector corresponding to the author list;
s32, segmenting each text message in the plurality of text messages to obtain a plurality of words corresponding to each text message;
and S33, respectively inputting the plurality of words corresponding to each text message into the first feature extraction model to obtain a second feature vector corresponding to each text message.
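The bag-of-words feature of steps S31 to S33 can be sketched in the spirit of BM25: the text is segmented into words, counted against a fixed vocabulary, and the counts are passed through BM25-style term-frequency saturation. The vocabulary, the `K1` value, and the omission of BM25's document-length and IDF terms are all simplifying assumptions of this sketch.

```python
VOCAB = ["document", "deep", "learning", "vector"]  # illustrative fixed vocabulary
K1 = 1.5                                            # the usual BM25 saturation parameter

def bow_feature(words):
    """Count each vocabulary term, then apply BM25-style term-frequency saturation."""
    counts = [words.count(t) for t in VOCAB]
    # tf * (k1 + 1) / (tf + k1): repeated terms gain less and less weight
    return [c * (K1 + 1) / (c + K1) if c else 0.0 for c in counts]

feat = bow_feature("deep learning for document vector deep".split())
assert feat[0] == 1.0     # "document" occurs once: 1 * 2.5 / 2.5
assert feat[1] > feat[0]  # "deep" occurs twice and saturates higher
```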
In other embodiments, the second feature vectors corresponding to the pieces of text information may be obtained before the second feature vector corresponding to the author list, or all of them may be obtained at the same time; that is, step S31 and steps S32 to S33 are not ordered with respect to each other.
Before step S31, the first feature extraction model needs to be pre-trained: it is pre-trained on the training data to obtain new model parameters, and the initial first feature extraction model is updated with these parameters to obtain the trained first feature extraction model. The author list and each piece of text information are then input in turn into the first feature extraction model, yielding the second feature vectors corresponding to the author list and to each piece of text information.
The bag-of-words model is a foundation of natural language processing. In a bag-of-words model, text information is regarded as a set of words and the positional relationships among the words are ignored; that is, the bag-of-words model does not consider the positional sequence of the words, which makes it a good complement to a model based on context semantic features. The first feature extraction model in this embodiment is therefore a bag-of-words model, preferably a BM25 (Best Match 25) model, and the bag-of-words features of each piece of text information can be obtained through the first feature extraction model.
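For illustration only, the BM25 term weighting mentioned above can be sketched as follows; the function name `bm25_vector`, the parameter defaults, and the toy corpus are assumptions of this sketch, not details of the embodiment:

```python
import math
from collections import Counter

def bm25_vector(doc_tokens, corpus, k1=1.5, b=0.75):
    """Sketch of a BM25 bag-of-words vector for one tokenized document.

    corpus is a list of token lists; it supplies the document frequencies
    and the average document length used by the BM25 formula.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    df = Counter()                      # document frequency of each term
    for d in corpus:
        for t in set(d):
            df[t] += 1
    tf = Counter(doc_tokens)            # term frequency in this document
    dl = len(doc_tokens)
    vec = {}
    for t, f in tf.items():
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        vec[t] = idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return vec
```

Terms that are rare in the corpus receive larger weights than ubiquitous terms, which is what makes the resulting bag-of-words feature discriminative.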
It should be noted that, since the author list is an ordered word set whose features do not depend on context semantic information, in this embodiment the author list only needs to undergo feature extraction through the first feature extraction model, so that an appropriate feature extraction method is applied to each kind of text information. When obtaining the bag-of-words features of the author list, the first author and the corresponding author of the document to be characterized are weighted more heavily than the other authors; in this embodiment, their weight is 2 times that of the other authors, that is, the weight coefficient of the first author and the corresponding author relative to the other authors is 2. Of course, this weight coefficient may be adjusted according to actual needs and is not limited to 2.
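The author weighting described above can be sketched as follows; the author names and the exact handling of the weight coefficient are illustrative assumptions of this sketch:

```python
from collections import Counter

def author_bag_of_words(authors, first_author, corresponding_author, coeff=2.0):
    """Weighted bag-of-words over an author list: the first author and the
    corresponding author receive `coeff` times the weight of the others."""
    counts = Counter()
    for name in authors:
        w = coeff if name in (first_author, corresponding_author) else 1.0
        counts[name] += w
    return dict(counts)

# Hypothetical author list: "Alice" is the first author, "Carol" the corresponding author.
vec = author_bag_of_words(["Alice", "Bob", "Carol"],
                          first_author="Alice", corresponding_author="Carol")
```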
In steps S32 to S33, before the bag-of-words features are obtained, the text information, such as the title, the abstract, and the body, needs to be segmented to obtain a plurality of words corresponding to each piece of text information. Preferably, this embodiment adopts a method based on the N-gram assumption to segment the title, the abstract, and the body. In the bag-of-words model, the dimension of the output bag-of-words feature vector must also be considered. Because the title and the abstract are more important than the body when the document is vectorized, the dimension of the bag-of-words feature vector of the body is set equal to that of the title and the abstract; that is, the dimension of the bag-of-words feature vectors of the title and the abstract is taken as the dimension of the bag-of-words feature vector of the body, which avoids a dimension mismatch among the three vectors in the subsequent fusion process.
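Word segmentation under the N-gram assumption might look like the following sketch; word bigrams are used purely as an example, since the embodiment does not fix a value of n:

```python
def ngram_segment(text, n=2):
    """Split text into overlapping word n-grams (bigrams by default)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        # shorter texts yield a single (possibly empty) segment list
        return [" ".join(tokens)] if tokens else []
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```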
In other embodiments, the second feature vectors corresponding to the author list and to each piece of text information may be obtained first and the first feature vector corresponding to each piece of text information obtained afterwards, or the first and second feature vectors may be obtained at the same time; that is, steps S2 and S3 may be performed in any order.
After the first feature vector and the second feature vector of each piece of text information and the second feature vector of the author list are obtained, these feature vectors need to be fused, so that the feature vectors of the document's several pieces of text data, obtained with the aid of the keyword information, are all taken into account at once, which effectively improves the precision of the vectorized characterization of the document.
Specifically, referring to fig. 4, the fusion network model in this embodiment includes a first fusion layer 21, a splicing layer 22, a deep learning model 23, a second fusion layer 24, a third feature extraction model 25, and a full connection layer 26, and step S4 includes:
S41, inputting the second feature vector corresponding to each piece of text information into the first fusion layer 21 for fusion to obtain a fused feature vector;
S42, inputting the fused feature vector, the second feature vector corresponding to the author list, and the first feature vector corresponding to each piece of text information into the splicing layer 22 for splicing to obtain a multi-channel feature parameter;
S43, inputting the multi-channel feature parameter into the deep learning model 23 to obtain a third feature vector;
S44, inputting the title and the abstract in the text information into the second fusion layer 24 for synthesis to obtain a synthetic document;
S45, inputting the synthetic document into the third feature extraction model 25 to obtain a fourth feature vector;
and S46, inputting the third feature vector and the fourth feature vector into the full connection layer 26 to obtain a characterization vector of the document to be characterized.
In other embodiments, the fourth feature vector may be obtained first and then the third feature vector, or the third feature vector and the fourth feature vector may be obtained simultaneously; that is, steps S41 to S43 and steps S44 to S45 may be performed in any order.
In steps S41 to S43, the second feature vector of each piece of text information obtained in step S3 is input into the first fusion layer 21 for weighted addition to obtain a fused feature vector. The weight of each piece of text information may be set according to actual needs; for example, if the importance of the title, the abstract, and the body decreases in that order, the corresponding weights also decrease in that order, so that the fused feature vector contains more of the title's information. The fused feature vector, the second feature vector corresponding to the author list, and the first feature vectors of the title, the abstract, and the body are then input into the splicing layer 22 for splicing, which yields a multi-channel feature parameter that simultaneously contains the context semantic features of the title, the abstract, and the body combined with the keyword attention mechanism, their bag-of-words features, and the bag-of-words features of the author list. The multi-channel feature parameter is then input into the deep learning model 23. The deep learning model 23 in this embodiment includes a plurality of fully connected layers and a plurality of activation functions, which are alternately cascaded in sequence. Before step S4, the deep learning model 23 also needs to be pre-trained; the multi-channel feature parameter is then input into the trained deep learning model 23.
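Steps S41 to S43 can be sketched as follows; the random weights stand in for the trained parameters of the deep learning model 23, and the function signature is an assumption of the sketch rather than part of the embodiment:

```python
import numpy as np

def fuse_and_project(bow_vecs, fusion_weights, author_vec, semantic_vecs,
                     layer_dims, rng=None):
    """Sketch of steps S41-S43: weighted fusion of the bag-of-words vectors,
    splicing with the author and context-semantic vectors, then a stack of
    alternating fully connected layers and activation functions."""
    if rng is None:
        rng = np.random.default_rng(0)   # stand-in for trained parameters
    fused = sum(w * v for w, v in zip(fusion_weights, bow_vecs))   # S41: weighted addition
    x = np.concatenate([fused, author_vec, *semantic_vecs])        # S42: splicing layer
    for dim in layer_dims:                                         # S43: FC + activation
        W = 0.1 * rng.standard_normal((dim, x.shape[0]))
        x = np.maximum(W @ x, 0.0)                                 # ReLU activation
    return x
```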
Since the title and the abstract are the parts on which the document characterization mainly depends, in this embodiment, in steps S44 to S45, the feature vectors of the title and the abstract are re-extracted. Specifically, the title and the abstract are input into the second fusion layer 24 for synthesis to obtain a synthetic document, and the synthetic document is then input into the third feature extraction model 25 to obtain a fourth feature vector. Preferably, the third feature extraction model 25 is a model based on context semantic features, preferably a BERT (Bidirectional Encoder Representations from Transformers) model, through which the fourth feature vector based on the context semantic information of the title and the abstract can be obtained.
Finally, in step S46, the third feature vector and the fourth feature vector are input into the full connection layer 26, which splices them to obtain the characterization vector of the document to be characterized.
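Step S46 reduces to a splice followed by one linear projection; here `W` and `bias` stand in for the trained parameters of the full connection layer 26:

```python
import numpy as np

def final_characterization(third_vec, fourth_vec, W, bias):
    """Sketch of step S46: splice the third and fourth feature vectors and
    project them through one fully connected layer to obtain the
    characterization vector of the document."""
    spliced = np.concatenate([third_vec, fourth_vec])
    return W @ spliced + bias
```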
After the characterization vector of the document to be characterized is obtained by the document characterization method in this embodiment, the characterization vector is stored in the csv format, so that it can be called by each subsequent document analysis task.
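Storing the characterization vector in csv format can be as simple as the following sketch; the column layout (a document identifier followed by the vector components) is an assumption, since the embodiment does not specify one:

```python
import csv

def save_characterization(path, doc_id, vector):
    """Append one row per document: the identifier followed by the
    characterization vector's components, for later analysis tasks."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([doc_id] + [f"{x:.6f}" for x in vector])
```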
Referring to fig. 5, the present embodiment further provides a document characterization apparatus corresponding to the document characterization method, which includes an analysis module 31, a first feature extraction module 32, a second feature extraction module 33, and a fusion module 34.
Specifically, the parsing module 31 is configured to parse the document to be characterized to obtain a keyword, an author list, and a plurality of text information of the document to be characterized. The first feature extraction module 32 is configured to input each piece of text information in the plurality of text information and the keyword into the network model combined with the keyword attention mechanism to obtain the first feature vector corresponding to each piece of text information. The second feature extraction module 33 is configured to sequentially input the author list and each piece of text information into the first feature extraction model to obtain the second feature vectors corresponding to the author list and to each piece of text information respectively. The fusion module 34 is configured to input the first feature vector and the second feature vector corresponding to each piece of text information and the second feature vector corresponding to the author list into the fusion network model for fusion to obtain the characterization vector of the document to be characterized.
Referring to fig. 6, the present embodiment provides an apparatus, which includes a memory 100, a processor 200, and a network interface 202, where the memory 100 stores a computer program thereon, and the processor 200 executes the computer program to implement the document characterization method in the present embodiment.
The Memory 100 may include a Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The processor 200 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the document characterization method in this embodiment may be performed by integrated logic circuits of hardware or by instructions in the form of software in the processor 200. The processor 200 may also be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc., and may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 100 is used for storing a computer program, which the processor 200 executes to implement the document characterization method in the present embodiment after receiving the execution instruction.
The embodiment further provides a computer storage medium, in which a computer program is stored, and the processor 200 is configured to read and execute the computer program stored in the computer storage medium 201, so as to implement the document characterization method in the embodiment.
In the above embodiments, all or part of the implementation may be realized by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer storage medium or transmitted from one computer storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer storage media may be any available media that can be accessed by a computer, or a data storage device, such as a server or data center, that incorporates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus, and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is directed to embodiments of the present application and it is noted that numerous modifications and adaptations may be made by those skilled in the art without departing from the principles of the present application and are intended to be within the scope of the present application.

Claims (8)

1. A document characterization method based on deep learning, the document characterization method comprising:
analyzing a document to be characterized to obtain a keyword, an author list, and a plurality of text information of the document to be characterized;
respectively inputting each piece of text information in the plurality of text information and the keyword into a network model combined with a keyword attention mechanism to obtain a first feature vector corresponding to each piece of text information;
sequentially inputting the author list and each piece of text information in the plurality of text information into a first feature extraction model, and respectively obtaining second feature vectors corresponding to the author list and to each piece of text information;
inputting the first feature vector and the second feature vector corresponding to each piece of text information and the second feature vector corresponding to the author list into a fusion network model for fusion to obtain a characterization vector of the document to be characterized;
wherein the network model combined with the keyword attention mechanism comprises a second feature extraction model, a first pooling layer, a keyword feature extraction layer, and a second pooling layer, and respectively inputting each piece of text information in the plurality of text information and the keyword into the network model combined with the keyword attention mechanism to obtain the first feature vector corresponding to each piece of text information comprises:
respectively inputting each piece of text information in the plurality of text information into the second feature extraction model to obtain a plurality of feature vectors corresponding to each piece of text information;
inputting the plurality of feature vectors corresponding to each piece of text information into the first pooling layer to obtain a pooled feature vector corresponding to each piece of text information;
respectively inputting the plurality of feature vectors corresponding to each piece of text information and the keyword into the keyword feature extraction layer to obtain a plurality of feature vectors, combined with the keyword attention mechanism, corresponding to each piece of text information;
and inputting the pooled feature vector corresponding to each piece of text information and the plurality of feature vectors combined with the keyword attention mechanism into the second pooling layer to obtain the first feature vector corresponding to each piece of text information.
2. The document characterization method according to claim 1, wherein if the text information is a body text, the network model combined with the keyword attention mechanism further comprises a third pooling layer, and before each piece of text information in the plurality of text information is respectively input into the second feature extraction model to obtain a plurality of feature vectors corresponding to each piece of text information, the method further comprises:
inputting the body text into the third pooling layer to obtain a pooled body text;
correspondingly, respectively inputting each piece of text information in the plurality of text information into the second feature extraction model to obtain a plurality of feature vectors corresponding to each piece of text information comprises:
inputting the pooled body text into the second feature extraction model to obtain the feature vectors corresponding to the body text.
3. The document characterization method according to claim 1, wherein sequentially inputting the author list and each piece of text information in the plurality of text information into the first feature extraction model, and respectively obtaining the second feature vectors corresponding to the author list and to each piece of text information, comprises:
inputting the author list into the first feature extraction model to obtain the second feature vector corresponding to the author list;
segmenting each piece of text information in the plurality of text information to obtain a plurality of words corresponding to each piece of text information;
and respectively inputting the plurality of words corresponding to each piece of text information into the first feature extraction model to obtain the second feature vector corresponding to each piece of text information.
4. The document characterization method according to claim 3, wherein the first feature extraction model is a BM25 model and the second feature extraction model is a Bert model.
5. The document characterization method according to claim 1, wherein the fusion network model comprises a first fusion layer, a splicing layer, a deep learning model, a second fusion layer, a third feature extraction model, and a full connection layer, and inputting the first feature vector and the second feature vector corresponding to each piece of text information and the second feature vector corresponding to the author list into the fusion network model for fusion to obtain the characterization vector of the document to be characterized comprises:
inputting the second feature vector corresponding to each piece of text information into the first fusion layer for fusion to obtain a fused feature vector;
inputting the fused feature vector, the second feature vector corresponding to the author list, and the first feature vector corresponding to each piece of text information into the splicing layer for splicing to obtain a multi-channel feature parameter;
inputting the multi-channel feature parameter into the deep learning model to obtain a third feature vector;
inputting the title and the abstract in the text information into the second fusion layer for synthesis to obtain a synthetic document;
inputting the synthetic document into the third feature extraction model to obtain a fourth feature vector;
and inputting the third feature vector and the fourth feature vector into the full connection layer to obtain the characterization vector of the document to be characterized.
6. The document characterization method according to claim 5, wherein the deep learning model comprises a plurality of fully connected layers and a plurality of activation functions, and the plurality of fully connected layers and the plurality of activation functions are alternately cascaded in sequence.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the document characterization method according to any one of claims 1 to 6.
8. A computer readable storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement a document characterization method according to any one of claims 1 to 6.
CN202110287711.1A 2021-03-17 2021-03-17 Document characterization method and device based on deep learning, equipment and storage medium Active CN112989790B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110287711.1A CN112989790B (en) 2021-03-17 2021-03-17 Document characterization method and device based on deep learning, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110287711.1A CN112989790B (en) 2021-03-17 2021-03-17 Document characterization method and device based on deep learning, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112989790A CN112989790A (en) 2021-06-18
CN112989790B true CN112989790B (en) 2023-02-28

Family

ID=76333506

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110287711.1A Active CN112989790B (en) 2021-03-17 2021-03-17 Document characterization method and device based on deep learning, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112989790B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113961710B (en) * 2021-12-21 2022-03-08 北京邮电大学 Fine-grained thesis classification method and device based on multi-mode layered fusion network
CN116738054A (en) * 2023-06-19 2023-09-12 联洋国融(上海)科技有限公司 Text depth analysis method combined with user intention

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004086533A (en) * 2002-08-27 2004-03-18 Fuji Xerox Co Ltd Scientific document managing device
CN108614825A (en) * 2016-12-12 2018-10-02 中移(杭州)信息技术有限公司 A kind of web page characteristics extracting method and device
CN111507089A (en) * 2020-06-09 2020-08-07 平安科技(深圳)有限公司 Document classification method and device based on deep learning model and computer equipment
CN111581401A (en) * 2020-05-06 2020-08-25 西安交通大学 Local citation recommendation system and method based on depth correlation matching
CN112036177A (en) * 2020-07-28 2020-12-04 中译语通科技股份有限公司 Text semantic similarity information processing method and system based on multi-model fusion
CN112347150A (en) * 2020-11-23 2021-02-09 北京智源人工智能研究院 Method and device for labeling academic label of student and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110119505A (en) * 2018-02-05 2019-08-13 阿里巴巴集团控股有限公司 Term vector generation method, device and equipment
CN110008342A (en) * 2019-04-12 2019-07-12 智慧芽信息科技(苏州)有限公司 Document classification method, apparatus, equipment and storage medium
US11748613B2 (en) * 2019-05-10 2023-09-05 Baidu Usa Llc Systems and methods for large scale semantic indexing with deep level-wise extreme multi-label learning
WO2021000362A1 (en) * 2019-07-04 2021-01-07 浙江大学 Deep neural network model-based address information feature extraction method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004086533A (en) * 2002-08-27 2004-03-18 Fuji Xerox Co Ltd Scientific document managing device
CN108614825A (en) * 2016-12-12 2018-10-02 中移(杭州)信息技术有限公司 A kind of web page characteristics extracting method and device
CN111581401A (en) * 2020-05-06 2020-08-25 西安交通大学 Local citation recommendation system and method based on depth correlation matching
CN111507089A (en) * 2020-06-09 2020-08-07 平安科技(深圳)有限公司 Document classification method and device based on deep learning model and computer equipment
CN112036177A (en) * 2020-07-28 2020-12-04 中译语通科技股份有限公司 Text semantic similarity information processing method and system based on multi-model fusion
CN112347150A (en) * 2020-11-23 2021-02-09 北京智源人工智能研究院 Method and device for labeling academic label of student and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Document-level Representation Learning using Citation-informed Transformers; Arman Cohan et al.; 《arXiv:2004.07180v1》; 20200415; pp. 1-13 *
Research on vector representation methods for massive scientific and technical literature based on deep learning technology; Zeng Wen et al.; 《Proceedings of the 2017 Annual Conference of the Beijing Science and Technology Information Society -- Forum on "Scientific and Technical Information Development Supporting the Construction of a Science and Technology Innovation Center"》; 20171117; pp. 161-167 *

Also Published As

Publication number Publication date
CN112989790A (en) 2021-06-18

Similar Documents

Publication Publication Date Title
US9792534B2 (en) Semantic natural language vector space
US9811765B2 (en) Image captioning with weak supervision
CN109635103B (en) Abstract generation method and device
GB2547068B (en) Semantic natural language vector space
CN110019812B (en) User self-production content detection method and system
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
Xi et al. Towards open-world recommendation with knowledge augmentation from large language models
Jalalzai et al. Heavy-tailed representations, text polarity classification & data augmentation
CN112749326B (en) Information processing method, information processing device, computer equipment and storage medium
CN112989790B (en) Document characterization method and device based on deep learning, equipment and storage medium
US20220237222A1 (en) Information determining method and apparatus, computer device, and storage medium
CN116720004B (en) Recommendation reason generation method, device, equipment and storage medium
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
CN111506725A (en) Method and device for generating abstract
US20220222443A1 (en) Technical document issues scanner
Timoney et al. Nostalgic sentiment analysis of youtube comments for chart hits of the 20th century
JP2019021218A (en) Learning device, program parameter, learning method and model
Curi et al. Multi-label classification of user reactions in online news
KR20230060320A (en) Knowledge graph integration method and machine learning device using the same
CN113609287A (en) Text abstract generation method and device, computer equipment and storage medium
Kumar et al. Sarcasm Detection in Telugu and Tamil: An Exploration of Machine Learning and Deep Neural Networks
Martin et al. Fusion-based Representation Learning Model for Multimode User-generated Social Network Content
CN113836289B (en) Entity evolution rule recommendation method and device
CN115935195B (en) Text matching method and device, computer readable storage medium and terminal
Yin et al. A classification method based on encoder‐decoder structure with paper content

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant