CN114399775A - Document title generation method, device, equipment and storage medium - Google Patents

Document title generation method, device, equipment and storage medium

Info

Publication number
CN114399775A
CN114399775A (application CN202210072397.XA)
Authority
CN
China
Prior art keywords
information
vector
text
sub
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210072397.XA
Other languages
Chinese (zh)
Inventor
唐小初
张祎頔
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210072397.XA
Publication of CN114399775A
Priority to PCT/CN2022/090434 (published as WO2023137906A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods

Abstract

The invention relates to artificial intelligence technology and discloses a document title generation method comprising the following steps: dividing original document information into blocks to obtain a plurality of text sub-information, image sub-information, and position sub-information; inputting the text sub-information into a text encoding model for text encoding to obtain a text feature vector; performing weighted addition of the text feature vector, the image features in the image sub-information, and the position-encoded multi-dimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion encoding to obtain a final output feature; and performing feature decoding on the final output feature with a decoder module to obtain a document block containing the title. In addition, the invention relates to blockchain technology: the image features can be stored in the nodes of a blockchain. The invention also provides a document title generation apparatus, an electronic device, and a storage medium. The invention can improve the accuracy of document title generation.

Description

Document title generation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a document title generation method and device, electronic equipment and a computer readable storage medium.
Background
With the acceleration of digitization, the structural analysis and content extraction of carriers such as documents and images have become a key link in the success or failure of enterprise digital transformation, and automatic, accurate, and rapid information processing is crucial to improving productivity. Visually rich documents (Visually Rich Documents) are a very common and important file form in daily work and life because they contain a large amount of text, layout, and format information; however, processing such documents, for example adding corresponding titles to them, is labor- and time-consuming.
Existing document title generation methods generally compare the document to be processed with existing titled documents and generate a corresponding title for the document to be processed according to the comparison result.
Disclosure of Invention
The invention provides a document title generation method and apparatus, and a computer-readable storage medium, with the main aim of improving the accuracy of document title generation.
In order to achieve the above object, the present invention provides a document title generating method, including:
acquiring original document information, wherein the original document information comprises original text information, original image information and original position information;
dividing the original document information into blocks to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information;
inputting a plurality of text sub-information into a pre-trained text coding model for text coding to obtain a text characteristic vector;
extracting image features in the image sub-information by using a preset feature extraction model;
carrying out position coding on the position sub-information to obtain a multi-dimensional position vector;
carrying out weighted addition on the text feature vector, the image features and the multi-dimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output feature;
and performing characteristic decoding on the final output characteristics by using a preset decoder module to obtain a document block containing a title.
Optionally, the performing feature decoding on the final output feature by using a preset decoder module to obtain a document block including a title includes:
performing first feature decoding on the final output feature by using a first decoder in the decoder module to obtain the original document information containing the labeled category;
selecting a document block which accords with a preset category in the original document information containing the labeling category;
and performing title classification on the document blocks conforming to the preset categories by using a second decoder in the decoder module to obtain document blocks containing titles.
Optionally, the performing position coding on the plurality of position sub-information to obtain a multi-dimensional position vector includes:
acquiring related data of the position sub-information, and mapping the related data to a preset dimension;
and coding the related data of the preset dimensionality through an embedding layer to obtain a multidimensional position vector.
Optionally, before the inputting the plurality of text sub-information into the pre-trained text coding model for text coding, the method further includes:
acquiring a training data set, and inputting any training data in the training data set into a preset text encoding model twice, with different dropout masks, to obtain a first sentence vector and a second sentence vector;
calculating vector similarity between the first sentence vector and the second sentence vector;
calculating a target value corresponding to the training data according to the vector similarity and a preset target function formula;
and adjusting parameters of the text coding model according to the target value, and outputting the trained text coding model.
Optionally, the preset objective function formula is:
l = -log( exp(sim(h_1, h_2)/τ) / Σ_j exp(sim(h_1, h_j)/τ) )
wherein l is the target value, τ is a temperature coefficient, h_j denotes a negative sample, and sim(h_1, h_2) is the vector similarity between the first sentence vector and the second sentence vector.
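As a sketch, this contrastive (InfoNCE-style) objective can be computed from precomputed similarity scores. The function name, the default τ value, and the calling convention below are illustrative, not taken from the patent:

```python
import math

def contrastive_target(sim_pos, sim_negs, tau=0.05):
    """Compute l = -log( exp(sim_pos/tau) / (exp(sim_pos/tau) + sum_j exp(sim_neg_j/tau)) ).

    sim_pos is the similarity of the positive pair (h_1, h_2); sim_negs are
    similarities between h_1 and the negative samples h_j.
    """
    num = math.exp(sim_pos / tau)
    den = num + sum(math.exp(s / tau) for s in sim_negs)
    return -math.log(num / den)
```

A higher positive-pair similarity relative to the negatives drives the target value toward zero, which is the direction training pushes the encoder.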
Optionally, the calculating a vector similarity between the first sentence vector and the second sentence vector includes:
calculating a vector similarity between the first sentence vector and the second sentence vector using the following formula:
sim(h_1, h_2) = (h_1^T · h_2) / (|h_1| · |h_2|)
wherein sim(h_1, h_2) is the vector similarity between the first sentence vector and the second sentence vector, h_1 is the first sentence vector, h_2 is the second sentence vector, h_1^T is the transpose of the first sentence vector, and |h_1| and |h_2| are the moduli of the first and second sentence vectors.
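This similarity is the standard cosine similarity; a minimal pure-Python version (function name is illustrative):

```python
import math

def vector_similarity(h1, h2):
    """Cosine similarity: sim(h1, h2) = (h1^T h2) / (|h1| * |h2|)."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    return dot / (norm1 * norm2)
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, so a well-trained encoder should place the two dropout views of one sentence near 1.0.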
Optionally, the extracting, by using a preset feature extraction model, image features in the plurality of image sub information includes:
inputting the image sub-information into a convolution module in the feature extraction model to obtain convolution data;
and carrying out feature processing on the convolution data by using a feature module in the feature extraction model to obtain image features.
In order to solve the above problem, the present invention also provides a document title generating apparatus, including:
the block division module is used for acquiring original document information which comprises original text information, original image information and original position information, and carrying out block division on the original document information to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information;
the feature coding module is used for inputting the text sub-information into a pre-trained text coding model for text coding to obtain a text feature vector, extracting image features in the image sub-information by using a preset feature extraction model, and carrying out position coding on the position sub-information to obtain a multi-dimensional position vector;
the fusion coding module is used for performing weighted addition on the text characteristic vector, the image characteristic and the multidimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output characteristic;
and the characteristic decoding module is used for performing characteristic decoding on the final output characteristic by using a preset decoder module to obtain a document block containing a title.
In order to solve the above problem, the present invention also provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the document title generation method described above.
In order to solve the above problem, the present invention also provides a computer-readable storage medium in which at least one computer program is stored, the at least one computer program being executed by a processor in an electronic device to implement the document title generation method described above.
The method divides the original document information into blocks to obtain a plurality of text sub-information, image sub-information, and position sub-information, so that block division supplies multiple kinds of sub-information for subsequent processing. The text, image, and position sub-information are encoded separately into a text feature vector, image features, and a multi-dimensional position vector; the encoded sub-information is then weighted and added, input into a transformer model for fusion encoding, and decoded in a decoder module to obtain a document block containing a title. Because three different kinds of sub-information (text, image, and position) are encoded and fused by weighting, the document block containing the title obtained by the final decoding is more accurate. Meanwhile, the decoder module comprises at least one decoder, which can improve decoding accuracy. Therefore, the document title generation method, apparatus, electronic device, and computer-readable storage medium provided by the invention can solve the problem that the accuracy of document title generation is not high enough.
Drawings
FIG. 1 is a flowchart illustrating a document title generation method according to an embodiment of the present invention;
FIG. 2 is a functional block diagram of a document title generation apparatus according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of an electronic device for implementing the document title generating method according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The embodiment of the application provides a document title generation method. The execution subject of the document title generation method includes, but is not limited to, at least one of electronic devices such as a server and a terminal, which can be configured to execute the method provided by the embodiment of the present application. In other words, the document title generation method may be executed by software or hardware installed in a terminal device or a server device, and the software may be a blockchain platform. The server includes but is not limited to: a single server, a server cluster, a cloud server or a cloud server cluster, and the like. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Referring to fig. 1, a flowchart of a document title generating method according to an embodiment of the present invention is shown. In this embodiment, the document title generating method includes:
s1, original document information is obtained, and the original document information comprises original text information, original image information and original position information.
In the embodiment of the present invention, the original document information refers to a visually rich document (Visually Rich Document), that is, text data whose semantic structure is determined not only by the text content but also by visual elements such as layout, table structure, and font.
In detail, the original document information includes original text information, original image information, and original position information, where the original text information refers to text content in the original document information, the original image information refers to image content related to layout or layout in the original document information, and the original position information refers to position content of different areas in the original document information.
S2, dividing the original document information into blocks to obtain a plurality of text sub information, a plurality of image sub information and a plurality of position sub information.
In the embodiment of the present invention, the original document information is divided into a plurality of blocks, the criterion for division being the different contents contained in the original document information, so that a plurality of text sub-information, image sub-information, and position sub-information are obtained after block division. The text sub-information refers to blocks containing document text content, the image sub-information refers to blocks containing document image content, and the position sub-information refers to data describing the position ranges of the different blocks.
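The patent does not specify a block-division algorithm; the sketch below assumes an upstream layout parser has already produced region records, and merely groups each region's text, image crop, and bounding box into the three kinds of sub-information (all names and the region-dict shape are assumptions for illustration):

```python
def divide_into_blocks(regions):
    """Split parsed document regions into text, image, and position sub-information.

    Each region is assumed to be a dict such as
    {"text": "...", "image": <crop>, "bbox": (xmin, ymin, xmax, ymax)}.
    """
    text_subs, image_subs, position_subs = [], [], []
    for region in regions:
        text_subs.append(region.get("text", ""))   # block's text content
        image_subs.append(region.get("image"))     # block's image crop
        position_subs.append(region["bbox"])       # block's position range
    return text_subs, image_subs, position_subs
```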
And S3, inputting the text sub-information into a pre-trained text coding model for text coding to obtain a text feature vector.
In the embodiment of the invention, the text encoding model is a SimCSE model, which is trained by contrastive learning and can be used for both supervised and unsupervised learning.
Specifically, before the text sub-information is input into a pre-trained text coding model for text coding, the method further includes:
acquiring a training data set, and inputting any training data in the training data set into a preset text encoding model twice, with different dropout masks, to obtain a first sentence vector and a second sentence vector;
calculating vector similarity between the first sentence vector and the second sentence vector;
calculating a target value corresponding to the training data according to the vector similarity and a preset target function formula;
and adjusting parameters of the text coding model according to the target value, and outputting the trained text coding model.
In detail, the training data set comprises a plurality of pieces of text information and is used to train the preset text encoding model, so that the trained model encodes text more accurately. The training data is passed through the text encoding model with different dropout masks to obtain a first sentence vector and a second sentence vector. The vector similarity between the two sentence vectors is then calculated, the target value corresponding to the training data is computed from that similarity, and the model parameters are adjusted according to the target value, which serves as the reference for deciding whether further parameter adjustment is needed.
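The two sentence vectors arise from two forward passes over the same input with independent dropout masks. A toy stand-in for that dropout step (the real model applies it inside SimCSE's transformer layers; this function is an assumption for illustration):

```python
import random

def apply_dropout(embedding, p=0.1, rng=None):
    """Apply inverted dropout: each call draws a fresh mask, so encoding the
    same sentence twice yields two slightly different (positive-pair) vectors."""
    rng = rng or random.Random()
    scale = 1.0 / (1.0 - p)  # rescale surviving units to keep the expected value
    return [0.0 if rng.random() < p else x * scale for x in embedding]
```

Calling this twice on one embedding produces the first and second sentence vectors of the positive pair.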
Further, the calculating a vector similarity between the first sentence vector and the second sentence vector comprises:
calculating a vector similarity between the first sentence vector and the second sentence vector using the following formula:
sim(h_1, h_2) = (h_1^T · h_2) / (|h_1| · |h_2|)
wherein sim(h_1, h_2) is the vector similarity between the first sentence vector and the second sentence vector, h_1 is the first sentence vector, h_2 is the second sentence vector, h_1^T is the transpose of the first sentence vector, and |h_1| and |h_2| are the moduli of the first and second sentence vectors.
Specifically, the preset objective function formula is as follows:
l = -log( exp(sim(h_1, h_2)/τ) / Σ_j exp(sim(h_1, h_j)/τ) )
wherein l is the target value, τ is a temperature coefficient, h_j denotes a negative sample, and sim(h_1, h_2) is the vector similarity between the first sentence vector and the second sentence vector.
In detail, adjusting the parameters according to the target value means comparing the target value with a preset target threshold. When the target value is greater than or equal to the threshold, the text encoding model is output as the trained model. When it is smaller, the model parameters are adjusted, the training data is input into the adjusted model to obtain new sentence vectors, and a new target value is calculated from them; once the target value reaches the threshold, the adjusted model is output as the trained text encoding model.
And S4, extracting image features in the image sub-information by using a preset feature extraction model.
In the embodiment of the invention, the feature extraction model is a ResNet-50 model, which comprises two kinds of module: a convolution module (Conv Block) and a feature module (Identity Block).
Specifically, the extracting, by using a preset feature extraction model, image features in a plurality of image sub-information includes:
inputting the image sub-information into a convolution module in the feature extraction model to obtain convolution data;
and carrying out feature processing on the convolution data by using a feature module in the feature extraction model to obtain image features.
In detail, the convolution module performs multiple convolution operations and pooling on the image sub-information. Multiple convolution means applying a convolution kernel to the image sub-information at least once, enriching the data; pooling then converts the convolved data to a uniform dimensionality. The feature module is a residual network, which can improve the accuracy of feature extraction.
In another embodiment of the present invention, the extracting image features in a plurality of image sub-information by using a preset feature extraction model includes:
inputting original image information in the original document information into the feature extraction model to obtain global image features;
acquiring central coordinates in the image sub-information, and zooming the central coordinates according to a preset zooming proportion to obtain final coordinates;
and performing feature interception on the global image features by using the final coordinates to obtain image features in the image sub-information.
Wherein the feature extraction model is also a ResNet-50 model.
And S5, carrying out position coding on the position sub-information to obtain a multi-dimensional position vector.
In this embodiment of the present invention, the performing position coding on a plurality of position sub-information to obtain a multidimensional position vector includes:
acquiring related data of the position sub-information, and mapping the related data to a preset dimension;
and coding the related data of the preset dimensionality through an embedding layer to obtain a multidimensional position vector.
In detail, the related data of the position sub-information may be the xmin, ymin, xmax, ymax, height, and width of the bounding box of each block. The related data are mapped to a preset dimension and then encoded through an embedding layer, that is, an embedding layer of a neural network composed of a plurality of neurons.
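Under the assumption that the embedding layer amounts to a learned linear map over the six bounding-box scalars (the weights below are randomly initialized placeholders, and all names are illustrative), the encoding can be sketched as:

```python
import random

def encode_position(bbox, dim=8, seed=0):
    """Map (xmin, ymin, xmax, ymax) plus derived height/width to a dim-d vector."""
    xmin, ymin, xmax, ymax = bbox
    feats = [xmin, ymin, xmax, ymax, ymax - ymin, xmax - xmin]
    rng = random.Random(seed)
    # Placeholder for the trained embedding-layer weights (feats x dim).
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in feats]
    return [sum(f * weights[i][d] for i, f in enumerate(feats)) for d in range(dim)]
```

In the trained model these weights would be learned jointly with the rest of the network rather than sampled.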
S6, carrying out weighted addition on the text feature vector, the image feature and the multi-dimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output feature.
In this embodiment of the present invention, the performing weighted addition on the text feature vector, the image feature, and the multidimensional position vector includes:
and carrying out weighted addition on the text characteristic vector, the image characteristic and the multi-dimensional position vector according to a preset weighted addition formula to obtain a final input vector.
Specifically, the preset weighted addition formula is as follows:
F = α·w_1 + β·w_2 + γ·w_3
where F is the final input vector, w_1 is the text feature vector, w_2 is the image feature, w_3 is the multi-dimensional position vector, and α, β, and γ are respectively preset different weights.
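Assuming the three feature vectors have been projected to a common dimension, the weighted addition is element-wise; the default weight values below are illustrative, not the patent's:

```python
def fuse_features(w1, w2, w3, alpha=0.4, beta=0.3, gamma=0.3):
    """Element-wise F = alpha*w1 + beta*w2 + gamma*w3 over equal-length vectors."""
    return [alpha * a + beta * b + gamma * c for a, b, c in zip(w1, w2, w3)]
```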
Further, the transformer encoder model includes six transformer encoder layers. The final input vector is input into the first encoder layer to obtain a first output feature, the first output feature is used as the input of the second encoder layer, and so on: the output of each encoder layer is used as the input of the next, realizing fusion encoding and yielding the final output feature.
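The layer-chaining described above can be sketched generically; any callable stands in for a transformer encoder layer here, since the point is only the feed-forward composition of six layers:

```python
def run_encoder_stack(x, layers):
    """Feed the output of each encoder layer into the next and return the last output."""
    for layer in layers:
        x = layer(x)
    return x
```

With real encoder layers, `layers` would be six trained transformer encoder modules rather than toy callables.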
And S7, performing feature decoding on the final output features by using a preset decoder module to obtain a document block containing a title.
In this embodiment of the present invention, the performing feature decoding on the final output feature by using a preset decoder module to obtain a document block including a title includes:
performing first feature decoding on the final output feature by using a first decoder in the decoder module to obtain the original document information containing the labeled category;
selecting a document block which accords with a preset category in the original document information containing the labeling category;
and performing title classification on the document blocks conforming to the preset categories by using a second decoder in the decoder module to obtain document blocks containing titles.
In detail, the decoder module is constructed from two decoders, namely a first decoder and a second decoder, both of which may be transformer decoders. The transformer decoder overcomes the RNN's inability to compute in parallel, and, compared with a CNN model, the number of operations required to relate two positions does not grow with their distance, so the transformer decoder is more advantageous.
Specifically, the final output features are input into the first decoder for first feature decoding to obtain the original document information with labeled categories; in this solution, the labeled categories may be title, body, header, footer, and the like. The document blocks of the preset category are those containing titles; the second decoder in the decoder module then performs title classification on them, dividing the titles into primary, secondary, and tertiary titles, to obtain the document blocks containing titles.
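A schematic of this two-stage decoding, with the two classifier functions as placeholders for the trained first and second decoders (the category label "title" and the level names follow the patent's examples; everything else is illustrative):

```python
def decode_titles(blocks, classify_category, classify_title_level):
    """Stage 1: label every block; stage 2: sub-classify only the title blocks."""
    labeled = [(block, classify_category(block)) for block in blocks]
    titles = [block for block, category in labeled if category == "title"]
    return [(block, classify_title_level(block)) for block in titles]
```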
The method divides the original document information into blocks to obtain a plurality of text sub-information, image sub-information, and position sub-information, so that block division supplies multiple kinds of sub-information for subsequent processing. The text, image, and position sub-information are encoded separately into a text feature vector, image features, and a multi-dimensional position vector; the encoded sub-information is then weighted and added, input into a transformer model for fusion encoding, and decoded in a decoder module to obtain a document block containing a title. Because three different kinds of sub-information (text, image, and position) are encoded and fused by weighting, the document block containing the title obtained by the final decoding is more accurate. Meanwhile, the decoder module comprises at least one decoder, which can improve decoding accuracy. Therefore, the document title generation method provided by the invention can solve the problem that the accuracy of document title generation is not high enough.
Fig. 2 is a functional block diagram of a document title generation apparatus according to an embodiment of the present invention.
The document title generation apparatus 100 according to the present invention may be installed in an electronic device. According to the implemented functions, the document title generating device 100 may include a block dividing module 101, a feature encoding module 102, a fusion encoding module 103, and a feature decoding module 104. The module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that can be executed by a processor of an electronic device and that can perform a fixed function, and that are stored in a memory of the electronic device.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the block division module 101 is configured to obtain original document information, where the original document information includes original text information, original image information, and original position information, and perform block division on the original document information to obtain a plurality of text sub-information, a plurality of image sub-information, and a plurality of position sub-information;
the feature coding module 102 is configured to input the plurality of text sub-information into a pre-trained text coding model for text coding to obtain a text feature vector, extract image features in the plurality of image sub-information by using a preset feature extraction model, perform position coding on the plurality of position sub-information, and obtain a multi-dimensional position vector;
the fusion coding module 103 is configured to perform weighted addition on the text feature vector, the image feature, and the multidimensional position vector to obtain a final input vector, and input the final input vector into a transformer encoder model for fusion coding to obtain a final output feature;
the feature decoding module 104 is configured to perform feature decoding on the final output feature by using a preset decoder module to obtain a document block including a title.
In detail, the document title generating apparatus 100 includes the following modules:
the method comprises the steps of firstly, obtaining original document information, wherein the original document information comprises original text information, original image information and original position information.
In the embodiment of the present invention, the original document information refers to a visually rich document (Visually Rich Document), that is, text data whose semantic structure is determined not only by the text content but also by visual elements such as layout, table structure, and font.
In detail, the original document information includes original text information, original image information, and original position information, where the original text information refers to text content in the original document information, the original image information refers to image content related to layout or layout in the original document information, and the original position information refers to position content of different areas in the original document information.
And secondly, carrying out block division on the original document information to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information.
In the embodiment of the present invention, the original document information is divided into a plurality of blocks, with the division based on the different contents contained in the original document information; block division thus yields a plurality of text sub-information, a plurality of image sub-information, and a plurality of position sub-information. The text sub-information refers to blocks containing document text content, the image sub-information refers to blocks containing document image content, and the position sub-information refers to data describing the position ranges of the different blocks.
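The block-division step can be sketched as follows. This is a minimal illustration with a hypothetical data layout (the `blocks`, `text`, `image`, and `bbox` fields are assumptions, not the patented implementation): each block carries its text content, an image crop, and its bounding-box position.

```python
def divide_into_blocks(document):
    """Split raw document info into per-block text, image, and position records."""
    text_sub, image_sub, position_sub = [], [], []
    for block in document["blocks"]:
        text_sub.append(block.get("text", ""))       # document text content
        image_sub.append(block.get("image"))         # e.g. a cropped page region
        position_sub.append(block["bbox"])           # (xmin, ymin, xmax, ymax)
    return text_sub, image_sub, position_sub

# toy document with two blocks
doc = {"blocks": [
    {"text": "1. Introduction", "image": None, "bbox": (50, 40, 550, 70)},
    {"text": "Body paragraph...", "image": None, "bbox": (50, 80, 550, 300)},
]}
texts, images, positions = divide_into_blocks(doc)
```

Each of the three lists then feeds a different encoder in the subsequent steps.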
And step three, inputting the text sub-information into a pre-trained text coding model for text coding to obtain a text characteristic vector.
In the embodiment of the invention, the text coding model is a SimCSE model, where the SimCSE model is trained by a contrastive learning method and can be used for both supervised and unsupervised learning.
Specifically, before inputting the plurality of text sub-information into a pre-trained text coding model for text coding, the following steps are further performed:
acquiring a training data set, and inputting any training data in the training data set into a preset text coding model twice, with different dropout masks, to obtain a first sentence vector and a second sentence vector;
calculating vector similarity between the first sentence vector and the second sentence vector;
calculating a target value corresponding to the training data according to the vector similarity and a preset target function formula;
and adjusting parameters of the text coding model according to the target value, and outputting the trained text coding model.
In detail, the training data set comprises a plurality of text information data, and is used to train a preset text coding model so that the trained model has more accurate text coding capability. The training data is fitted twice with different dropout masks in the text coding model, yielding a first sentence vector and a second sentence vector. The vector similarity between the two sentence vectors is then calculated, a target value corresponding to the training data is computed from that similarity, and the parameters of the text coding model are adjusted according to the target value; the target value serves as the reference for deciding whether the model parameters need further adjustment.
Further, the calculating a vector similarity between the first sentence vector and the second sentence vector comprises:
calculating a vector similarity between the first sentence vector and the second sentence vector using the following formula:
sim(h1, h2) = (h1^T · h2) / (|h1| · |h2|)

where sim(h1, h2) is the vector similarity between the first sentence vector and the second sentence vector, h1 is the first sentence vector, h2 is the second sentence vector, h1^T is the transposed vector of the first sentence vector, |h1| is the modulus of the first sentence vector, and |h2| is the modulus of the second sentence vector.
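The cosine-similarity formula above can be computed directly. A minimal sketch in plain Python (function name and test vectors are illustrative):

```python
import math

def cosine_similarity(h1, h2):
    """sim(h1, h2) = h1^T·h2 / (|h1|·|h2|), as in the formula above."""
    dot = sum(a * b for a, b in zip(h1, h2))
    norm1 = math.sqrt(sum(a * a for a in h1))
    norm2 = math.sqrt(sum(b * b for b in h2))
    return dot / (norm1 * norm2)

# identical vectors give similarity 1, orthogonal vectors give 0
sim_same = cosine_similarity([1.0, 0.0], [1.0, 0.0])
sim_orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```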
Specifically, the preset objective function formula is as follows:
l = -log( e^(sim(h1, h2)/τ) / Σ_j e^(sim(h1, hj)/τ) )

where l is the target value, τ is a temperature coefficient, hj is a negative sample, and sim(h1, h2) is the vector similarity between the first sentence vector and the second sentence vector.
In detail, parameter adjustment of the text coding model according to the target value proceeds by comparing the target value with a preset target threshold. When the target value is greater than or equal to the target threshold, the text coding model is output as the trained text coding model. When the target value is less than the target threshold, the model parameters are adjusted, the training data is input into the adjusted model to obtain new sentence vectors, and a corresponding new target value is calculated from them; once the target value is greater than or equal to the target threshold, the adjusted model is output as the trained text coding model.
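The objective formula above is the standard contrastive (InfoNCE-style) loss; a minimal sketch follows. The exact normalization set is an assumption here (one positive pair plus a list of negative samples), as is the temperature value:

```python
import math

def cosine_similarity(h1, h2):
    dot = sum(a * b for a, b in zip(h1, h2))
    return dot / (math.sqrt(sum(a * a for a in h1)) *
                  math.sqrt(sum(b * b for b in h2)))

def contrastive_loss(h1, h2, negatives, tau=0.05):
    """l = -log( e^{sim(h1,h2)/tau} / sum_j e^{sim(h1,hj)/tau} ),
    summing over the positive pair plus the negative samples."""
    pos = math.exp(cosine_similarity(h1, h2) / tau)
    denom = pos + sum(math.exp(cosine_similarity(h1, hj) / tau)
                      for hj in negatives)
    return -math.log(pos / denom)

# a well-aligned positive pair yields a much smaller loss than a misaligned one
loss_good = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_bad = contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing this loss pulls the two dropout-perturbed views of the same sentence together while pushing negatives apart.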
And step four, extracting the image characteristics in the image sub-information by using a preset characteristic extraction model.
In the embodiment of the invention, the feature extraction model is a ResNet-50 model. The ResNet-50 model comprises two kinds of modules: a convolution block (Conv Block) and an identity block (Identity Block).
Specifically, the extracting, by using a preset feature extraction model, image features in a plurality of image sub-information includes:
inputting the image sub-information into a convolution module in the feature extraction model to obtain convolution data;
and carrying out feature processing on the convolution data by using a feature module in the feature extraction model to obtain image features.
In detail, the convolution module may perform multiple convolution processing and pooling processing on the image sub-information. Multiple convolution processing means that a convolution kernel is applied to the image sub-information at least once, making the resulting data richer; pooling processing then converts the convolved data into data of a uniform dimensionality. The identity block is a residual network structure, which can improve the accuracy of feature extraction.
In another embodiment of the present invention, the extracting image features in a plurality of image sub-information by using a preset feature extraction model includes:
inputting original image information in the original document information into the feature extraction model to obtain global image features;
acquiring central coordinates in the image sub-information, and zooming the central coordinates according to a preset zooming proportion to obtain final coordinates;
and performing feature interception on the global image features by using the final coordinates to obtain image features in the image sub-information.
Wherein the feature extraction model is also a ResNet-50 model.
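The coordinate-scaling-and-crop variant above can be sketched without any deep-learning library. This is an illustration only: the feature map, the 1/100 scale factor, and the single-cell "crop" stand in for the real ResNet-50 feature map and ROI extraction:

```python
def crop_block_feature(global_feature, center, scale):
    """Scale a block's page-space center coordinates into feature-map space,
    then take the feature vector at that cell (a minimal stand-in for
    intercepting the global image feature at the final coordinates)."""
    cx, cy = center
    fx, fy = int(cx * scale), int(cy * scale)
    return global_feature[fy][fx]

# hypothetical 4x4 global feature map of 2-d vectors; page coords scaled by 1/100
fmap = [[[r, c] for c in range(4)] for r in range(4)]
feat = crop_block_feature(fmap, center=(250, 150), scale=0.01)
```

Computing one global feature map and cropping per block avoids running the extraction model once per block.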
And fifthly, carrying out position coding on the position sub-information to obtain a multi-dimensional position vector.
In this embodiment of the present invention, the performing position coding on a plurality of position sub-information to obtain a multidimensional position vector includes:
acquiring related data of the position sub-information, and mapping the related data to a preset dimension;
and coding the related data of the preset dimensionality through an embedding layer to obtain a multidimensional position vector.
In detail, the related data of the position sub-information may be the xmin, ymin, xmax, ymax, height, and width of the bounding box of each block. The related data is mapped to a preset dimension and then encoded through an embedding layer, where the embedding layer is an embedding layer of a neural network, composed of a plurality of neurons.
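A minimal sketch of this position encoding, using plain lookup tables as a stand-in for a learned embedding layer. The dimension, bucket count, and random initialization are all assumptions for illustration:

```python
import random

random.seed(0)
DIM = 8             # the "preset dimension" (assumed value)
NUM_BUCKETS = 1000  # coordinate values are bucketed into [0, NUM_BUCKETS) (assumed)

# one lookup table per box attribute, as an embedding layer would hold
FIELDS = ("xmin", "ymin", "xmax", "ymax", "height", "width")
tables = {name: [[random.gauss(0.0, 0.02) for _ in range(DIM)]
                 for _ in range(NUM_BUCKETS)]
          for name in FIELDS}

def encode_position(bbox):
    """Look up each of the six box values in its table and sum the rows
    into one multi-dimensional position vector."""
    vec = [0.0] * DIM
    for name in FIELDS:
        row = tables[name][min(int(bbox[name]), NUM_BUCKETS - 1)]
        vec = [v + r for v, r in zip(vec, row)]
    return vec

pos_vec = encode_position({"xmin": 50, "ymin": 40, "xmax": 550, "ymax": 70,
                           "height": 30, "width": 500})
```

In a trained model the tables would be learned parameters rather than random values.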
And step six, carrying out weighted addition on the text characteristic vector, the image characteristic and the multi-dimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output characteristic.
In this embodiment of the present invention, the performing weighted addition on the text feature vector, the image feature, and the multidimensional position vector includes:
and carrying out weighted addition on the text characteristic vector, the image characteristic and the multi-dimensional position vector according to a preset weighted addition formula to obtain a final input vector.
Specifically, the preset weighted addition formula is as follows:
F=α*w1+β*w2+γ*w3
where F is the final input vector, w1 is the text feature vector, w2 is the image feature, w3 is the multi-dimensional position vector, and α, β and γ are respectively preset different weights.
Further, the transformer encoder model includes six transformer encoder layers. The final input vector is input into the first encoder layer of the model to obtain a first output feature; the first output feature is used as the input of the second encoder layer, and so on, with the output of each encoder layer serving as the input of the next, thereby realizing fusion encoding and obtaining the final output feature.
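The layer chaining described above reduces to a simple loop. The stand-in "layers" below just offset a vector so the chaining is visible; a real implementation would use actual transformer encoder layers:

```python
def run_encoder_stack(x, layers):
    """Feed the input through each layer in turn; the previous layer's
    output becomes the next layer's input (six layers in this scheme)."""
    for layer in layers:
        x = layer(x)
    return x

# six stand-in layers: layer k adds k to every element, to show the chaining
layers = [lambda v, k=k: [e + k for e in v] for k in range(6)]
out = run_encoder_stack([0.0], layers)
```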
And seventhly, performing feature decoding on the final output features by using a preset decoder module to obtain a document block containing a title.
In this embodiment of the present invention, the performing feature decoding on the final output feature by using a preset decoder module to obtain a document block including a title includes:
performing first feature decoding on the final output feature by using a first decoder in the decoder module to obtain the original document information containing the labeled category;
selecting a document block which accords with a preset category in the original document information containing the labeling category;
and performing title classification on the document blocks conforming to the preset categories by using a second decoder in the decoder module to obtain document blocks containing titles.
In detail, the decoder module is constructed from two decoders, namely a first decoder and a second decoder, both of which may be transformer decoders. The transformer decoder overcomes the limitation that an RNN model cannot compute in parallel, and compared with a CNN model, the number of operations it requires to relate two positions does not grow with their distance, so the transformer decoder is more advantageous.
Specifically, the final output features are input into the first decoder for first feature decoding to obtain the original document information with labeled categories; in the present solution, the labeled categories may be title, body, header, footer, and the like. Document blocks of the preset category are those containing titles. The second decoder in the decoder module then performs title classification on the document blocks of the preset category to obtain document blocks containing titles, where title classification refers to dividing the titles in the document into first-level, second-level and third-level titles.
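The two-stage decoding pipeline can be sketched as follows. The classification rules here (upper-case text marks a title, font size picks the level) are purely hypothetical stand-ins for the two transformer decoders:

```python
def first_decoder(blocks):
    # stand-in for the first decoder: attach a category label to every block
    return [dict(b, category=("title" if b["text"].isupper() else "body"))
            for b in blocks]

def second_decoder(title_blocks):
    # stand-in for the second decoder: assign a title level (assumed font rule)
    def level(b):
        return 1 if b["font"] >= 20 else (2 if b["font"] >= 14 else 3)
    return [dict(b, level=level(b)) for b in title_blocks]

blocks = [{"text": "OVERVIEW", "font": 22},
          {"text": "some body text", "font": 10}]
labelled = first_decoder(blocks)
# keep only blocks of the preset category ("title"), then classify their level
titles = second_decoder([b for b in labelled if b["category"] == "title"])
```

The point of the two-stage design is that the second decoder only sees blocks the first decoder already labeled as titles, narrowing its task to level assignment.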
In the method, the original document information is divided into blocks to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information, and this block division supplies the sub-information for subsequent processing. The text, image and position sub-information are respectively encoded into a text feature vector, image features and a multi-dimensional position vector; the encoded sub-information is weighted and summed, input into a transformer model for fusion encoding, and decoded in a decoder module to obtain a document block containing a title. Because three different kinds of sub-information (text, image and position) are encoded, weighted and fused, the document block containing a title obtained by final decoding is more accurate. Meanwhile, the decoder module comprises at least one decoder, which can improve decoding accuracy. Therefore, the document title generation apparatus provided by the invention can solve the problem that the accuracy of document title generation is not high enough.
Fig. 3 is a schematic structural diagram of an electronic device for implementing a document title generating method according to an embodiment of the present invention.
The electronic device 1 may comprise a processor 10, a memory 11, a communication bus 12 and a communication interface 13, and may further comprise a computer program, such as a document title generation program, stored in the memory 11 and executable on the processor 10.
In some embodiments, the processor 10 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same function or different functions, and includes one or more Central Processing Units (CPUs), a microprocessor, a digital Processing chip, a graphics processor, a combination of various control chips, and the like. The processor 10 is a Control Unit (Control Unit) of the electronic device, connects various components of the electronic device by using various interfaces and lines, and executes various functions and processes data of the electronic device by running or executing programs or modules (e.g., executing a document title generation program, etc.) stored in the memory 11 and calling data stored in the memory 11.
The memory 11 includes at least one type of readable storage medium including flash memory, removable hard disks, multimedia cards, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disks, optical disks, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device, for example a removable hard disk of the electronic device. The memory 11 may also be an external storage device of the electronic device in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device. The memory 11 may be used not only to store application software installed in the electronic device and various types of data, such as codes of a document title generation program, etc., but also to temporarily store data that has been output or is to be output.
The communication bus 12 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
The communication interface 13 is used for communication between the electronic device and other devices, and includes a network interface and a user interface. Optionally, the network interface may include a wired interface and/or a wireless interface (e.g., WI-FI interface, bluetooth interface, etc.), which are typically used to establish a communication connection between the electronic device and other electronic devices. The user interface may be a Display (Display), an input unit such as a Keyboard (Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so that functions of charge management, discharge management, power consumption management and the like are realized through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The document title generation program stored in the memory 11 of the electronic device 1 is a combination of instructions, which when executed in the processor 10, can implement:
acquiring original document information, wherein the original document information comprises original text information, original image information and original position information;
dividing the original document information into blocks to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information;
inputting a plurality of text sub-information into a pre-trained text coding model for text coding to obtain a text characteristic vector;
extracting image features in the image sub-information by using a preset feature extraction model;
carrying out position coding on the position sub-information to obtain a multi-dimensional position vector;
carrying out weighted addition on the text feature vector, the image features and the multi-dimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output feature;
and performing characteristic decoding on the final output characteristics by using a preset decoder module to obtain a document block containing a title.
Specifically, the specific implementation method of the instruction by the processor 10 may refer to the description of the relevant steps in the embodiment corresponding to the drawings, which is not described herein again.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer readable storage medium may be volatile or non-volatile. For example, the computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
The present invention also provides a computer-readable storage medium, storing a computer program which, when executed by a processor of an electronic device, may implement:
acquiring original document information, wherein the original document information comprises original text information, original image information and original position information;
dividing the original document information into blocks to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information;
inputting a plurality of text sub-information into a pre-trained text coding model for text coding to obtain a text characteristic vector;
extracting image features in the image sub-information by using a preset feature extraction model;
carrying out position coding on the position sub-information to obtain a multi-dimensional position vector;
carrying out weighted addition on the text feature vector, the image features and the multi-dimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output feature;
and performing characteristic decoding on the final output characteristics by using a preset decoder module to obtain a document block containing a title.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The blockchain is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
The embodiment of the application can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A document title generation method, the method comprising:
acquiring original document information, wherein the original document information comprises original text information, original image information and original position information;
dividing the original document information into blocks to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information;
inputting a plurality of text sub-information into a pre-trained text coding model for text coding to obtain a text characteristic vector;
extracting image features in the image sub-information by using a preset feature extraction model;
carrying out position coding on the position sub-information to obtain a multi-dimensional position vector;
carrying out weighted addition on the text feature vector, the image features and the multi-dimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output feature;
and performing characteristic decoding on the final output characteristics by using a preset decoder module to obtain a document block containing a title.
2. The method for generating a document title according to claim 1, wherein said performing feature decoding on the final output feature by using a preset decoder module to obtain a document block containing a title comprises:
performing first feature decoding on the final output feature by using a first decoder in the decoder module to obtain the original document information containing the labeled category;
selecting a document block which accords with a preset category in the original document information containing the labeling category;
and performing title classification on the document blocks conforming to the preset categories by using a second decoder in the decoder module to obtain document blocks containing titles.
3. The method for generating a document title according to claim 1, wherein said position-coding a plurality of said position sub-information to obtain a multi-dimensional position vector comprises:
acquiring related data of the position sub-information, and mapping the related data to a preset dimension;
and coding the related data of the preset dimensionality through an embedding layer to obtain a multidimensional position vector.
4. The method of claim 1, wherein before inputting the plurality of text sub-information into a pre-trained text coding model for text coding, the method further comprises:
acquiring a training data set, and inputting any training data in the training data set into a preset text coding model twice, with different dropout masks, to obtain a first sentence vector and a second sentence vector;
calculating vector similarity between the first sentence vector and the second sentence vector;
calculating a target value corresponding to the training data according to the vector similarity and a preset target function formula;
and adjusting parameters of the text coding model according to the target value, and outputting the trained text coding model.
5. The document title generating method according to claim 4, wherein said preset objective function formula is:
l = -log( e^(sim(h1, h2)/τ) / Σ_j e^(sim(h1, hj)/τ) )

where l is the target value, τ is a temperature coefficient, hj is a negative sample, and sim(h1, h2) is the vector similarity between the first sentence vector and the second sentence vector.
6. The document title generation method of claim 4, wherein said calculating a vector similarity between said first sentence vector and said second sentence vector comprises:
calculating a vector similarity between the first sentence vector and the second sentence vector using the following formula:
sim(h1, h2) = (h1^T · h2) / (|h1| · |h2|)

where sim(h1, h2) is the vector similarity between the first sentence vector and the second sentence vector, h1 is the first sentence vector, h2 is the second sentence vector, h1^T is the transposed vector of the first sentence vector, |h1| is the modulus of the first sentence vector, and |h2| is the modulus of the second sentence vector.
7. The document title generating method according to any one of claims 1 to 6, wherein said extracting image features in a plurality of said image sub information by using a preset feature extraction model comprises:
inputting the image sub-information into a convolution module in the feature extraction model to obtain convolution data;
and carrying out feature processing on the convolution data by using a feature module in the feature extraction model to obtain image features.
8. A document title generation apparatus, characterized in that the apparatus comprises:
the block division module is used for acquiring original document information which comprises original text information, original image information and original position information, and carrying out block division on the original document information to obtain a plurality of text sub-information, a plurality of image sub-information and a plurality of position sub-information;
the feature coding module is used for inputting the text sub-information into a pre-trained text coding model for text coding to obtain a text feature vector, extracting image features in the image sub-information by using a preset feature extraction model, and carrying out position coding on the position sub-information to obtain a multi-dimensional position vector;
the fusion coding module is used for performing weighted addition on the text characteristic vector, the image characteristic and the multidimensional position vector to obtain a final input vector, and inputting the final input vector into a transformer encoder model for fusion coding to obtain a final output characteristic;
and the characteristic decoding module is used for performing characteristic decoding on the final output characteristic by using a preset decoder module to obtain a document block containing a title.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform the document title generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the document title generation method according to any one of claims 1 to 7.
CN202210072397.XA 2022-01-21 2022-01-21 Document title generation method, device, equipment and storage medium Pending CN114399775A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210072397.XA CN114399775A (en) 2022-01-21 2022-01-21 Document title generation method, device, equipment and storage medium
PCT/CN2022/090434 WO2023137906A1 (en) 2022-01-21 2022-04-29 Document title generation method and apparatus, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210072397.XA CN114399775A (en) 2022-01-21 2022-01-21 Document title generation method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114399775A true CN114399775A (en) 2022-04-26

Family

ID=81233667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210072397.XA Pending CN114399775A (en) 2022-01-21 2022-01-21 Document title generation method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114399775A (en)
WO (1) WO2023137906A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115270778A (en) * 2022-08-25 2022-11-01 北京达佳互联信息技术有限公司 Title simplifying method, device, equipment and storage medium
WO2023137906A1 (en) * 2022-01-21 2023-07-27 平安科技(深圳)有限公司 Document title generation method and apparatus, device and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112667800A (en) * 2020-12-21 2021-04-16 深圳壹账通智能科技有限公司 Keyword generation method and device, electronic equipment and computer storage medium
CN113434684B (en) * 2021-07-01 2022-03-08 北京中科研究院 Rumor detection method, system, equipment and storage medium for self-supervision learning
CN113742483A (en) * 2021-08-27 2021-12-03 北京百度网讯科技有限公司 Document classification method and device, electronic equipment and storage medium
CN113888475A (en) * 2021-09-10 2022-01-04 上海商汤智能科技有限公司 Image detection method, training method of related model, related device and equipment
CN113869017A (en) * 2021-09-30 2021-12-31 平安科技(深圳)有限公司 Table image reconstruction method, device, equipment and medium based on artificial intelligence
CN114399775A (en) * 2022-01-21 2022-04-26 平安科技(深圳)有限公司 Document title generation method, device, equipment and storage medium

Also Published As

Publication number Publication date
WO2023137906A1 (en) 2023-07-27

Similar Documents

Publication Publication Date Title
CN113822494B (en) Risk prediction method, device, equipment and storage medium
CN111695439A (en) Image structured data extraction method, electronic device and storage medium
CN113378970B (en) Sentence similarity detection method and device, electronic equipment and storage medium
CN112380870A (en) User intention analysis method and device, electronic equipment and computer storage medium
CN114399775A (en) Document title generation method, device, equipment and storage medium
CN112988963A (en) User intention prediction method, device, equipment and medium based on multi-process node
CN113704429A (en) Semi-supervised learning-based intention identification method, device, equipment and medium
CN114398557A (en) Information recommendation method and device based on double portraits, electronic equipment and storage medium
CN114550870A (en) Prescription auditing method, device, equipment and medium based on artificial intelligence
CN114708461A (en) Multi-modal learning model-based classification method, device, equipment and storage medium
CN114416939A (en) Intelligent question and answer method, device, equipment and storage medium
CN113821622A (en) Answer retrieval method and device based on artificial intelligence, electronic equipment and medium
CN113935880A (en) Policy recommendation method, device, equipment and storage medium
CN113821602A (en) Automatic answering method, device, equipment and medium based on image-text chatting record
CN113157739A (en) Cross-modal retrieval method and device, electronic equipment and storage medium
CN112269875A (en) Text classification method and device, electronic equipment and storage medium
CN114943306A (en) Intention classification method, device, equipment and storage medium
CN115346095A (en) Visual question answering method, device, equipment and storage medium
CN115238115A (en) Image retrieval method, device and equipment based on Chinese data and storage medium
CN113536782A (en) Sensitive word recognition method and device, electronic equipment and storage medium
CN114639109A (en) Image processing method and device, electronic equipment and storage medium
CN114398890A (en) Text enhancement method, device, equipment and storage medium
CN112631589A (en) Application program home page layout configuration method and device, electronic equipment and storage medium
CN111680513B (en) Feature information identification method and device and computer readable storage medium
US20230376687A1 (en) Multimodal extraction across multiple granularities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination