CN113159013B - Paragraph identification method, device, computer equipment and medium based on machine learning - Google Patents


Info

Publication number
CN113159013B
CN113159013B (application CN202110467091.XA)
Authority
CN
China
Prior art keywords
paragraph
feature vector
vector
target
context data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110467091.XA
Other languages
Chinese (zh)
Other versions
CN113159013A (en)
Inventor
吴天博 (Wu Tianbo)
王健宗 (Wang Jianzong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110467091.XA
Publication of CN113159013A
Application granted
Publication of CN113159013B
Active legal status
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the field of artificial intelligence, and aims to automatically identify and merge incorrectly segmented paragraphs in an editable document obtained by converting a non-editable document, thereby improving the usability of the editable document. Disclosed are a machine-learning-based paragraph recognition method, apparatus, computer device, and medium, the method comprising: acquiring context data to be merged and the image data corresponding to the context data; inputting the image data into a target detection model for feature extraction to obtain an image feature vector, and inputting the context data into a word vector model for vectorization to obtain a text feature vector; inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result for the context data; and merging the text belonging to the same paragraph in the context data according to the paragraph prediction result. The present application further relates to blockchain technology, in which the context data may be stored.

Description

Paragraph identification method, device, computer equipment and medium based on machine learning
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a paragraph recognition method, apparatus, computer device, and medium based on machine learning.
Background
A PDF document carries its own formatting and is very convenient to use. In practice, however, because a PDF document is not editable, users often need to convert it into an editable WORD document. Existing document conversion methods mainly divide the PDF document into blocks, identify the text, pictures, tables, and other information in each block, and finally merge the information of all the blocks so as to preserve the format. However, an original paragraph in the PDF document may be incorrectly segmented in the converted WORD document, so that the resulting paragraphs no longer correspond to the original ones and the usability of the WORD document is low.
Therefore, how to improve the usability of the editable document produced by converting a non-editable document has become a problem to be solved.
Disclosure of Invention
The application provides a machine-learning-based paragraph identification method, apparatus, computer device, and medium. By extracting features from context data and from the image data corresponding to that context data, and by inputting the resulting image feature vector and text feature vector into a paragraph prediction model for paragraph prediction, incorrectly segmented paragraphs in an editable document obtained by converting a non-editable document can be automatically identified and merged, without manual paragraph adjustment.
In a first aspect, the present application provides a paragraph identification method based on machine learning, the method comprising:
Acquiring context data to be combined, and acquiring image data corresponding to the context data, wherein the context data is text corresponding to the image data;
Inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain a text feature vector corresponding to the context data;
Inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result corresponding to the context data;
And merging the characters belonging to the same paragraph in the context data according to the paragraph prediction result.
In a second aspect, the present application further provides a paragraph identifying device based on machine learning, the device comprising:
The data acquisition module is used for acquiring context data to be combined and acquiring image data corresponding to the context data, wherein the context data is the text corresponding to the image data;
The feature extraction module is used for inputting the image data into a target detection model to perform feature extraction to obtain an image feature vector corresponding to the image data, and inputting the context data into a word vector model to perform vectorization to obtain a text feature vector corresponding to the context data;
The paragraph prediction module is used for inputting the image feature vector and the text feature vector into a paragraph prediction model to perform paragraph prediction, so as to obtain a paragraph prediction result corresponding to the context data;
And the paragraph merging module is used for merging the characters belonging to the same paragraph in the context data according to the paragraph prediction result.
In a third aspect, the present application also provides a computer device comprising a memory and a processor;
the memory is used for storing a computer program;
The processor is configured to execute the computer program and implement the paragraph identification method based on machine learning as described above when the computer program is executed.
In a fourth aspect, the present application also provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement a paragraph identification method based on machine learning as described above.
The application discloses a machine-learning-based paragraph identification method, apparatus, computer device, and medium. By acquiring the context data to be merged and the image data corresponding to that context data, features can be extracted from each and the two resulting feature vectors fused. Inputting the image data into the target detection model for feature extraction yields the image feature vector corresponding to the image data; inputting the context data into the word vector model for vectorization yields the text feature vector corresponding to the context data. Inputting the image feature vector and the text feature vector into the paragraph prediction model allows paragraph prediction to be performed after multi-modal information fusion, which improves the accuracy of the paragraph prediction result for the context data. Finally, merging the text belonging to the same paragraph in the context data according to the paragraph prediction result automatically identifies and merges the incorrect segmentation in the editable document obtained by converting the non-editable document, so that no manual paragraph adjustment is needed and the usability of the editable document is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a paragraph identification method based on machine learning provided by an embodiment of the application;
FIG. 2 is a schematic diagram of image data provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of paragraph prediction for context data according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a sub-step of feature extraction of image data provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of preprocessing an image feature vector according to an embodiment of the present application;
FIG. 6 is a schematic diagram of preprocessing a text feature vector according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a paragraph prediction model according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a sub-step of paragraph prediction provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of paragraph prediction according to an embodiment of the present application;
FIG. 10 is a schematic block diagram of a paragraph identification device based on machine learning provided by an embodiment of the present application;
Fig. 11 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The flow diagrams depicted in the figures are merely illustrative and not necessarily all of the elements and operations/steps are included or performed in the order described. For example, some operations/steps may be further divided, combined, or partially combined, so that the order of actual execution may be changed according to actual situations.
It is to be understood that the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
The embodiments of the present application provide a machine-learning-based paragraph identification method, apparatus, computer device, and medium. The method can be applied to a server or a terminal. By extracting features from context data and from the corresponding image data, and by inputting the resulting image feature vector and text feature vector into a paragraph prediction model to be fused before paragraph prediction, incorrectly segmented paragraphs in an editable document obtained by converting a non-editable document can be automatically identified and merged without manual adjustment, improving the usability of the editable document.
The servers may be independent servers or may be server clusters. The terminal can be electronic equipment such as a smart phone, a tablet computer, a notebook computer, a desktop computer and the like.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and features of the embodiments may be combined with each other without conflict.
As shown in fig. 1, the paragraph recognition method based on machine learning includes steps S10 to S40.
Step S10, obtaining context data to be combined, and obtaining image data corresponding to the context data, wherein the context data is characters corresponding to the image data.
It should be noted that the machine-learning-based paragraph identification method provided by the embodiment of the application can be applied to a scenario in which a non-editable document is converted into an editable document. The non-editable document may be a PDF document; the editable document may be a WORD document. It will be appreciated that, in document type conversion, the most important requirement is to preserve the original format. Since a PDF document does not itself store its format, the precondition for preserving the format is to first identify the exact location of every data element. In this location-based conversion, the notion of a paragraph is lost, so the converted WORD document does not retain the original paragraph structure.
In some embodiments, obtaining context data to be consolidated may include: based on a preset type conversion strategy, carrying out type conversion on a first document to be subjected to document type conversion to obtain a corresponding second document; and determining the context data to be combined according to the characters of every two adjacent paragraphs in the second document.
By way of example, the first document to be document type converted may be a PDF document; the second document may be a WORD document, such as a document in DOC format or DOCX format.
By way of example, the preset type conversion policy may include OCR (Optical Character Recognition) technology or a PDF-to-WORD tool. It should be noted that OCR is used to analyse and identify documents such as pictures and tables to obtain text and layout information.
For example, when performing type conversion on a PDF document to be subjected to document type conversion, OCR technology may be used to locate, identify, and convert various data formats (text, picture, form) in the PDF document, so as to obtain a WORD document.
It should be noted that the WORD document obtained by type conversion of the PDF document contains a plurality of paragraphs, but these may not correspond one-to-one with the paragraphs in the PDF document: a paragraph is easily split into several lines of text. The erroneous segmentation in the WORD document therefore needs to be identified and merged.
For example, when determining the context data to be combined according to the text of each two adjacent paragraphs in the second document, the text of each two adjacent paragraphs in the WORD document may be used to determine the context data to be combined. The context data to be merged may thus comprise the text of two adjacent paragraphs, or may comprise the text of a plurality of adjacent paragraphs.
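The pairing of adjacent paragraphs described above can be sketched as follows; this is a minimal illustration assuming the converted WORD document's paragraphs are available as a list of strings (the function name is hypothetical):

```python
def build_context_pairs(paragraphs):
    """Collect the context data to be merged: the text of every two
    adjacent paragraphs in the converted (second) document.

    `paragraphs` is assumed to be the list of paragraph strings
    extracted from the WORD document.
    """
    return [(paragraphs[i], paragraphs[i + 1])
            for i in range(len(paragraphs) - 1)]
```

Each pair is one unit of context data; adjacent pairs overlap, so every paragraph boundary in the document is examined once.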
To further ensure privacy and security of the context data, the context data may be stored in a node of a blockchain.
The context data to be consolidated is illustratively as follows:
the federal learning technique can safely apply multiparty data without the privacy being known by the participants. Bank study in federal
When the method is applied, data exchange among federal members lacks data security and privacy protection requirements, and business application has security risks. There is a need to formulate phases.
When the context data is translated, the translation result is:
Federated learning technology can be used for secure multi-party data applications without the privacy being known by the participants. Banks study in the federal government
In the application, the data exchange between federates lacks data security and privacy protection requirements, and the business application has security risks. There is an urgent need to formulate a plan.
Because the originally coherent sentence has been split, it is translated as the two fragments "Banks study in the federal government" / "In the application", so the translation is wrong. It is therefore necessary to perform paragraph recognition on the context and join the sentences that belong to the same paragraph, thereby increasing the usability of the converted document.
In some embodiments, acquiring the image data corresponding to the context data may include: image data is determined based on text regions in the first document corresponding to the context data.
Referring to fig. 2, fig. 2 is a schematic diagram of image data according to an embodiment of the application. As shown in fig. 2, after determining the context data to be combined, a text region in the PDF document corresponding to the context data may be determined as image data.
For example, text regions in a PDF document corresponding to context data may be captured, thereby obtaining image data. It is understood that the context data is text corresponding to the image data.
In some embodiments, after the image data corresponding to the context data is acquired, image preprocessing may also be performed on the image data. Among other things, image preprocessing may include, but is not limited to, binarization, laplace sharpening, rotation correction, and so forth. For example, when binarizing image data, a pixel having an RGB value of less than 127 may be valued at 0, and a pixel having an RGB value of greater than 127 may be valued at 255.
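A minimal sketch of the binarization rule mentioned above, assuming a single-channel grayscale array (the text speaks of RGB values; one channel is used here for brevity, and the boundary value 127 is mapped to 255):

```python
import numpy as np

def binarize(gray):
    """Binarize image data: pixels with a value below 127 become 0,
    all remaining pixels become 255."""
    return np.where(gray < 127, 0, 255).astype(np.uint8)
```

Binarization of this kind removes background shading before the image data is passed to the target detection model.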
By acquiring the context data to be merged and the image data corresponding to the context data, features can be extracted from each and the two resulting feature vectors fused, so that paragraph prediction is performed after multi-modal information fusion.
And step S20, inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain a text feature vector corresponding to the context data.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating paragraph prediction on context data according to an embodiment of the present application. As shown in fig. 3, the image data is input into the target detection model for feature extraction to obtain the image feature vector corresponding to the image data, and the context data is input into the word vector model for vectorization to obtain the text feature vector corresponding to the context data; the image feature vector and the text feature vector are then input into the paragraph prediction model for paragraph prediction to obtain the paragraph prediction result corresponding to the context data. Paragraph prediction is thus carried out after multi-modal information fusion, which improves the accuracy of the paragraph prediction result.
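The flow of fig. 3 can be sketched as a single function. All arguments here are hypothetical callables and data standing in for the target detection model, the word vector model, and the paragraph prediction model; no concrete model APIs are implied:

```python
def predict_paragraphs(context_text, image, detector, word_model, predictor):
    """Multi-modal paragraph prediction: extract an image feature vector
    and a text feature vector, then fuse them in the prediction model."""
    image_vec = detector(image)            # image feature vector
    text_vec = word_model(context_text)    # text feature vector
    return predictor(image_vec, text_vec)  # paragraph prediction result
```

The point of the structure is that neither modality alone decides the merge: the predictor sees both the visual layout features and the semantic text features.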
By way of example, the target detection model may include, but is not limited to, the Fast R-CNN (Fast Region Convolutional Neural Network) model, the SSD (Single Shot Detector) model, and the YOLO model, among others. The target detection model comprises at least a region generation network layer and a feature extraction layer. It should be noted that the region generation network layer may include an RPN (Region Proposal Network) layer for generating a plurality of candidate boxes, and the feature extraction layer may include an ROI (Region of Interest) Pooling layer for extracting the features of each candidate box through a convolutional neural network.
In the embodiment of the application, the target detection model may be a Fast R-CNN model, and in the following, the Fast R-CNN model will be taken as an example to describe in detail how to perform feature extraction on image data.
Referring to fig. 4, fig. 4 is a schematic flowchart of a sub-step of feature extraction of image data according to an embodiment of the present application, which may specifically include the following steps S201 to S203.
Step S201, based on the region generation network layer, adding candidate boxes to each line of text in the image data, and sequentially determining each two adjacent lines as a first target line and a second target line.
For example, the image data may be input into a region generation network layer, which adds candidate boxes to each line of text in the image data; each two adjacent rows are then determined to be a first target row and a second target row in turn.
By adding candidate frames to each line of text in the image data, each two adjacent lines are sequentially determined to be a first target line and a second target line, and then the image feature vector can be determined according to the candidate frames in the first target line and the second target line.
Step S202, based on the feature extraction layer, determining a first position feature vector corresponding to the last candidate frame in the first target row and a second position feature vector corresponding to the first candidate frame in the second target row.
In the embodiment of the application, feature extraction mainly targets text that lies between two punctuation marks and spans two lines. Therefore, the first position feature vector corresponding to the last candidate frame in the first target row and the second position feature vector corresponding to the first candidate frame in the second target row need to be extracted, so that the position feature vectors of the cross-line text between two punctuation marks are obtained.
For example, a first location feature vector corresponding to a last candidate box in a first target row may be extracted by the feature extraction layer, and a second location feature vector corresponding to a first candidate box in a second target row may be extracted.
It should be noted that, the Fast R-CNN model adopts a VGG16 network structure as a basic model, and convolves the image data with a plurality of convolution layers to obtain feature vectors with different scales, where the feature vectors are used for predicting position information corresponding to the image data.
Illustratively, the feature extraction layer may convolve the candidate frame with a convolution layer, so as to obtain a location feature vector.
Step S203, determining the image feature vector according to the first position feature vector and the second position feature vector.
For example, after the first position feature vector and the second position feature vector are obtained, the first position feature vector and the second position feature vector may be determined as image feature vectors. At this time, the image feature vector may include a plurality of position feature vectors, for example, a first position feature vector and a second position feature vector.
By determining the first position feature vector corresponding to the last candidate frame in the first target row and determining the second position feature vector corresponding to the first candidate frame in the second target row, feature extraction can be performed on characters which are located between two punctuation marks and cross rows, image feature vectors containing the position features of the characters are obtained, and accuracy of paragraph prediction is improved.
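Steps S201 to S203 can be sketched as follows, assuming each text line is represented by the per-frame feature vectors already produced by the feature extraction layer (the box ordering and vector shapes are illustrative assumptions):

```python
import numpy as np

def adjacent_line_pairs(lines):
    """Determine each two adjacent lines in sequence as the first and
    second target lines (step S201)."""
    return [(lines[i], lines[i + 1]) for i in range(len(lines) - 1)]

def image_feature_vector(first_row_boxes, second_row_boxes):
    """Take the position feature of the last candidate frame in the first
    target row and of the first candidate frame in the second target row,
    and stack them into the image feature vector (steps S202-S203)."""
    first_pos = first_row_boxes[-1]    # last frame of the upper line
    second_pos = second_row_boxes[0]   # first frame of the lower line
    return np.concatenate([first_pos, second_pos])
```

The two position features together describe exactly the region where a wrongly split paragraph would break: the end of one line and the start of the next.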
In some embodiments, the context data may be input into a word vector model for vectorization, so as to obtain the text feature vector corresponding to the context data. Word vector models may include, but are not limited to, the BERT (Bidirectional Encoder Representations from Transformers) model, the word2vec model, the GloVe model, the ELMo model, and so on. In the embodiment of the application, the BERT model is taken as an example to describe in detail how the context data is vectorized.
Illustratively, the BERT model may be pre-trained. For example, a large-scale text corpus unrelated to any specific NLP (Natural Language Processing) task may be used in advance to train the BERT model, resulting in a trained BERT model. During training, the BERT model takes the semantic vector representations of a target word and of its context words as input to an attention mechanism; linear transformations then produce the vector representation of the target word, the vector representation of each context word, and the original value representations of both; finally, the similarity between the target-word vector and each context-word vector is computed as a weight, and the vectors are weight-fused as the output of the attention mechanism, i.e., the enhanced semantic vector representation of the target word.
The text feature vector corresponding to the context data is obtained by inputting the context data into a trained BERT model for vectorization. It should be noted that, by inputting the context data into the BERT model for vectorization, a semantically enhanced text feature vector may be obtained.
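The similarity-weighted fusion described for the BERT model can be illustrated with plain NumPy. The dot-product similarity and softmax normalisation are assumptions made for this sketch, since the text only says the similarity is used as a weight:

```python
import numpy as np

def attention_enhanced(target, context):
    """Compute an enhanced semantic vector for a target word: weight each
    context-word vector by its similarity to the target vector and take
    the weighted sum as the attention output.

    `target` has shape (d,); `context` has shape (n, d).
    """
    scores = context @ target             # similarity to each context word
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # normalise the weights
    return weights @ context              # weighted fusion of the vectors
```

Context words similar to the target contribute most to the fused vector, which is what makes the resulting text feature vector "semantically enhanced".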
And step S30, inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, and obtaining a paragraph prediction result corresponding to the context data.
In some embodiments, before inputting the image feature vector and the text feature vector into the paragraph prediction model for paragraph prediction, the method may further include: and respectively preprocessing the image feature vector and the text feature vector to obtain a target image feature vector and a target text feature vector.
For example, the preprocessing may include weight value assignment, residual connection, and normalization. In the embodiment of the application, weight value assignment can be realized through a self-attention layer, and residual connection and normalization can be realized through a forward propagation (feed-forward) layer. Here, residual connection means taking both the input and the output of the current layer as the input of the next layer: if the input of the current layer is X and its output is F(X), the input of the next layer is F(X) + X.
It should be noted that weight value assignment strengthens the correlations among the features themselves. Residual connection mitigates vanishing gradients: because the input and output of the current layer are spliced together as the input of the next layer, a problem in the current layer's output is not destructive. Normalization improves the stability of the paragraph prediction model and prevents abnormal data from affecting it. Preprocessing the image feature vector and the text feature vector in this way therefore effectively improves the generalization of the paragraph prediction model.
Referring to fig. 5, fig. 5 is a schematic diagram illustrating preprocessing of an image feature vector according to an embodiment of the present application. As shown in fig. 5, the image feature vector output by the object detection model may be input into the self-attention weight layer to perform weight value distribution, so as to obtain an image feature vector after weight value distribution; and then inputting the image feature vector with the assigned weight value into a forward propagation layer for residual connection and normalization to obtain a target image feature vector.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating preprocessing of text feature vectors according to an embodiment of the present application. As shown in fig. 6, the text feature vector may be input into the self-attention weight layer to perform weight value allocation, so as to obtain the text feature vector after the weight value allocation; and then inputting the text feature vector with the assigned weight value into a forward propagation layer for residual connection and normalization to obtain a target text feature vector.
In the embodiment of the application, the paragraph prediction model can be a pre-trained model. The training process of the paragraph prediction model may include: acquiring training text data of a preset quantity and training image data corresponding to the training text data; determining a first feature vector corresponding to training image data based on the target detection model, and determining a second feature vector corresponding to training text data based on the word vector model; and inputting the first feature vector and the second feature vector into the paragraph prediction model for iterative training until the paragraph prediction model converges, so as to obtain a trained paragraph prediction model.
For example, when the first feature vector and the second feature vector are input into the paragraph prediction model for iterative training, the training sample data of each training round can be determined according to the first feature vector and the second feature vector; the current-round training sample data is input into the paragraph prediction model for paragraph prediction training to obtain a paragraph prediction result corresponding to the current-round training sample data; a loss function value corresponding to the paragraph prediction result is determined based on a preset loss function; and if the loss function value is greater than a preset loss value threshold, the parameters of the paragraph prediction model are adjusted and the next round of training is performed, until the obtained loss function value is less than or equal to the loss value threshold, at which point training ends and the trained paragraph prediction model is obtained.
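The iterative procedure above (predict on the current round's samples, compute the loss, stop once it falls to the preset loss value threshold, otherwise adjust the parameters) can be sketched with a toy logistic-regression stand-in for the paragraph prediction model. The synthetic data, learning rate, and threshold below are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy stand-ins for the fused first/second feature vectors and their labels
# (1 = same paragraph, 0 = different paragraphs).
X = rng.normal(size=(64, 6))
true_w = rng.normal(size=6)
y = (X @ true_w > 0).astype(float)

w = np.zeros(6)        # parameters of the stand-in paragraph prediction model
loss_threshold = 0.3   # preset loss value threshold
lr = 0.5               # learning rate for gradient descent

for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))  # paragraph prediction (probability)
    # Logarithmic (cross-entropy) loss as the preset loss function.
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    if loss <= loss_threshold:
        break                            # model has converged; stop training
    w -= lr * X.T @ (p - y) / len(y)     # adjust parameters (gradient step)
```

The loop terminates as soon as the round's loss value drops to the threshold, mirroring the convergence criterion described above.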
By way of example, the preset loss functions may include, but are not limited to, 0-1 loss functions, absolute value loss functions, logarithmic loss functions, square loss functions, exponential loss functions, and the like.
The preset loss value threshold may be set according to actual situations, and specific values are not limited herein.
Illustratively, the parameters of the paragraph prediction model may be adjusted by a gradient descent algorithm or a back-propagation algorithm.
The accuracy of the trained paragraph prediction model can be improved by training the paragraph prediction model; the time required by convergence of the paragraph prediction model can be reduced and the training speed can be improved by calculating the loss function value of each round of training and adjusting the parameters of the paragraph prediction model according to the loss function value.
To further ensure the privacy and security of the trained paragraph prediction model, the trained paragraph prediction model may also be stored in a node of a blockchain. When a trained paragraph prediction model needs to be used, it can be invoked from the nodes of the blockchain.
In some embodiments, inputting the image feature vector and the text feature vector into the paragraph prediction model for paragraph prediction may include: and inputting the target image feature vector and the target text feature vector into a paragraph prediction model to conduct paragraph prediction.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a paragraph prediction model according to an embodiment of the application. As shown in fig. 7, the paragraph prediction model includes a cross attention layer, a fusion layer, a self-attention weight layer, a fully connected layer, and an output layer. It should be noted that the cross attention layer is used to calculate the semantic correlation between two feature vectors; the fusion layer is used to perform weighted fusion of the feature vectors according to that semantic correlation; and the self-attention weight layer is used to establish connections and interactions between the two feature vectors.
By inputting the image feature vector and the text feature vector into the paragraph prediction model for paragraph prediction, the paragraph prediction can be performed after the multi-mode information is fused, and the accuracy of the paragraph prediction result corresponding to the context data is improved.
Referring to fig. 8, fig. 8 is a schematic flowchart of a substep of performing paragraph prediction according to an embodiment of the present application, which may specifically include the following steps S301 to S304.
Step S301, inputting the target image feature vector and the target text feature vector into the cross attention layer for semantic relevance calculation, and obtaining a semantic relevance matrix corresponding to the target text feature vector.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating paragraph prediction according to an embodiment of the application. As shown in fig. 9, the target image feature vector and the target text feature vector may be input into the cross attention layer for semantic relevance computation.
It should be noted that the cross attention layer may perform semantic relevance calculation through a similarity algorithm. The similarity algorithm may include, but is not limited to, euclidean distance, cosine similarity, jaccard similarity coefficient, pearson correlation coefficient, and the like.
For example, the semantic correlation between the target image feature vector and the target text feature vector may be calculated based on cosine similarity, which may take the following form:

r_ij = (V_i · L_j) / (||V_i|| × ||L_j||)

where V represents the target image feature vector; L represents the target text feature vector; and r_ij is an entry of the semantic correlation matrix, computed from the i-th image feature vector V_i and the j-th text feature vector L_j.
The similarity between the target image feature vector and the target text feature vector can be extracted by inputting the target image feature vector and the target text feature vector into the cross attention layer for semantic relevance calculation.
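The cosine-similarity computation performed by the cross attention layer can be sketched directly in NumPy; the feature counts and dimensions below are illustrative:

```python
import numpy as np

def semantic_correlation(V, L):
    # Cosine similarity between every target image feature vector V[i]
    # and every target text feature vector L[j]; entries lie in [-1, 1].
    Vn = V / np.linalg.norm(V, axis=1, keepdims=True)
    Ln = L / np.linalg.norm(L, axis=1, keepdims=True)
    return Vn @ Ln.T  # semantic correlation matrix r[i, j]

rng = np.random.default_rng(1)
V = rng.normal(size=(3, 8))  # 3 target image feature vectors
L = rng.normal(size=(5, 8))  # 5 target text feature vectors
R = semantic_correlation(V, L)
```

A vector's cosine similarity with itself is 1, so the diagonal of `semantic_correlation(V, V)` is all ones, which makes a quick sanity check.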
And step S302, inputting the target text feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target text feature vector.
In some embodiments, inputting the target text feature vector and the semantic correlation matrix into a fusion layer to obtain a feature fusion vector corresponding to the target text feature vector includes: determining a convolution kernel corresponding to the fusion layer according to the semantic correlation matrix; and convolving the target text feature vector based on the convolution kernel to obtain a feature fusion vector.
Illustratively, the size of the semantic correlation matrix may be n×n. In the embodiment of the application, the semantic correlation matrix can be determined as a convolution kernel, so that the size of the convolution kernel is n×n. The value of n may be determined according to practical situations, and specific numerical values are not limited herein.
Illustratively, when performing convolution, the convolution kernel may be moved over the target text feature vector to perform a dot product operation. The step length of the movement can be set according to practical situations, and specific numerical values are not limited herein.
By inputting the target text feature vector and the semantic correlation matrix into the fusion layer, the feature fusion vector corresponding to the target text feature vector can be obtained, fusion of the target image feature vector and the target text feature vector is realized, and further accuracy of a prediction result of a subsequent paragraph can be improved.
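The fusion step can be sketched as a plain "valid" cross-correlation: the n×n semantic correlation matrix serves as the convolution kernel and is slid over a 2-D target text feature map, taking a dot product at each position. The map shape, kernel, and stride below are illustrative assumptions:

```python
import numpy as np

def fuse(text_features, kernel, stride=1):
    # Slide the n x n kernel over the feature map and take dot products
    # ('valid' convolution: no padding, so the output shrinks by n - 1).
    n = kernel.shape[0]
    rows = (text_features.shape[0] - n) // stride + 1
    cols = (text_features.shape[1] - n) // stride + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = text_features[i * stride:i * stride + n,
                                  j * stride:j * stride + n]
            out[i, j] = np.sum(patch * kernel)  # dot-product operation
    return out

text_map = np.arange(30, dtype=float).reshape(5, 6)  # toy text feature map
kernel = np.ones((3, 3)) / 9.0                       # stand-in 3x3 kernel
fused = fuse(text_map, kernel)
```

With a 5×6 map and a 3×3 kernel at stride 1, the output is 3×4, illustrating how the kernel size and step length jointly determine the fused feature's shape.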
And step S303, inputting the feature fusion vector into the self-attention weight layer for weight value distribution, and obtaining a target feature fusion vector.
In some embodiments, before the feature fusion vector is input into the self-attention weight layer for weight value assignment to obtain the target feature fusion vector, the method may further include: performing residual connection and normalization on the feature fusion vector to obtain a normalized feature fusion vector.
For example, the feature fusion vector may be sequentially subjected to residual connection and normalization to obtain a normalized feature fusion vector.
In some embodiments, inputting the feature fusion vector into the self-attention weight layer for weight value distribution to obtain a target feature fusion vector comprises: and inputting the normalized feature fusion vector into a self-attention weight layer for weight value distribution, and obtaining a target feature fusion vector.
The normalized feature fusion vector is input into the self-attention weight layer for weight value assignment to obtain the target feature fusion vector, which, in the standard scaled dot-product form of self-attention, can be written as:

F' = softmax(Q K^T / sqrt(d_k)) V

where Q, K, and V are the query, key, and value projections of the normalized feature fusion vector; d_k is the dimension of the key vectors; and F' represents the target feature fusion vector.
It should be noted that, in the embodiment of the present application, the residual connection and normalization may be performed on the data each time the sub-layer is passed.
The relevance among the self-characteristics of the characteristics can be increased by inputting the characteristic fusion vector into the self-attention weight layer to carry out weight value distribution; by carrying out residual connection and normalization on the feature fusion vector, the accuracy of the prediction of the subsequent paragraph can be improved.
Step S304, the target feature fusion vector is sequentially input into the full connection layer and the output layer, and the paragraph prediction result is obtained.
After the target feature fusion vector is obtained, the target feature fusion vector is input into the full connection layer and the output layer, and a paragraph prediction result corresponding to the context data can be obtained.
The fully connected layer (Fully Connected layer, FC) is used to connect all features of the previous layer and send the output value to the output layer. The output layer is used to classify the values output by the fully connected layer, for example by a softmax function. In the embodiment of the application, the output layer outputs the paragraph prediction result corresponding to the context data.
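A minimal sketch of these final two layers: a fully connected layer maps the target feature fusion vector to two logits, and the output layer turns them into class probabilities with softmax. The weights and dimensions are random placeholders, not trained values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(7)
W = rng.normal(size=(2, 16))  # FC weights: 16-dim fusion vector -> 2 classes
b = np.zeros(2)               # FC bias

fusion_vector = rng.normal(size=16)  # target feature fusion vector
logits = W @ fusion_vector + b       # fully connected layer
probs = softmax(logits)              # output layer (class probabilities)
prediction = int(np.argmax(probs))   # paragraph prediction result:
                                     # 1 = same paragraph, 0 = different
```

The softmax guarantees the two class probabilities sum to 1, so the arg-max directly gives the 0/1 paragraph prediction result used in the merging step.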
And step S40, combining the characters belonging to the same paragraph in the context data according to the paragraph prediction result.
Illustratively, the paragraph prediction result may include 0 and 1. Wherein 1 represents the same paragraph; 0 represents a non-identical paragraph.
For example, when the paragraph prediction result is 1, the two paragraphs of text in the context data belong to the same paragraph, and the text belonging to the same paragraph may be merged.
For example, when the paragraph prediction result is 0, the two paragraphs of text in the context data do not belong to the same paragraph, and they do not need to be merged.
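The merging rule above can be sketched in plain Python. The convention that predictions[i] refers to the adjacent pair (paragraphs[i], paragraphs[i + 1]) is an assumption for illustration:

```python
def merge_paragraphs(paragraphs, predictions):
    # predictions[i] is the model's verdict on the pair
    # (paragraphs[i], paragraphs[i + 1]): 1 = same paragraph, 0 = not.
    merged = [paragraphs[0]]
    for text, same in zip(paragraphs[1:], predictions):
        if same == 1:
            merged[-1] = merged[-1] + text  # rejoin a wrongly split paragraph
        else:
            merged.append(text)             # keep the paragraph break
    return merged

paragraphs = ["The quick brown fox ",
              "jumps over the lazy dog.",
              "A new paragraph."]
result = merge_paragraphs(paragraphs, [1, 0])
```

Here the first two fragments are rejoined into one paragraph while the third remains separate, matching the 1/0 semantics of the paragraph prediction result.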
By combining the text belonging to the same paragraph in the context data according to the paragraph prediction result, the automatic identification and combination of the error segmentation in the editable document obtained by converting the non-editable document are realized, paragraph adjustment is not needed manually, and the usability of the editable document is improved.
According to the machine-learning-based paragraph identification method described above, by acquiring the context data to be merged and the image data corresponding to the context data, features can be extracted from the context data and the image data separately and the two resulting feature vectors fused, so that paragraph prediction is performed after multi-modal information fusion. Inputting the context data into the BERT model for vectorization yields a semantically enhanced text feature vector, and preprocessing the image feature vector and the text feature vector separately effectively improves the generalization of the paragraph prediction model. Training the paragraph prediction model improves the accuracy of the trained model; calculating the loss function value of each training round and adjusting the model parameters accordingly reduces the time required for convergence and speeds up training. Inputting the image feature vector and the text feature vector into the paragraph prediction model allows paragraph prediction to be performed after multi-modal fusion, improving the accuracy of the paragraph prediction result corresponding to the context data. Inputting the target image feature vector and the target text feature vector into the cross attention layer for semantic correlation calculation extracts the similarity between the two; inputting the target text feature vector and the semantic correlation matrix into the fusion layer produces the feature fusion vector corresponding to the target text feature vector, realizing the fusion of the two feature vectors and further improving the accuracy of the subsequent paragraph prediction result. Inputting the feature fusion vector into the self-attention weight layer for weight value assignment increases the correlation among the features themselves, and performing residual connection and normalization on the feature fusion vector improves the accuracy of subsequent paragraph prediction. Finally, by merging the text belonging to the same paragraph in the context data according to the paragraph prediction result, wrongly split paragraphs in an editable document converted from a non-editable document are automatically identified and merged, no manual paragraph adjustment is needed, and the usability of the editable document is improved.
Referring to fig. 10, fig. 10 is a schematic block diagram of a paragraph recognition device 1000 based on machine learning for performing the foregoing paragraph recognition method based on machine learning according to an embodiment of the present application. The paragraph recognition device based on machine learning can be configured in a server or a terminal.
As shown in fig. 10, the paragraph recognition device 1000 based on machine learning includes: a data acquisition module 1001, a feature extraction module 1002, a paragraph prediction module 1003, and a paragraph merging module 1004.
The data obtaining module 1001 is configured to obtain context data to be combined, and obtain image data corresponding to the context data, where the context data is text corresponding to the image data.
The feature extraction module 1002 is configured to input the image data into a target detection model to perform feature extraction, obtain an image feature vector corresponding to the image data, and input the context data into a word vector model to perform vectorization, obtain a text feature vector corresponding to the context data.
And a paragraph prediction module 1003, configured to input the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, so as to obtain a paragraph prediction result corresponding to the context data.
And a paragraph merging module 1004, configured to merge the text belonging to the same paragraph in the context data according to the paragraph prediction result.
It should be noted that, for convenience and brevity of description, the specific working process of the apparatus and each module described above may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 11.
Referring to fig. 11, fig. 11 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device may be a server or a terminal.
Referring to fig. 11, the computer device includes a processor and a memory connected by a system bus, wherein the memory may include a non-volatile storage medium and an internal memory.
The processor is used to provide computing and control capabilities to support the operation of the entire computer device.
The internal memory provides an environment for the execution of a computer program in a non-volatile storage medium that, when executed by a processor, causes the processor to perform any of a number of machine learning based paragraph recognition methods.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
Wherein in one embodiment the processor is configured to run a computer program stored in the memory to implement the steps of:
Acquiring context data to be combined, and acquiring image data corresponding to the context data, wherein the context data is text corresponding to the image data; inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain a text feature vector corresponding to the context data; inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result corresponding to the context data; and merging the text belonging to the same paragraph in the context data according to the paragraph prediction result.
In one embodiment, the object detection model includes at least a region generation network layer and a feature extraction layer; the processor is used for realizing that when the image data is input into a target detection model to perform feature extraction and an image feature vector corresponding to the image data is obtained, the processor is used for realizing:
Generating a network layer based on the region, adding a candidate frame to each line of text in the image data, and sequentially determining each two adjacent lines as a first target line and a second target line; determining a first position feature vector corresponding to a last candidate frame in the first target row and a second position feature vector corresponding to a first candidate frame in the second target row based on the feature extraction layer; and determining the image feature vector according to the first position feature vector and the second position feature vector.
In one embodiment, the processor is further configured to, prior to implementing paragraph prediction by inputting the image feature vector and the text feature vector into a paragraph prediction model, implement:
And respectively preprocessing the image feature vector and the text feature vector to obtain a target image feature vector and a target text feature vector, wherein the preprocessing comprises weight value distribution, residual error connection and normalization.
In one embodiment, the processor, when implementing paragraph prediction by inputting the image feature vector and the text feature vector into a paragraph prediction model, is configured to implement:
and inputting the target image feature vector and the target text feature vector into the paragraph prediction model to perform paragraph prediction.
In one embodiment, the paragraph prediction model includes a cross attention layer, a fusion layer, a self attention weight layer, a full connection layer, and an output layer; the processor is configured to, when implementing paragraph prediction by inputting the target image feature vector and the target text feature vector into a paragraph prediction model, implement:
Inputting the target image feature vector and the target text feature vector into the cross attention layer for semantic relevance calculation to obtain a semantic relevance matrix corresponding to the target text feature vector; inputting the target text feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target text feature vector; inputting the feature fusion vector into the self-attention weight layer for weight value distribution to obtain a target feature fusion vector; and sequentially inputting the target feature fusion vector into the full-connection layer and the output layer to obtain the paragraph prediction result.
In one embodiment, when the processor inputs the target text feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target text feature vector, the processor is configured to implement:
Determining a convolution kernel corresponding to the fusion layer according to the semantic correlation matrix; and convolving the target text feature vector based on the convolution kernel to obtain the feature fusion vector.
In one embodiment, before implementing the feature fusion vector is input into the self-attention weight layer for weight value allocation, the processor is further configured to implement:
and carrying out residual connection and normalization on the feature fusion vector to obtain the normalized feature fusion vector.
In one embodiment, the processor is configured to, when implementing the inputting of the feature fusion vector into the self-attention weight layer for weight value allocation, obtain a target feature fusion vector, implement:
and inputting the normalized feature fusion vector into the self-attention weight layer to perform weight value distribution, so as to obtain the target feature fusion vector.
In one embodiment, the processor, when implementing obtaining the context data to be consolidated, is configured to implement:
Based on a preset type conversion strategy, carrying out type conversion on a first document to be subjected to document type conversion to obtain a corresponding second document; and determining the context data to be combined according to the characters of every two adjacent paragraphs in the second document.
In one embodiment, the processor, when implementing acquiring the image data corresponding to the context data, is configured to implement:
and determining the image data according to the text area corresponding to the context data in the first document.
An embodiment of the present application further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program includes program instructions, and the processor executes the program instructions to implement any of the paragraph identification methods based on machine learning provided in the embodiments of the present application.
The computer readable storage medium may be an internal storage unit of the computer device according to the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital card (Secure Digital Card, SD Card), or a flash card (Flash Card) provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm and the like. The blockchain (Blockchain), essentially a de-centralized database, is a string of data blocks that are generated in association using cryptographic methods, each of which contains information from a batch of network transactions for verifying the validity (anti-counterfeit) of its information and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (8)

1. A machine learning based paragraph identification method, comprising:
Acquiring context data to be combined, and acquiring image data corresponding to the context data, wherein the context data is characters corresponding to the image data, and the context data is characters of adjacent paragraphs in a WORD document;
Inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain a word feature vector corresponding to the context data;
Inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result corresponding to the context data;
combining the characters belonging to the same paragraph in the context data according to the paragraph prediction result;
The obtaining the context data to be combined includes: based on a preset type conversion strategy, carrying out type conversion on a first document to be subjected to document type conversion to obtain a corresponding second document; determining the context data to be combined according to the characters of every two adjacent paragraphs in the second document;
Before inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, the method further comprises: preprocessing the image feature vector and the text feature vector respectively to obtain a target image feature vector and a target text feature vector, wherein the preprocessing comprises weight value distribution, residual error connection and normalization;
inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, comprising: inputting the target image feature vector and the target text feature vector into the paragraph prediction model to perform paragraph prediction;
The paragraph prediction model comprises a cross attention layer, a fusion layer, a self attention weight layer, a full connection layer and an output layer; inputting the target image feature vector and the target text feature vector into a paragraph prediction model for paragraph prediction, comprising: inputting the target image feature vector and the target text feature vector into the cross attention layer for semantic relevance calculation to obtain a semantic relevance matrix corresponding to the target text feature vector; inputting the target text feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target text feature vector; inputting the feature fusion vector into the self-attention weight layer for weight value distribution to obtain a target feature fusion vector; and sequentially inputting the target feature fusion vector into the full-connection layer and the output layer to obtain the paragraph prediction result.
2. The machine learning based paragraph identification method of claim 1 wherein the object detection model comprises at least a region generation network layer and a feature extraction layer; inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, wherein the feature extraction comprises the following steps:
generating a network layer based on the region, adding a candidate frame to each line of text in the image data, and sequentially determining each two adjacent lines as a first target line and a second target line;
Determining a first position feature vector corresponding to a last candidate frame in the first target row and a second position feature vector corresponding to a first candidate frame in the second target row based on the feature extraction layer;
and determining the image feature vector according to the first position feature vector and the second position feature vector.
3. The machine learning-based paragraph identification method according to claim 1, wherein the inputting the target text feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target text feature vector comprises:
determining a convolution kernel corresponding to the fusion layer according to the semantic correlation matrix;
And convolving the target text feature vector based on the convolution kernel to obtain the feature fusion vector.
4. The machine learning based paragraph identification method according to claim 1, wherein before the inputting the feature fusion vector into the self-attention weight layer for weight value assignment, further comprising:
Carrying out residual connection and normalization on the feature fusion vector to obtain a normalized feature fusion vector;
the step of inputting the feature fusion vector into the self-attention weight layer for weight value distribution to obtain a target feature fusion vector comprises the following steps:
and inputting the normalized feature fusion vector into the self-attention weight layer to perform weight value distribution, so as to obtain the target feature fusion vector.
5. The machine learning based paragraph identification method according to any of claims 1-4 wherein the acquiring image data corresponding to the context data comprises:
and determining the image data according to the text area corresponding to the context data in the first document.
6. A machine learning-based paragraph identification device for executing the machine learning-based paragraph identification method according to any one of claims 1 to 5, the paragraph identification device comprising:
a data acquisition module, used for acquiring context data to be merged and image data corresponding to the context data, wherein the context data is the text corresponding to the image data and consists of text from adjacent paragraphs in a Word document;
a feature extraction module, used for inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and for inputting the context data into a word vector model for vectorization to obtain a text feature vector corresponding to the context data;
a paragraph prediction module, used for inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result corresponding to the context data;
and a paragraph merging module, used for merging the text belonging to the same paragraph in the context data according to the paragraph prediction result.
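The four modules of claim 6 form a simple pipeline. A hedged sketch of that wiring, with the detector, word-vector model, and predictor passed in as hypothetical callables standing in for the trained models the claims assume:

```python
class ParagraphIdentifier:
    """Claim-6 device as a pipeline: acquisition, feature extraction,
    paragraph prediction, paragraph merging."""

    def __init__(self, detector, word_vector_model, predictor):
        self.detector = detector                  # target detection model
        self.word_vector_model = word_vector_model
        self.predictor = predictor                # paragraph prediction model

    def identify(self, context_data, image_data):
        # Feature extraction module.
        image_vec = self.detector(image_data)
        text_vec = self.word_vector_model(context_data)
        # Paragraph prediction module: True means "same paragraph".
        same_paragraph = self.predictor(image_vec, text_vec)
        # Paragraph merging module.
        if same_paragraph:
            return [" ".join(context_data)]
        return list(context_data)

# Toy stand-ins: identity features, predictor that always merges.
pid = ParagraphIdentifier(lambda img: img, lambda txt: txt,
                          lambda iv, tv: True)
merged = pid.identify(["first half,", "second half."], image_data=[0])
```

Keeping the models injectable mirrors the claim's modular structure: any detector/predictor pair satisfying the interfaces can be swapped in.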
7. A computer device, the computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is used for executing the computer program and implementing the machine learning-based paragraph identification method according to any one of claims 1 to 5 when the computer program is executed.
8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the machine learning-based paragraph identification method according to any one of claims 1 to 5.
CN202110467091.XA 2021-04-28 2021-04-28 Paragraph identification method, device, computer equipment and medium based on machine learning Active CN113159013B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467091.XA CN113159013B (en) 2021-04-28 2021-04-28 Paragraph identification method, device, computer equipment and medium based on machine learning

Publications (2)

Publication Number Publication Date
CN113159013A CN113159013A (en) 2021-07-23
CN113159013B true CN113159013B (en) 2024-05-07

Family

ID=76871926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467091.XA Active CN113159013B (en) 2021-04-28 2021-04-28 Paragraph identification method, device, computer equipment and medium based on machine learning

Country Status (1)

Country Link
CN (1) CN113159013B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435449B (en) * 2021-08-03 2023-08-22 全知科技(杭州)有限责任公司 OCR image character recognition and paragraph output method based on deep learning
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium
CN113723312B (en) * 2021-09-01 2024-01-23 东北农业大学 Rice disease identification method based on visual transducer

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918487A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 Intelligent answer method and system based on network encyclopedia
CN110188781A (en) * 2019-06-06 2019-08-30 焦点科技股份有限公司 A kind of ancient poetry text automatic identifying method based on deep learning
CN110472242A (en) * 2019-08-05 2019-11-19 腾讯科技(深圳)有限公司 A kind of text handling method, device and computer readable storage medium
CN110728117A (en) * 2019-08-27 2020-01-24 达而观信息科技(上海)有限公司 Paragraph automatic identification method and system based on machine learning and natural language processing
CN111222368A (en) * 2018-11-26 2020-06-02 北京金山办公软件股份有限公司 Method and device for identifying document paragraph and electronic equipment
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics
CN111860398A (en) * 2020-07-28 2020-10-30 河北师范大学 Remote sensing image target detection method and system and terminal equipment
CN112418209A (en) * 2020-12-15 2021-02-26 润联软件系统(深圳)有限公司 Character recognition method and device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
AU2019200270B2 (en) Concept mask: large-scale segmentation from semantic concepts
CN113159013B (en) Paragraph identification method, device, computer equipment and medium based on machine learning
US10482174B1 (en) Systems and methods for identifying form fields
US20210295114A1 (en) Method and apparatus for extracting structured data from image, and device
US11157816B2 (en) Systems and methods for selecting and generating log parsers using neural networks
US11816138B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN113205047B (en) Medicine name identification method, device, computer equipment and storage medium
US11557140B2 (en) Model-independent confidence values for extracted document information using a convolutional neural network
US20230138491A1 (en) Continuous learning for document processing and analysis
US10824808B2 (en) Robust key value extraction
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
EP4060526A1 (en) Text processing method and device
US11972625B2 (en) Character-based representation learning for table data extraction using artificial intelligence techniques
CN115862040A (en) Text error correction method and device, computer equipment and readable storage medium
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN114782722A (en) Image-text similarity determining method and device and electronic equipment
Kishor et al. Develop model for recognition of handwritten equation using machine learning
CN116030295A (en) Article identification method, apparatus, electronic device and storage medium
CN115880702A (en) Data processing method, device, equipment, program product and storage medium
Hu et al. Mathematical formula detection in document images: A new dataset and a new approach
CN113962221A (en) Text abstract extraction method and device, terminal equipment and storage medium
US20230368553A1 (en) Character-based representation learning for information extraction using artificial intelligence techniques
CN116431767B (en) Text image query method, device, electronic equipment and storage medium
CN116523032B (en) Image text double-end migration attack method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant