CN113159013A - Paragraph identification method and device based on machine learning, computer equipment and medium


Info

Publication number
CN113159013A
CN113159013A
Authority
CN
China
Prior art keywords
paragraph
vector
feature vector
target
feature
Prior art date
Legal status
Pending
Application number
CN202110467091.XA
Other languages
Chinese (zh)
Inventor
吴天博
王健宗
Current Assignee
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110467091.XA
Publication of CN113159013A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Abstract

The application relates to the field of artificial intelligence and realizes automatic identification and merging of erroneously segmented paragraphs in editable documents obtained by converting non-editable documents, thereby improving the usability of the editable documents. A paragraph recognition method, apparatus, computer device and medium based on machine learning are provided. The method includes: acquiring context data to be merged and image data corresponding to the context data; inputting the image data into a target detection model for feature extraction to obtain an image feature vector of the image data, and inputting the context data into a word vector model for vectorization to obtain a character feature vector; inputting the image feature vector and the character feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result for the context data; and merging the characters belonging to the same paragraph in the context data according to the paragraph prediction result. In addition, the present application relates to blockchain technology: the context data may be stored in a blockchain.

Description

Paragraph identification method and device based on machine learning, computer equipment and medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a method, an apparatus, a computer device, and a medium for paragraph recognition based on machine learning.
Background
A PDF document is a file in the portable document format and is very convenient to use. In practice, however, since a PDF document is non-editable, users often need to convert it into an editable WORD document. Existing document conversion methods mainly divide a PDF document into blocks, identify the characters, pictures, tables and other information in each block, and finally merge the information of all blocks so as to retain the format. However, the original paragraphs of the PDF document are often segmented incorrectly in the converted WORD document, so that its paragraphs do not correspond to the original ones, resulting in low usability of the WORD document.
Therefore, after converting an uneditable document into an editable document, how to improve the usability of the editable document becomes an urgent problem to be solved.
Disclosure of Invention
The application provides a paragraph identification method, apparatus, computer device and medium based on machine learning. Feature extraction is performed on context data and on the image data corresponding to the context data, and the resulting image feature vector and character feature vector are input into a paragraph prediction model, where they are fused for paragraph prediction. Erroneous segments in an editable document obtained by converting a non-editable document are thereby identified and merged automatically, without manual paragraph adjustment, which improves the usability of the editable document.
In a first aspect, the present application provides a paragraph recognition method based on machine learning, the method including:
acquiring context data to be combined and image data corresponding to the context data, wherein the context data are characters corresponding to the image data;
inputting the image data into a target detection model for feature extraction to obtain image feature vectors corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain character feature vectors corresponding to the context data;
inputting the image feature vector and the character feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result corresponding to the context data;
and merging the characters belonging to the same paragraph in the context data according to the paragraph prediction result.
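The four steps of the first aspect can be sketched as follows; `detector`, `word_model`, and `predictor` are hypothetical stand-ins for the trained models described later in the application, not names taken from it:

```python
def merge_paragraphs(context_data, image_data, detector, word_model, predictor):
    """Sketch of the claimed method: extract features from both
    modalities, predict whether the lines form one paragraph, then
    merge the text accordingly."""
    image_vec = detector(image_data)            # image feature vector
    text_vec = word_model(context_data)         # character feature vector
    same_paragraph = predictor(image_vec, text_vec)  # paragraph prediction result
    if same_paragraph:
        return " ".join(context_data)           # merge into one paragraph
    return "\n".join(context_data)              # keep the original split
```

The conditional at the end corresponds to the merging step: characters predicted to belong to the same paragraph are joined, otherwise the existing segmentation is kept.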
In a second aspect, the present application further provides a paragraph recognition apparatus based on machine learning, the apparatus including:
the data acquisition module is used for acquiring context data to be combined and acquiring image data corresponding to the context data, wherein the context data are characters corresponding to the image data;
the feature extraction module is used for inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain a character feature vector corresponding to the context data;
a paragraph prediction module, configured to input the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, so as to obtain a paragraph prediction result corresponding to the context data;
and the paragraph merging module is used for merging the characters belonging to the same paragraph in the context data according to the paragraph prediction result.
In a third aspect, the present application further provides a computer device comprising a memory and a processor;
the memory for storing a computer program;
the processor is configured to execute the computer program and implement the paragraph identification method based on machine learning as described above when the computer program is executed.
In a fourth aspect, the present application also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the method for paragraph identification based on machine learning as described above.
The application discloses a paragraph identification method, apparatus, computer device and medium based on machine learning. By obtaining the context data to be merged and the image data corresponding to the context data, feature extraction can subsequently be performed on each, and the two resulting feature vectors fused. Inputting the image data into the target detection model for feature extraction yields the image feature vector corresponding to the image data; inputting the context data into the word vector model for vectorization yields the character feature vector corresponding to the context data. Inputting the image feature vector and the character feature vector into the paragraph prediction model enables paragraph prediction after multi-modal information fusion, improving the accuracy of the paragraph prediction result for the context data. Finally, merging the characters that belong to the same paragraph in the context data according to the paragraph prediction result realizes automatic identification and merging of erroneous segments in the editable document converted from the non-editable document; no manual paragraph adjustment is needed, and the usability of the editable document is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
Fig. 1 is a schematic flow chart of a paragraph identification method based on machine learning according to an embodiment of the present application;
FIG. 2 is a schematic diagram of image data provided by an embodiment of the present application;
FIG. 3 is a diagram illustrating paragraph prediction for context data according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a sub-step of feature extraction on image data provided by an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a preprocessing of an image feature vector according to an embodiment of the present disclosure;
FIG. 6 is a diagram illustrating an exemplary pre-processing of text feature vectors according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of a paragraph prediction model provided in an embodiment of the present application;
FIG. 8 is a schematic flow chart diagram of a sub-step of paragraph prediction according to an embodiment of the present application;
FIG. 9 is a schematic diagram of paragraph prediction according to an embodiment of the present application;
FIG. 10 is a block diagram illustrating an apparatus for identifying paragraphs based on machine learning according to an embodiment of the present application;
fig. 11 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiments of the present application provide a paragraph identification method and apparatus based on machine learning, a computer device, and a medium. The paragraph identification method based on machine learning can be applied to a server or a terminal. By performing feature extraction on context data and the image data corresponding to the context data, and inputting the resulting image feature vector and character feature vector into a paragraph prediction model to be fused for paragraph prediction, erroneous segments in an editable document obtained by converting a non-editable document can be identified and merged automatically, so that no manual paragraph adjustment is required and the usability of the editable document is improved.
The server may be an independent server or a server cluster. The terminal can be an electronic device such as a smart phone, a tablet computer, a notebook computer, a desktop computer and the like.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
As shown in fig. 1, the paragraph recognition method based on machine learning includes steps S10 to S40.
Step S10, obtaining context data to be merged, and obtaining image data corresponding to the context data, where the context data is a text corresponding to the image data.
It should be noted that the paragraph identification method based on machine learning provided by the embodiments of the present application may be applied to scenarios in which a non-editable document is converted into an editable document. The non-editable document may be a PDF document; the editable document may be a WORD document. It will be appreciated that the most important requirement in document type conversion is to preserve the original format. Since PDF documents do not store the format themselves, preserving the format presupposes identifying the exact location of each formatted element. In this location-based conversion approach, the concept of a paragraph is diluted, so the converted WORD document does not retain the original paragraph format.
In some embodiments, obtaining context data to be merged may include: performing type conversion on a first document to be subjected to document type conversion based on a preset type conversion strategy to obtain a corresponding second document; and determining the context data to be merged according to the characters of every two adjacent paragraphs in the second document.
For example, the first document to be subjected to document type conversion may be a PDF document; the second document may be a WORD document, such as a document in DOC format or DOCX format.
For example, the preset type conversion strategy may include an OCR (Optical Character Recognition) technology or a PDF to WORD tool. The OCR technology is used for analyzing and recognizing documents such as pictures and tables to obtain text and layout information.
For example, when performing type conversion on a PDF document to be subjected to document type conversion, OCR technology may be adopted to locate, identify and convert various data formats (characters, pictures, tables) in the PDF document, so as to obtain a WORD document.
It should be noted that the WORD document obtained by type conversion of the PDF document includes a plurality of paragraphs, but these paragraphs may not correspond one-to-one to the paragraphs in the PDF document; a single original paragraph is easily split into several lines of characters, so the erroneous segments in the WORD document need to be identified and merged.
For example, when determining the context data to be merged according to the WORDs of every two adjacent paragraphs in the second document, the context data to be merged may be determined according to the WORDs of every two adjacent paragraphs in the WORD document. The context data to be merged may thus comprise the text of two adjacent paragraphs or may comprise the text of a plurality of adjacent paragraphs.
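As an illustration, collecting the text of every two adjacent paragraphs as candidate context data might look like the following sketch; representing the converted document as a list of paragraph strings is an assumption for the example:

```python
def adjacent_paragraph_pairs(paragraphs):
    """Yield the text of every two adjacent paragraphs in the converted
    document as one candidate context to be checked for erroneous
    segmentation (context data to be merged)."""
    return [(paragraphs[i], paragraphs[i + 1]) for i in range(len(paragraphs) - 1)]
```

Each returned pair is one unit of context data; the paragraph prediction model later decides whether the pair should be merged.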
To further ensure privacy and security of the context data, the context data may be stored in a node of a blockchain.
Illustratively, the context data to be merged is as follows:
the federated learning technique can safely conduct multi-party data applications without the privacy being known by the participants. Bank study in federal
When the method is applied, data exchange among federates lacks data security and privacy protection requirements, and business application has security risks. Phasing is required.
When context data is translated, the translation result is:
Federated learning technology can be used for secure multi-party data applications without the privacy being known by the participants.Banks study in the federal government
In the application,the data exchange between federates lacks data security and privacy protection requirements,and the business application has security risks.There is an urgent need to formulate a plan.
Here, what should originally be one coherent sentence is split into two fragments, "Banks study in the federal government" and "In the application", which results in translation errors. Therefore, paragraph recognition needs to be performed on the context, and sentences belonging to the same paragraph must be put together, thereby increasing the usability of the converted document.
In some embodiments, obtaining image data corresponding to the context data may include: and determining image data according to the text area corresponding to the context data in the first document.
Referring to fig. 2, fig. 2 is a schematic diagram of image data according to an embodiment of the present disclosure. As shown in fig. 2, after determining the context data to be merged, a text region in the PDF document corresponding to the context data may be determined as image data.
For example, a text region in the PDF document corresponding to the context data may be subjected to screen capture, thereby obtaining image data. It is understood that the context data is text corresponding to the image data.
In some embodiments, after the image data corresponding to the context data is obtained, image preprocessing may be performed on the image data. Image pre-processing may include, but is not limited to, binarization, laplacian sharpening, and rotation correction, among others. For example, when binarizing image data, a pixel point with an RGB value smaller than 127 may be set to be 0, and a pixel point with an RGB value larger than 127 may be set to be 255.
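The binarization rule described above (values below 127 mapped to 0, above 127 to 255) can be sketched with NumPy; OpenCV's `cv2.threshold` with the `THRESH_BINARY` flag would be an equivalent choice. How the boundary value 127 itself is mapped is not fixed by the description, so mapping it to 0 here is an assumption:

```python
import numpy as np

def binarize(gray_image, threshold=127):
    """Binarization preprocessing: map pixels above the threshold to
    255 and all others (including the threshold itself, by assumption)
    to 0."""
    return np.where(gray_image > threshold, 255, 0).astype(np.uint8)
```

Laplacian sharpening and rotation correction, the other preprocessing steps mentioned, are likewise standard image operations available in OpenCV.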
By acquiring the context data to be combined and the image data corresponding to the context data, feature extraction can be respectively carried out on the context data and the image data subsequently, and the two obtained feature vectors are fused, so that paragraph prediction is carried out after multi-modal information fusion is realized.
Step S20, inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain a text feature vector corresponding to the context data.
Referring to fig. 3, fig. 3 is a schematic diagram illustrating paragraph prediction on context data according to an embodiment of the present application. As shown in fig. 3, inputting image data into a target detection model for feature extraction to obtain image feature vectors corresponding to the image data, and inputting context data into a word vector model for vectorization to obtain character feature vectors corresponding to the context data; and inputting the image feature vector and the character feature vector into a paragraph prediction model to perform paragraph prediction, and obtaining a paragraph prediction result corresponding to the context data. Therefore, the method and the device realize the paragraph prediction after the multi-mode information is fused, and improve the accuracy of the paragraph prediction result corresponding to the context data.
Exemplary target detection models may include, but are not limited to, the Fast R-CNN (Fast Region-based Convolutional Neural Network) model, the SSD (Single Shot Detector) model, and the YOLO model. The target detection model includes at least a region generation network layer and a feature extraction layer. It should be noted that the region generation network layer may include an RPN (Region Proposal Network) layer for generating a plurality of candidate boxes, and the feature extraction layer may include an ROI (Region of Interest) Pooling layer for extracting the features of each candidate box through a convolutional neural network.
In the embodiment of the present application, the target detection model may be a Fast R-CNN model, and how to perform feature extraction on image data will be described in detail below by taking the Fast R-CNN model as an example.
Referring to fig. 4, fig. 4 is a schematic flowchart of a sub-step of performing feature extraction on image data according to an embodiment of the present application, and specifically includes the following steps S201 to S203.
Step S201, generating a network layer based on the area, adding candidate frames to each line of characters in the image data, and sequentially determining each two adjacent lines as a first target line and a second target line.
For example, the image data may be input into an area generation network layer, and the area generation network layer adds candidate boxes to each line of characters in the image data; and then sequentially determining every two adjacent rows as a first target row and a second target row.
By adding candidate frames to characters in each line of the image data, each two adjacent lines are sequentially determined as a first target line and a second target line, and then image feature vectors can be determined according to the candidate frames in the first target line and the second target line.
Step S202, based on the feature extraction layer, determining a first position feature vector corresponding to a last candidate box in the first target row, and determining a second position feature vector corresponding to a first candidate box in the second target row.
In the embodiment of the application, during feature extraction, feature extraction is mainly performed on characters which are between two punctuations and have line crossing. Therefore, the position feature vector between two punctuations and corresponding to the cross-line text can be obtained by extracting the first position feature vector corresponding to the last candidate frame in the first target line and extracting the second position feature vector corresponding to the first candidate frame in the second target line.
For example, a first position feature vector corresponding to the last candidate box in the first target row may be extracted, and a second position feature vector corresponding to the first candidate box in the second target row may be extracted by the feature extraction layer.
It should be noted that the Fast R-CNN model uses a VGG16 network structure as a basic model, convolves image data by a plurality of convolution layers to obtain feature vectors of different scales, and the feature vectors are used for predicting position information corresponding to the image data.
For example, the feature extraction layer may perform convolution processing on the candidate frame by the convolution layer, so that the position feature vector may be obtained.
Step S203, determining the image feature vector according to the first position feature vector and the second position feature vector.
For example, after obtaining the first position feature vector and the second position feature vector, the first position feature vector and the second position feature vector may be determined as the image feature vector. At this time, the image feature vector may include a plurality of position feature vectors, for example, a first position feature vector and a second position feature vector.
By determining the first position feature vector corresponding to the last candidate frame in the first target line and determining the second position feature vector corresponding to the first candidate frame in the second target line, feature extraction can be performed on the characters which are between two punctuation marks and have line crossing, the image feature vector containing the position features of the characters is obtained, and the accuracy of subsequent paragraph prediction is improved.
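Combining the two position feature vectors into one image feature vector can be as simple as concatenation; the sketch below uses NumPy, and the concatenation itself is an illustrative assumption, since the application does not fix the combination operation:

```python
import numpy as np

def image_feature_vector(first_pos_vec, second_pos_vec):
    """Combine the position feature of the last candidate box in the
    first target line and that of the first candidate box in the
    second target line into one image feature vector."""
    return np.concatenate([first_pos_vec, second_pos_vec])
```

The result carries the position features of the cross-line characters between two punctuation marks, which the paragraph prediction model consumes in the next step.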
In some embodiments, the context data may be input into a word vector model for vectorization, so as to obtain a character feature vector corresponding to the context data. The word vector model may include, but is not limited to, the BERT (Bidirectional Encoder Representations from Transformers) model, the word2vec model, the GloVe model, and the ELMo model. In the embodiment of the present application, the BERT model is taken as an example to describe in detail how to vectorize the context data.
Illustratively, the BERT model may be a pre-trained model. For example, the BERT model may be trained in advance using a large-scale text corpus that is not related to a specific NLP (Natural Language Processing) task, so as to obtain a trained BERT model. During training, the BERT model may take semantic vector representations of the target word and each word of the context as input through an Attention mechanism; then obtaining vector representation of the target word, vector representation of each word of the context and original value representation of the target word and each word of the context through linear transformation; and finally, calculating the similarity between the vector of the target word and the vector of each word of the context as weight, and performing weighted fusion on the vector of the target word and the vectors of each upper character and each lower character to be used as the output of Attention, namely the enhanced semantic vector representation of the target word.
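A simplified sketch of the attention computation described above, with similarity scores used as softmax weights for fusing the context word vectors. Using the raw dot product as the similarity is an assumption for the example; BERT additionally applies learned linear transformations to obtain queries, keys and values:

```python
import numpy as np

def attention_enhanced(target_vec, context_vecs):
    """Weight each context word vector by its similarity to the target
    word, then fuse them by a weighted sum: a simplified sketch of the
    enhanced semantic vector described in the Attention mechanism."""
    scores = context_vecs @ target_vec            # similarity as raw weights
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                      # softmax normalization
    return weights @ context_vecs                 # weighted fusion
```

When all context vectors agree, the fused output simply reproduces them; disagreement shifts the output toward the vectors most similar to the target word.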
Illustratively, the context data is input into a trained BERT model for vectorization, and a character feature vector corresponding to the context data is obtained. It should be noted that, by inputting the context data into the BERT model for vectorization, a semantic enhanced character feature vector can be obtained.
Step S30, inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, and obtaining a paragraph prediction result corresponding to the context data.
In some embodiments, before inputting the image feature vector and the text feature vector into the paragraph prediction model for paragraph prediction, the method may further include: and respectively preprocessing the image characteristic vector and the character characteristic vector to obtain a target image characteristic vector and a target character characteristic vector.
Exemplary preprocessing may include weight value assignment, residual connection, and normalization. In the embodiment of the present application, weight value assignment may be implemented by a self-attention layer, and residual connection and normalization may be implemented by a forward propagation (feed-forward) layer. Residual connection refers to taking the sum of the input and output of the current layer as the input of the next layer: if the input of the current layer is X and its output is F(X), the input of the next layer is F(X) + X.
It should be noted that weight value assignment increases the correlation among the features themselves. Residual connection mitigates vanishing gradients: because the input of the current layer is combined with its output to form the input of the next layer, even an erroneous output of the current layer cannot have a destructive influence. Normalization improves the stability of the paragraph prediction model and prevents abnormal data from affecting it. Therefore, preprocessing the image feature vector and the character feature vector separately can effectively improve the generalization of the paragraph prediction model.
Referring to fig. 5, fig. 5 is a schematic diagram of preprocessing an image feature vector according to an embodiment of the present application. As shown in fig. 5, the image feature vector output by the target detection model may be input into the self-attention weight layer for weight value assignment, obtaining a weighted image feature vector; the weighted image feature vector is then input into the feed-forward layer for residual connection and normalization, obtaining the target image feature vector.
Referring to fig. 6, fig. 6 is a schematic diagram of preprocessing a text feature vector according to an embodiment of the present application. As shown in fig. 6, the text feature vector may be input into the self-attention weight layer for weight value assignment, obtaining a weighted text feature vector; the weighted text feature vector is then input into the feed-forward layer for residual connection and normalization, obtaining the target text feature vector.
In the embodiment of the present application, the paragraph prediction model may be a pre-trained model. The training process of the paragraph prediction model may include: acquiring a preset number of training text samples and the training image data corresponding to them; determining a first feature vector corresponding to the training image data based on the target detection model, and determining a second feature vector corresponding to the training text data based on the word vector model; and inputting the first feature vector and the second feature vector into the paragraph prediction model for iterative training until the model converges, obtaining the trained paragraph prediction model.
For example, when the first feature vector and the second feature vector are input into the paragraph prediction model for iterative training, the training sample data for each round may be determined from the first feature vector and the second feature vector; the sample data of the current round is input into the paragraph prediction model for paragraph prediction training, yielding a paragraph prediction result corresponding to that sample data; a loss function value corresponding to the paragraph prediction result is determined based on a preset loss function; and if the loss function value is larger than a preset loss value threshold, the parameters of the paragraph prediction model are adjusted and the next round of training is carried out, until the loss function value is smaller than or equal to the threshold, at which point training is finished and the trained paragraph prediction model is obtained.
For example, the preset loss function may include, but is not limited to, a 0-1 loss function, an absolute value loss function, a logarithmic loss function, a quadratic loss function, an exponential loss function, and the like.
The preset loss value threshold may be set according to actual conditions, and the specific value is not limited herein.
Illustratively, when adjusting the parameters of the paragraph prediction model, the above-mentioned adjustment may be implemented by a gradient descent algorithm or a back propagation algorithm.
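The round-based loop described above (compute the round loss, compare against the threshold, adjust parameters, repeat) can be sketched as follows; the model, loss, and update step here are toy placeholders, not the patent's actual network:

```python
def train_until_converged(model, samples, loss_fn, update_fn,
                          loss_threshold=0.01, max_rounds=100):
    """Round-based training: compute the loss for the round and,
    while it exceeds the preset threshold, adjust the parameters
    and start the next round."""
    round_loss = float("inf")
    for _ in range(max_rounds):
        total = 0.0
        for features, label in samples:
            total += loss_fn(model(features), label)
        round_loss = total / len(samples)
        if round_loss <= loss_threshold:
            break          # converged: training is finished
        update_fn()        # e.g. a gradient-descent step
    return round_loss

# Toy one-weight model with a quadratic loss and a fixed update step.
state = {"w": 0.0}
final_loss = train_until_converged(
    model=lambda x: state["w"] * x,
    samples=[(1.0, 1.0)],
    loss_fn=lambda pred, label: (pred - label) ** 2,
    update_fn=lambda: state.update(w=state["w"] + 0.5),
)
```

In practice `update_fn` would be a gradient descent or backpropagation step as stated above; the fixed increment is only to keep the sketch self-contained.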
By training the paragraph prediction model, the accuracy of the trained model can be improved; by calculating the loss function value of each round of training and adjusting the parameters of the paragraph prediction model accordingly, the time required for the paragraph prediction model to converge can be reduced and training can be sped up.
To further ensure the privacy and security of the trained paragraph prediction model, it may also be stored in a node of a blockchain. When the trained paragraph prediction model needs to be used, it can be retrieved from the blockchain node.
In some embodiments, inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction may include: and inputting the target image feature vector and the target character feature vector into a paragraph prediction model to perform paragraph prediction.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a paragraph prediction model according to an embodiment of the present application. As shown in fig. 7, the paragraph prediction model includes a cross attention layer, a fusion layer, a self-attention weight layer, a fully connected layer, and an output layer. It should be noted that the cross attention layer is used to calculate the semantic correlation between two feature vectors; the fusion layer is used to perform weighted fusion of the feature vectors according to the semantic correlation; and the self-attention weight layer is used to establish the connection and interaction between the two feature vectors.
By inputting the image feature vector and the character feature vector into the paragraph prediction model for paragraph prediction, the paragraph prediction can be performed after multi-mode information fusion, and the accuracy of the paragraph prediction result corresponding to the context data is improved.
Referring to fig. 8, fig. 8 is a schematic flowchart of sub-steps of performing paragraph prediction according to an embodiment of the present application, which may specifically include the following steps S301 to S304.
Step S301, inputting the target image feature vector and the target character feature vector into the cross attention layer for semantic correlation calculation, and obtaining a semantic correlation matrix corresponding to the target character feature vector.
Referring to fig. 9, fig. 9 is a schematic diagram illustrating paragraph prediction according to an embodiment of the present application. As shown in fig. 9, the target image feature vector and the target text feature vector may be input into the cross attention layer for semantic relevance calculation.
It should be noted that the cross attention layer may perform the semantic correlation calculation through a similarity algorithm, which may include, but is not limited to, Euclidean distance, cosine similarity, the Jaccard similarity coefficient, and the Pearson correlation coefficient.
Illustratively, the semantic correlation between the target image feature vector and the target text feature vector can be calculated based on cosine similarity:

R_{i,j} = (V_i · L_j) / (‖V_i‖ ‖L_j‖)

where V represents the target image feature vector, L represents the target text feature vector, and R_{i,j} represents the semantic correlation matrix.
Inputting the target image feature vector and the target text feature vector into the cross attention layer for semantic correlation calculation thus extracts the similarity between them.
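The cosine-similarity matrix above can be computed with a short sketch; the function names and toy vectors are illustrative assumptions:

```python
import math

def cosine_similarity(v, l):
    # R_{i,j} = (V_i . L_j) / (||V_i|| * ||L_j||)
    dot = sum(a * b for a, b in zip(v, l))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_l = math.sqrt(sum(b * b for b in l))
    return dot / (norm_v * norm_l)

def semantic_correlation(image_feats, text_feats):
    # R[i][j]: cosine similarity between image feature i and text feature j.
    return [[cosine_similarity(v, l) for l in text_feats]
            for v in image_feats]

R = semantic_correlation([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
# R[0][0] -> 1.0 (same direction), R[0][1] -> 0.0 (orthogonal)
```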
Step S302, inputting the target character feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target character feature vector.
In some embodiments, inputting the target text feature vector and the semantic correlation matrix into the fusion layer to obtain the feature fusion vector corresponding to the target text feature vector includes: determining a convolution kernel corresponding to the fusion layer according to the semantic correlation matrix; and convolving the target text feature vector with that kernel to obtain the feature fusion vector.
Illustratively, the semantic correlation matrix may be of size n × n. In the embodiment of the present application, the semantic correlation matrix may be used directly as the convolution kernel, so that the kernel size is n × n. The value of n may be determined according to actual conditions and is not limited herein.
For example, during convolution the kernel slides over the target text feature vector and a dot product is taken at each position. The stride may be set according to actual conditions, and the specific value is not limited herein.
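A minimal sketch of this sliding dot-product convolution, assuming the target text feature representation is a 2-D matrix (this shape, and the toy all-ones data, are assumptions for illustration):

```python
def convolve(features, kernel, stride=1):
    # Valid 2-D convolution: slide the n x n kernel over the feature
    # matrix with the given stride, taking a dot product at each stop.
    n = len(kernel)
    out_rows = (len(features) - n) // stride + 1
    out_cols = (len(features[0]) - n) // stride + 1
    result = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            acc = 0.0
            for a in range(n):
                for b in range(n):
                    acc += features[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        result.append(row)
    return result

fused = convolve([[1.0] * 3 for _ in range(3)], [[1.0, 1.0], [1.0, 1.0]])
# 3x3 ones convolved with a 2x2 ones kernel -> 2x2 output, every entry 4.0
```

Here the kernel plays the role of the semantic correlation matrix, so regions of the text features that correlate strongly with the image features receive larger weights in the fused output.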
Inputting the target text feature vector and the semantic correlation matrix into the fusion layer yields the feature fusion vector, which fuses the target image feature vector with the target text feature vector and improves the accuracy of subsequent paragraph prediction.
Step S303, inputting the feature fusion vector into the self-attention weight layer for weight value distribution, and obtaining a target feature fusion vector.
In some embodiments, before inputting the feature fusion vector into the self-attention weight layer for weight value assignment to obtain the target feature fusion vector, the method may further include: performing residual connection and normalization on the feature fusion vector to obtain a normalized feature fusion vector.
For example, the feature fusion vectors may be sequentially subjected to residual connection and normalization to obtain normalized feature fusion vectors.
In some embodiments, inputting the feature fusion vector into the self-attention weight layer for weight value assignment to obtain the target feature fusion vector includes: inputting the normalized feature fusion vector into the self-attention weight layer for weight value assignment to obtain the target feature fusion vector.
Illustratively, the normalized feature fusion vector is input into the self-attention weight layer for weight value assignment, yielding the target feature fusion vector, in the standard scaled dot-product self-attention form:

F̂ = softmax(Q Kᵀ / √d_k) V

where Q, K, and V are projections of the normalized feature fusion vector, d_k is the dimension of K, and F̂ represents the target feature fusion vector.
It should be noted that, in the embodiment of the present application, residual connection and normalization may be performed on the data each time it passes through a sub-layer.
Inputting the feature fusion vector into the self-attention weight layer for weight value assignment can increase the correlation between the features themselves, and performing residual connection and normalization on the feature fusion vector can improve the accuracy of subsequent paragraph prediction.
Step S304, sequentially inputting the target feature fusion vector into the fully connected layer and the output layer to obtain the paragraph prediction result.
After the target feature fusion vector is obtained, it is input into the fully connected layer and the output layer, yielding the paragraph prediction result corresponding to the context data.
Note that the fully connected layer (FC) connects all the features of the previous layer and sends its output to the output layer. The output layer classifies the values received from the fully connected layer, for example by means of a softmax function. In the embodiment of the present application, the output layer may output the paragraph prediction result corresponding to the context data.
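A minimal sketch of the fully connected layer followed by a softmax output; the weights, bias, and input here are toy values chosen for illustration, not learned parameters:

```python
import math

def fully_connected(x, weights, bias):
    # Each output neuron connects to all upstream features.
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)                      # subtract max for stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Two output classes for paragraph prediction (e.g. same / not same paragraph).
logits = fully_connected([0.6, -0.4],
                         [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
probs = softmax(logits)
prediction = max(range(len(probs)), key=probs.__getitem__)
```

The argmax over the softmax probabilities gives the predicted class, which the method below maps to the 0/1 paragraph prediction result.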
Step S40, merging the text belonging to the same paragraph in the context data according to the paragraph prediction result.
Illustratively, the paragraph prediction result may be 0 or 1, where 1 indicates the same paragraph and 0 indicates different paragraphs.
For example, when the paragraph prediction result is 1, the two paragraphs of text in the context data belong to the same paragraph, and the text belonging to the same paragraph may be merged. When the paragraph prediction result is 0, the two paragraphs of text in the context data do not belong to the same paragraph, and they do not need to be merged.
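The merging rule above can be sketched as follows; the function name and the example strings are illustrative assumptions:

```python
def merge_paragraphs(paragraphs, predictions):
    # predictions[i] is the model output for the adjacent pair
    # (paragraphs[i], paragraphs[i + 1]): 1 = same paragraph, 0 = not.
    merged = [paragraphs[0]]
    for text, same in zip(paragraphs[1:], predictions):
        if same == 1:
            merged[-1] += text   # wrongly split paragraph: rejoin
        else:
            merged.append(text)  # genuine paragraph boundary: keep
    return merged

result = merge_paragraphs(
    ["The model was ", "trained on scanned pages.", "A new topic starts."],
    [1, 0])
# -> ["The model was trained on scanned pages.", "A new topic starts."]
```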
By merging the text belonging to the same paragraph in the context data according to the paragraph prediction result, erroneous paragraph breaks in an editable document converted from a non-editable document are automatically identified and merged, no manual paragraph adjustment is needed, and the usability of the editable document is improved.
In the machine-learning-based paragraph identification method provided by the above embodiment, by obtaining the context data to be merged and the image data corresponding to it, features can subsequently be extracted from the context data and the image data respectively and the two resulting feature vectors fused, so that paragraph prediction is performed after multi-modal information fusion; inputting the context data into a BERT model for vectorization yields a semantically enhanced text feature vector; preprocessing the image feature vector and the text feature vector respectively can effectively improve the generalization of the paragraph prediction model; training the paragraph prediction model can improve the accuracy of the trained model; calculating the loss function value of each round of training and adjusting the parameters of the paragraph prediction model accordingly can reduce the time required for convergence and speed up training; inputting the image feature vector and the text feature vector into the paragraph prediction model for paragraph prediction after multi-modal information fusion improves the accuracy of the paragraph prediction result corresponding to the context data; inputting the target image feature vector and the target text feature vector into the cross attention layer for semantic correlation calculation extracts the similarity between them; inputting the target text feature vector and the semantic correlation matrix into the fusion layer yields the feature fusion vector, which fuses the target image feature vector with the target text feature vector and improves the accuracy of subsequent paragraph prediction; inputting the feature fusion vector into the self-attention weight layer for weight value assignment increases the correlation between the features themselves; performing residual connection and normalization on the feature fusion vector further improves the accuracy of subsequent paragraph prediction; and merging the text belonging to the same paragraph in the context data according to the paragraph prediction result realizes automatic identification and merging of erroneous paragraph breaks in an editable document converted from a non-editable document, eliminating manual paragraph adjustment and improving the usability of the editable document.
Referring to fig. 10, fig. 10 is a schematic block diagram of a machine-learning-based paragraph recognition apparatus 1000 according to an embodiment of the present application, which is used to execute the aforementioned machine-learning-based paragraph recognition method. The apparatus may be configured in a server or a terminal.
As shown in fig. 10, the paragraph recognition apparatus 1000 based on machine learning includes: a data acquisition module 1001, a feature extraction module 1002, a paragraph prediction module 1003, and a paragraph merge module 1004.
The data obtaining module 1001 is configured to obtain context data to be merged and obtain image data corresponding to the context data, where the context data is a text corresponding to the image data.
The feature extraction module 1002 is configured to input the image data into a target detection model for feature extraction, to obtain an image feature vector corresponding to the image data, and input the context data into a word vector model for vectorization, to obtain a text feature vector corresponding to the context data.
A paragraph prediction module 1003, configured to input the image feature vector and the text feature vector into a paragraph prediction model to perform paragraph prediction, so as to obtain a paragraph prediction result corresponding to the context data.
A paragraph merging module 1004, configured to merge the words belonging to the same paragraph in the context data according to the paragraph prediction result.
It should be noted that, as will be clear to those skilled in the art, for convenience and brevity of description, the specific working processes of the apparatus and the modules described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The apparatus described above may be implemented in the form of a computer program which is executable on a computer device as shown in fig. 11.
Referring to fig. 11, fig. 11 is a schematic block diagram of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal.
Referring to fig. 11, the computer device includes a processor and a memory connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by the processor, causes the processor to perform any of the methods for paragraph identification based on machine learning.
It should be understood that the processor may be a Central Processing Unit (CPU), or another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general purpose processor may be a microprocessor or any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring context data to be combined and image data corresponding to the context data, wherein the context data are characters corresponding to the image data; inputting the image data into a target detection model for feature extraction to obtain image feature vectors corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain character feature vectors corresponding to the context data; inputting the image feature vector and the character feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result corresponding to the context data; and merging the characters belonging to the same paragraph in the context data according to the paragraph prediction result.
In one embodiment, the object detection model includes at least a region generation network layer and a feature extraction layer; when the processor inputs the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, the processor is used for realizing:
adding candidate boxes to each line of text in the image data based on the region generation network layer, and sequentially determining every two adjacent lines as a first target line and a second target line; determining a first position feature vector corresponding to the last candidate box in the first target line and a second position feature vector corresponding to the first candidate box in the second target line based on the feature extraction layer; and determining the image feature vector according to the first position feature vector and the second position feature vector.
In one embodiment, the processor, prior to implementing the paragraph prediction by inputting the image feature vector and the text feature vector into a paragraph prediction model, is further configured to implement:
and respectively preprocessing the image characteristic vector and the character characteristic vector to obtain a target image characteristic vector and a target character characteristic vector, wherein the preprocessing comprises weight value distribution, residual connection and normalization.
In one embodiment, the processor, when implementing paragraph prediction by inputting the image feature vector and the text feature vector into a paragraph prediction model, is configured to implement:
and inputting the target image feature vector and the target character feature vector into the paragraph prediction model for paragraph prediction.
In one embodiment, the paragraph prediction model includes a cross attention layer, a fusion layer, a self-attention weight layer, a fully-connected layer, and an output layer; when the processor is used for inputting the target image feature vector and the target character feature vector into a paragraph prediction model for paragraph prediction, the processor is used for realizing that:
inputting the target image feature vector and the target character feature vector into the cross attention layer to perform semantic correlation calculation, and obtaining a semantic correlation matrix corresponding to the target character feature vector; inputting the target character feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target character feature vector; inputting the feature fusion vector into the self-attention weight layer to carry out weight value distribution to obtain a target feature fusion vector; and sequentially inputting the target feature fusion vector into the full-connection layer and the output layer to obtain the paragraph prediction result.
In one embodiment, when the processor is implemented to input the target text feature vector and the semantic relation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target text feature vector, the processor is configured to implement:
determining a convolution kernel corresponding to the fusion layer according to the semantic correlation matrix; and performing convolution on the target character feature vector based on the convolution kernel to obtain the feature fusion vector.
In one embodiment, before implementing the inputting of the feature fusion vector into the self-attention weighting layer for weight value assignment to obtain a target feature fusion vector, the processor is further configured to implement:
and performing residual connection and normalization on the feature fusion vector to obtain the normalized feature fusion vector.
In one embodiment, the processor, when implementing inputting the feature fusion vector into the self-attention weighting layer for weight value assignment to obtain a target feature fusion vector, is configured to implement:
and inputting the normalized feature fusion vector into the self-attention weight layer for weight value distribution to obtain the target feature fusion vector.
In one embodiment, the processor, when being configured to obtain context data to be merged, is configured to:
performing type conversion on a first document to be subjected to document type conversion based on a preset type conversion strategy to obtain a corresponding second document; and determining the context data to be merged according to the characters of every two adjacent paragraphs in the second document.
In one embodiment, the processor, when being configured to obtain the image data corresponding to the context data, is configured to:
and determining the image data according to a text area corresponding to the context data in the first document.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, where the computer program includes program instructions, and the processor executes the program instructions to implement any paragraph identification method based on machine learning provided in the embodiment of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital Card (SD Card), a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain referred by the application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A paragraph identification method based on machine learning is characterized by comprising the following steps:
acquiring context data to be combined and image data corresponding to the context data, wherein the context data are characters corresponding to the image data;
inputting the image data into a target detection model for feature extraction to obtain image feature vectors corresponding to the image data, and inputting the context data into a word vector model for vectorization to obtain character feature vectors corresponding to the context data;
inputting the image feature vector and the character feature vector into a paragraph prediction model for paragraph prediction to obtain a paragraph prediction result corresponding to the context data;
and merging the characters belonging to the same paragraph in the context data according to the paragraph prediction result.
2. The machine-learning-based paragraph recognition method of claim 1, wherein the object detection model comprises at least a region generation network layer and a feature extraction layer; the inputting the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data includes:
adding candidate boxes to each line of text in the image data based on the region generation network layer, and sequentially determining every two adjacent lines as a first target line and a second target line;
determining a first position feature vector corresponding to a last candidate frame in the first target row and determining a second position feature vector corresponding to a first candidate frame in the second target row based on the feature extraction layer;
and determining the image feature vector according to the first position feature vector and the second position feature vector.
3. The method of claim 1, wherein before entering the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, the method further comprises:
respectively preprocessing the image characteristic vector and the character characteristic vector to obtain a target image characteristic vector and a target character characteristic vector, wherein the preprocessing comprises weight value distribution, residual connection and normalization;
inputting the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, including:
and inputting the target image feature vector and the target character feature vector into the paragraph prediction model for paragraph prediction.
4. The machine-learning-based paragraph recognition method of claim 3, wherein the paragraph prediction model comprises a cross attention layer, a fusion layer, a self attention weight layer, a fully connected layer, and an output layer;
inputting the target image feature vector and the target character feature vector into a paragraph prediction model for paragraph prediction, including:
inputting the target image feature vector and the target character feature vector into the cross attention layer to perform semantic correlation calculation, and obtaining a semantic correlation matrix corresponding to the target character feature vector;
inputting the target character feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target character feature vector;
inputting the feature fusion vector into the self-attention weight layer to carry out weight value distribution to obtain a target feature fusion vector;
and sequentially inputting the target feature fusion vector into the full-connection layer and the output layer to obtain the paragraph prediction result.
5. The method for identifying paragraphs based on machine learning according to claim 4, wherein the inputting the target text feature vector and the semantic correlation matrix into the fusion layer to obtain a feature fusion vector corresponding to the target text feature vector comprises:
determining a convolution kernel corresponding to the fusion layer according to the semantic correlation matrix;
and performing convolution on the target character feature vector based on the convolution kernel to obtain the feature fusion vector.
6. The method according to claim 4, wherein before inputting the feature fusion vector into the self-attention weighting layer for weight value assignment to obtain a target feature fusion vector, the method further comprises:
performing residual connection and normalization on the feature fusion vector to obtain the normalized feature fusion vector;
the inputting the feature fusion vector into the self-attention weighting layer for weight value distribution to obtain a target feature fusion vector includes:
and inputting the normalized feature fusion vector into the self-attention weight layer for weight value distribution to obtain the target feature fusion vector.
7. The method for identifying paragraphs based on machine learning according to any of claims 1-6, wherein the obtaining context data to be merged comprises:
performing type conversion on a first document to be subjected to document type conversion based on a preset type conversion strategy to obtain a corresponding second document;
determining the context data to be merged according to the characters of every two adjacent paragraphs in the second document;
the acquiring of the image data corresponding to the context data includes:
and determining the image data according to a text area corresponding to the context data in the first document.
8. A machine learning-based paragraph recognition apparatus, comprising:
a data acquisition module, configured to acquire context data to be merged and to acquire image data corresponding to the context data, wherein the context data is the text corresponding to the image data;
a feature extraction module, configured to input the image data into a target detection model for feature extraction to obtain an image feature vector corresponding to the image data, and to input the context data into a word vector model for vectorization to obtain a text feature vector corresponding to the context data;
a paragraph prediction module, configured to input the image feature vector and the text feature vector into a paragraph prediction model for paragraph prediction, to obtain a paragraph prediction result corresponding to the context data; and
a paragraph merging module, configured to merge the text belonging to the same paragraph in the context data according to the paragraph prediction result.
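The four modules of claim 8 form a pipeline: acquire context pairs and image regions, extract image and text features, predict whether a pair belongs to one paragraph, and merge accordingly. The sketch below wires them together with hypothetical model interfaces (the callables stand in for the target detection, word vector, and paragraph prediction models, whose real signatures the claim does not specify):

```python
from typing import Callable, List, Sequence

def recognize_paragraphs(
    context: List[str],                  # paragraph texts, in document order
    images: Sequence,                    # image region per adjacent pair
    extract_image_feats: Callable,       # stands in for the target detection model
    embed_text: Callable,                # stands in for the word vector model
    same_paragraph: Callable,            # stands in for the paragraph prediction model
) -> List[str]:
    """Illustrative end-to-end flow of the four claimed modules."""
    merged = [context[0]]
    for i in range(len(context) - 1):
        img_vec = extract_image_feats(images[i])
        txt_vec = embed_text(context[i] + context[i + 1])
        if same_paragraph(img_vec, txt_vec):
            merged[-1] += context[i + 1]     # prediction: same paragraph, merge
        else:
            merged.append(context[i + 1])    # prediction: new paragraph starts
    return merged
```

With stub models this shows the merging behaviour the paragraph merging module is responsible for; real deployments would substitute trained models for the three callables.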
9. A computer device, comprising a memory and a processor;
the memory being configured to store a computer program; and
the processor being configured to execute the computer program and, when executing the computer program, to implement the machine learning-based paragraph identification method according to any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the machine learning-based paragraph identification method according to any one of claims 1 to 7.
CN202110467091.XA 2021-04-28 2021-04-28 Paragraph identification method and device based on machine learning, computer equipment and medium Pending CN113159013A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110467091.XA CN113159013A (en) 2021-04-28 2021-04-28 Paragraph identification method and device based on machine learning, computer equipment and medium

Publications (1)

Publication Number Publication Date
CN113159013A true CN113159013A (en) 2021-07-23

Family

ID=76871926

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110467091.XA Pending CN113159013A (en) 2021-04-28 2021-04-28 Paragraph identification method and device based on machine learning, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN113159013A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918487A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 Intelligent answer method and system based on network encyclopedia
CN110188781A (en) * 2019-06-06 2019-08-30 焦点科技股份有限公司 A kind of ancient poetry text automatic identifying method based on deep learning
CN110472242A (en) * 2019-08-05 2019-11-19 腾讯科技(深圳)有限公司 A kind of text handling method, device and computer readable storage medium
CN110728117A (en) * 2019-08-27 2020-01-24 达而观信息科技(上海)有限公司 Paragraph automatic identification method and system based on machine learning and natural language processing
CN111222368A (en) * 2018-11-26 2020-06-02 北京金山办公软件股份有限公司 Method and device for identifying document paragraph and electronic equipment
CN111460889A (en) * 2020-02-27 2020-07-28 平安科技(深圳)有限公司 Abnormal behavior identification method, device and equipment based on voice and image characteristics
CN111860398A (en) * 2020-07-28 2020-10-30 河北师范大学 Remote sensing image target detection method and system and terminal equipment
CN112418209A (en) * 2020-12-15 2021-02-26 润联软件系统(深圳)有限公司 Character recognition method and device, computer equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113435449A (en) * 2021-08-03 2021-09-24 全知科技(杭州)有限责任公司 OCR image character recognition and paragraph output method based on deep learning
CN113435449B (en) * 2021-08-03 2023-08-22 全知科技(杭州)有限责任公司 OCR image character recognition and paragraph output method based on deep learning
CN113673255A (en) * 2021-08-25 2021-11-19 北京市律典通科技有限公司 Text function region splitting method and device, computer equipment and storage medium
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium
CN113723312A (en) * 2021-09-01 2021-11-30 东北农业大学 Visual transform-based rice disease identification method
CN113723312B (en) * 2021-09-01 2024-01-23 东北农业大学 Rice disease identification method based on visual transducer

Similar Documents

Publication Publication Date Title
US10482174B1 (en) Systems and methods for identifying form fields
US11816138B2 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
US11157816B2 (en) Systems and methods for selecting and generating log parsers using neural networks
CA3124358C (en) Method and system for identifying citations within regulatory content
CN113159013A (en) Paragraph identification method and device based on machine learning, computer equipment and medium
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
US11557140B2 (en) Model-independent confidence values for extracted document information using a convolutional neural network
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
US10824808B2 (en) Robust key value extraction
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN114724156A (en) Form identification method and device and electronic equipment
EP4060526A1 (en) Text processing method and device
CN113723077A (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN112613293A (en) Abstract generation method and device, electronic equipment and storage medium
CN116225956A (en) Automated testing method, apparatus, computer device and storage medium
US20230138491A1 (en) Continuous learning for document processing and analysis
EP3640861A1 (en) Systems and methods for parsing log files using classification and a plurality of neural networks
CN112381458A (en) Project evaluation method, project evaluation device, equipment and storage medium
US20230368556A1 (en) Character-based representation learning for table data extraction using artificial intelligence techniques
US11763585B2 (en) Multi-layer neural network and convolutional neural network for context sensitive optical character recognition
CN116523032B (en) Image text double-end migration attack method, device and medium
EP4089568A1 (en) Cascade pooling for natural language document processing
US20240028828A1 (en) Machine learning model architecture and user interface to indicate impact of text ngrams
US20230368553A1 (en) Character-based representation learning for information extraction using artificial intelligence techniques

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination