CN116912845B - Intelligent content identification and analysis method and device based on NLP and AI - Google Patents
- Publication number: CN116912845B (application CN202310726304.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- processing
- text information
- feature
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V 30/19173 — Character recognition using electronic means; classification techniques
- G06F 16/353 — Information retrieval of unstructured textual data; clustering; classification into predefined classes
- G06F 40/12 — Handling natural language data; use of codes for handling textual entities
- G06F 40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06N 3/045 — Neural networks; combinations of networks
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06N 3/048 — Activation functions
- G06N 3/08 — Learning methods
- G06N 3/084 — Backpropagation, e.g. using gradient descent
- G06V 10/82 — Image or video recognition or understanding using neural networks
- G06V 30/153 — Segmentation of character regions using recognition of characters or words
- G06V 30/1607 — Image preprocessing; correcting image deformation, e.g. trapezoidal deformation caused by perspective
- G06V 30/162 — Image preprocessing; quantising the image signal
- G06V 30/164 — Image preprocessing; noise filtering
- G06V 30/168 — Image preprocessing; smoothing or thinning of the pattern; skeletonisation
- G06V 30/1801 — Extraction of features; detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
- Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an intelligent content identification and analysis method and device based on NLP and AI. The method comprises the following steps: acquiring an original file image and optimizing it with image enhancement technology to obtain an optimized file image; cutting the optimized file image into independent character images with an image-based character cutting method; recognizing the independent character images with a character recognition algorithm based on binarized images to obtain text information; preprocessing the text information to obtain preprocessed text information; vectorizing the preprocessed text information with a feature extraction method to obtain feature vectors; and inputting the feature vectors into a language model, which classifies and parses the text information. By adopting NLP and AI technology, the invention realizes intelligent content identification and analysis, provides rich information input, and supplies big data support for the intelligent operation of enterprises.
Description
Technical Field
The invention relates to the technical fields of natural language processing (NLP) and artificial intelligence (AI), and in particular to an intelligent content identification and analysis method and device based on NLP and AI.
Background
At present, AI technology is developing rapidly and AI products are widely used in people's daily lives. One technology has played an indispensable role in this development: NLP. NLP is a very important direction in the field of artificial intelligence today; its purpose is to achieve efficient communication between humans and computer programs using natural language. Traditional business systems still adopt a coarse management mode in which a whole electronic file is the unit of management. To change this mode, the invention adopts NLP and AI technologies to realize intelligent content identification and analysis and to provide rich information input for various business systems, thereby supplying big data support for the intelligent operation of enterprises.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an intelligent content identification and analysis method and device based on NLP and AI, which use NLP and AI technology to realize intelligent content identification and analysis and to provide rich information input for various business systems, thereby supplying big data support for the intelligent operation of enterprises.
In order to solve the technical problems, an embodiment of the present invention provides an intelligent content identification and analysis method based on NLP and AI, the method comprising:
Acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
carrying out data preprocessing on the text information to obtain preprocessed text information;
performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
Optionally, the obtaining the original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image, including:
acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image;
performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing;
Performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing;
and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
Optionally, the image-based character cutting method processes the file image after the optimization processing to obtain an independent character image, and includes:
performing binarization processing on the file image after optimization processing to obtain a binarized image;
thinning the binarized image based on a thinning algorithm to obtain a character skeleton;
and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
Optionally, the binary image-based character recognition algorithm performs recognition processing on the independent character image to obtain text information, and includes:
performing binarization processing on the independent character image to obtain a binarized independent character image;
extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises lattice features, feature lines and grid features;
Inputting the characteristic information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters;
and carrying out fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Optionally, the performing data preprocessing on the text information to obtain preprocessed text information includes:
performing corpus cleaning on the text information to obtain corpus-cleaned text information;
performing word segmentation on the corpus-cleaned text information to obtain word-segmented text information;
and performing part-of-speech tagging on the word-segmented text information to obtain part-of-speech-tagged text information.
Optionally, the text vectorization processing is performed on the preprocessed text information based on the feature extraction method to obtain feature vectors, which includes:
carrying out numerical processing on the preprocessed text information by adopting an Embedding algorithm to obtain word vectors;
and inputting the word vector into a CNN model for feature extraction to obtain a feature vector.
Optionally, the structure of the CNN model includes:
the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer inputs the word vector; the convolution layer carries out convolution processing on the input word vector by using a filter to obtain a convolved word vector, wherein the used activation function is a Relu function; the pooling layer performs pooling treatment on the convolved word vectors, and adds dropout regularization to obtain feature vectors; the output layer outputs the extracted feature vector.
Optionally, the inputting the feature vector into a language model, and performing classification and parsing processing of text information based on the language model includes:
constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer;
training the initial language model to obtain a trained language model;
and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
Optionally, the training process of the language model includes:
collecting text information data, and labeling the text information data to obtain labeled text information data;
performing text vectorization processing on the text information data with the labels to obtain feature vectors with the labels;
importing the feature vector with the label into an initial language model;
dividing the feature vector with the label in the initial language model into a training set, a verification set and a test set;
training the initial language model by using a training set, verifying the accuracy of the model by using a verification set after each round of training, and adjusting model parameters according to the result of the verification set to obtain a trained language model;
testing the trained language model by using a test set, and verifying the generalization capability and accuracy of the trained language model;
and optimizing the model based on the result of the test set.
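The division of labelled feature vectors into training, validation and test sets can be sketched as follows; the 70/15/15 ratios and the random seed are illustrative assumptions, since the text does not specify them:

```python
import numpy as np

def split_dataset(X, y, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle and partition labelled feature vectors into training,
    validation and test sets (ratios and seed are assumed values)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(len(X) * ratios[0])
    n_val = int(len(X) * ratios[1])
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# 100 labelled feature vectors -> 70 / 15 / 15 split.
X = np.arange(100, dtype=float).reshape(100, 1)
y = (X[:, 0] > 49).astype(int)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # -> 70 15 15
```

The validation split drives the per-round parameter adjustment described above, while the held-out test split measures generalization.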
In addition, the embodiment of the invention also provides an intelligent content recognition and analysis device based on NLP and AI, which comprises:
and the optimization processing module is used for: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
and a character cutting module: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
and a character recognition module: performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
and a pretreatment module: carrying out data preprocessing on the text information to obtain preprocessed text information;
and the feature extraction module is used for: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
classification and analysis module: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the embodiment of the invention, NLP and AI technologies give the recognized text content better recognition accuracy and greatly reduce errors caused by content identification. Based on a deep learning method, intelligent analysis and application of massive data content is achieved in a short time through three stages: corpus preprocessing, model design, and model training. This improves data quality, reduces manual entry work and business processes, improves work efficiency, and reduces labor input, while providing rich information input for various business systems and thereby big data support for the intelligent operation of enterprises.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of an intelligent content recognition and analysis method based on NLP and AI;
FIG. 2 is a schematic structural diagram of an intelligent content recognition and analysis device based on NLP and AI.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of an intelligent content recognition and analysis method based on NLP and AI in an embodiment of the invention.
As shown in fig. 1, a method for identifying and analyzing intelligent content based on NLP and AI, the method comprising:
s11: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
in the implementation process of the invention, the method for obtaining the original file image and optimizing the original file image by using the image enhancement processing technology to obtain the optimized file image comprises the following steps: acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image; performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing; performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing; and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
Specifically, an original file image is acquired and denoised with a median filter, which replaces the value at each point of the image with the median of the values in that point's neighbourhood; since surrounding pixel values are close to the true value, isolated noise points are eliminated, yielding a denoised file image. Edge enhancement is then performed on the denoised file image based on a Hilbert transform: the object field of the image is modulated in the frequency domain, a Fourier transform is applied, the imaginary part is taken along with its absolute value, the 1/n power is computed after summing along the frequency axis, and edges are enhanced along the vertical and horizontal axes respectively, producing the edge-enhanced file image. Bending correction is then applied based on an offset-field method: a deformation-correction network produces an offset field that displaces each pixel of the file image accordingly, yielding the bend-corrected file image. Finally, handwriting erasure is performed based on a handwriting-erasure clustering method: the data nearest the mean at each centroid of the image are repeatedly taken so that similar objects are automatically grouped into the same cluster, the loop runs until clustering is complete, and the parts of the image requiring handwriting erasure are then corrected, giving the optimized file image. Processing the image with enhancement technology strengthens the useful information of the image and improves its quality for further processing.
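As a minimal illustration of the median-filtering step, the pure-NumPy sketch below replaces each pixel with the median of its 3×3 neighbourhood, removing an isolated noise point exactly as the text describes. Border handling by reflection is an implementation choice, not from the source:

```python
import numpy as np

def median_filter(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Replace each pixel with the median of its k x k neighbourhood;
    border pixels are handled by reflecting the image edge."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(padded[y:y + k, x:x + k])
    return out

# A flat 5x5 image with one isolated noise spike: the filter removes it.
img = np.full((5, 5), 10, dtype=np.uint8)
img[2, 2] = 255
print(median_filter(img)[2, 2])  # -> 10
```

Production code would typically call an optimized routine (e.g. an image library's median blur) rather than this per-pixel loop.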
S12: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
in the implementation process of the invention, the method for cutting the characters based on the images processes the file images after the optimization processing to obtain independent character images, and comprises the following steps: performing binarization processing on the file image after optimization processing to obtain a binarized image; carrying out refinement treatment on the binary image based on a refinement algorithm to obtain a character skeleton; and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
Specifically, the optimized file image is binarized: the image is converted to grayscale, and for a grayscale image with 256 brightness levels a suitable threshold is chosen (for example half of the gray-level range, or the mean of the image), yielding a binarized image that still reflects the global and local characteristics of the image and is convenient for the subsequent thinning algorithm. The thinning algorithm first sets a neighbourhood template and tests each pixel of the binarized image against four conditions: first, the number of black neighbours of the central pixel lies in [2, 6]; second, deleting the pixel does not affect the connectivity of the image; third, deleting the pixel does not break a horizontal line; fourth, deleting the pixel does not break a vertical line. A pixel satisfying all four conditions is deleted, and the deletion is repeated until no pixel satisfies them. After the character skeleton is obtained by the thinning algorithm, crossing points are located on the skeleton and an optimal interval is set; the character images are superimposed according to their positions on the original image; a dividing line is obtained using a thickening process; the distance of each crossing point from the dividing line is judged against a defined tolerance, crossing points within the tolerance being kept as necessary and the rest eliminated; and the cutting operation is performed to obtain independent character images. Cutting the image prepares it for character recognition, so that the subsequent recognition processing can be performed more quickly.
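A minimal sketch of the binarization and cutting stages: thresholding at the image mean (one of the threshold choices the text mentions) and, in place of the skeleton and crossing-point procedure, a simpler column-projection cut that splits on empty columns — a stand-in for illustration, not the patented method:

```python
import numpy as np

def binarize(gray: np.ndarray) -> np.ndarray:
    """Threshold at the image mean; 1 = ink, 0 = background."""
    return (gray < gray.mean()).astype(np.uint8)

def cut_characters(binary: np.ndarray) -> list:
    """Split on empty columns of the vertical projection — a simple
    stand-in for the skeleton/crossing-point cutting described above."""
    cols = binary.sum(axis=0)
    segments, start = [], None
    for x, c in enumerate(cols):
        if c > 0 and start is None:
            start = x
        elif c == 0 and start is not None:
            segments.append(binary[:, start:x])
            start = None
    if start is not None:
        segments.append(binary[:, start:])
    return segments

# Two dark "characters" on a light background, separated by blank columns.
gray = np.full((5, 7), 255, dtype=np.uint8)
gray[:, 0:2] = 0
gray[:, 4:6] = 0
chars = cut_characters(binarize(gray))
print(len(chars))  # -> 2
```

Projection cutting fails on touching or overlapping characters, which is precisely what the skeleton-based crossing-point method in the text is designed to handle.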
S13: performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
in the implementation process of the invention, the character recognition algorithm based on the binary image carries out recognition processing on the independent character image to obtain text information, and the method comprises the following steps: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises lattice features, feature lines and grid features; inputting the characteristic information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters; and carrying out fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Specifically, binarizing the independent character image, graying the independent character image, and then selecting a proper threshold value for the gray level image with 256 brightness levels, wherein the threshold value can be selected from half of a gray level range or an average value of the image, and the like, so that the binarized independent character image which can still reflect the whole and local characteristics of the image is obtained, and the subsequent characteristic extraction processing can be continued after the image is binarized; extracting dot matrix features, feature lines and grid features of the binary independent character image based on a feature extraction method, placing the binary independent character image in a coordinate system, determining a dot value according to the existence or non-existence of pixels on the coordinate system, splicing each eight pixel values into bytes to obtain dot matrix features, counting the number of line segments of even lines and columns to form a two-dimensional feature vector, recombining dot matrixes to obtain feature lines, partitioning the dot matrix structure, counting the number of foreground pixels in each partition to serve as statistical features to obtain grid features, wherein the dot matrix features reflect the overall features of the character image, the feature lines and the grid features reflect the local features of the character image, and the features have complementary relations; training a sample by using a BP neural network and using a log sigmode linear function, and setting the maximum iteration times, wherein the structure of the BP neural network comprises an input layer, a hidden layer and an output layer, and characteristic information of a binarized independent character image is input into the trained BP neural network to perform character classification processing to obtain classified characters; the classified characters are subjected to fine classification recognition 
processing based on a hierarchical recognition algorithm: classification is first carried out through clustering processing, the classified character images are input into the corresponding nodes of a tree structure to complete coarse classification, probabilistic averaging functions are used for integration, and the sub-nodes of the coarse classification results are further classified to obtain recognition results and thus text information; better recognition accuracy can be obtained through the hierarchical recognition algorithm.
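As a rough illustration of the binarization threshold and the grid features described above, the following Python sketch binarizes a toy character image and counts foreground pixels per partition; the grid size and the mean-value threshold are illustrative assumptions, not the patent's exact parameters:

```python
import numpy as np

def binarize(gray, threshold=None):
    """Binarize a grayscale character image (0-255).

    By default the threshold is the image mean, matching the
    "half of the gray range or image average" heuristic above.
    """
    if threshold is None:
        threshold = gray.mean()
    return (gray > threshold).astype(np.uint8)

def grid_features(binary, rows=4, cols=4):
    """Partition the binary image into a rows x cols grid and count
    foreground pixels per cell (the 'grid features' described above)."""
    h, w = binary.shape
    feats = []
    for i in range(rows):
        for j in range(cols):
            cell = binary[i * h // rows:(i + 1) * h // rows,
                          j * w // cols:(j + 1) * w // cols]
            feats.append(int(cell.sum()))
    return feats

# toy 8x8 "character": a vertical bar
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 3:5] = 255
b = binarize(img)
print(grid_features(b, 2, 2))   # one count per quadrant
```

Grid features of this kind are robust local statistics that complement the global dot matrix features, which is the complementary relationship the text mentions.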
S14: carrying out data preprocessing on the text information to obtain preprocessed text information;
in the implementation process of the invention, the data preprocessing of the text information to obtain preprocessed text information comprises the following steps: carrying out corpus cleaning treatment on the text information to obtain text information after corpus cleaning treatment; word segmentation is carried out on the text information after data cleaning, and the text information after word segmentation is obtained; and performing part-of-speech tagging on the text information subjected to word segmentation processing to obtain the text information subjected to part-of-speech tagging.
Specifically, corpus cleaning is carried out on the text information: character strings in the text information are matched against rule-conforming patterns using regular matching rules, and special characters, repeated data and stop words are removed to obtain the text information after corpus cleaning; word segmentation processing is carried out on the text information after data cleaning: the reliability of word formation is reflected by the frequency with which characters occur adjacently, the frequency of character combinations that commonly appear adjacently in the corpus is counted, and when the combination frequency is higher than a critical value, the combination can be considered to form a word, yielding the text information after word segmentation processing; part-of-speech tagging is carried out on the text information after word segmentation: part-of-speech tagging is treated as a sequence tagging task in which, given a sequence of units, a tag is assigned to each unit in the sequence, the probability distribution over possible tag sequences is computed, the best tag sequence is selected, and the most probable part of speech of each word is determined and tagged, yielding the text information after part-of-speech tagging; preprocessing the text information makes it cleaner, more accurate and more reliable.
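The corpus-cleaning and frequency-based segmentation steps above can be sketched as follows; the regular expression, the stop-word list and the critical value are illustrative assumptions, not the patent's actual rules:

```python
import re
from collections import Counter

def clean(text, stop_words=frozenset({"the", "a"})):
    """Corpus cleaning sketch: keep only letter/CJK runs via a regular
    expression, then drop stop words and repeated tokens."""
    tokens = re.findall(r"[A-Za-z\u4e00-\u9fff]+", text.lower())
    seen, out = set(), []
    for t in tokens:
        if t in stop_words or t in seen:
            continue
        seen.add(t)
        out.append(t)
    return out

def adjacent_pairs(corpus_tokens, threshold=2):
    """Frequency-based segmentation sketch: count adjacent token pairs
    and keep combinations whose adjacent co-occurrence count reaches
    the critical value, treating them as candidate words."""
    pairs = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return [p for p, n in pairs.items() if n >= threshold]

print(clean("The cat saw the cat!"))
print(adjacent_pairs(["big", "data", "big", "data"]))
```

A production system would use a much larger corpus and a tuned critical value; the point here is only the counting mechanism the text describes.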
S15: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
in the implementation process of the invention, the text vectorization processing is carried out on the preprocessed text information based on the feature extraction method to obtain the feature vector, which comprises the following steps: carrying out numerical processing on the preprocessed text information by adopting an Embedding algorithm to obtain word vectors; and inputting the word vector into a CNN model for feature extraction to obtain a feature vector.
Specifically, an Embedding algorithm is adopted to perform numerical processing on the preprocessed text information: each word of the preprocessed text information is taken as a feature column through a one-hot function and mapped into a word vector space to obtain an original word vector, and dimension reduction is then performed on the original word vector to obtain a word vector that can be fed to the input layer of a CNN model; the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer receives the word vector; the convolution layer performs convolution processing on the input word vector using filters to obtain a convolved word vector, with the ReLU function as the activation function; the pooling layer performs pooling processing on the convolved word vectors, with dropout regularization added, to obtain feature vectors; the output layer outputs the extracted feature vector; the one-dimensional text information is thereby converted into a two-dimensional input vector to meet the input requirement of the model.
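The Embedding step described above (one-hot mapping followed by dimension reduction) amounts to a row lookup in an embedding matrix; the toy vocabulary, the reduced dimension and the random matrix below are assumptions for illustration only:

```python
import numpy as np

# One-hot each word over the vocabulary, then multiply by a
# (vocab_size x d) matrix: mathematically this is exactly E[index].
vocab = {"intelligent": 0, "content": 1, "analysis": 2}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 4))      # reduced dimension d = 4

def embed(words):
    """Map words to dense vectors via one_hot @ E (== row lookup)."""
    one_hot = np.eye(len(vocab))[[vocab[w] for w in words]]
    return one_hot @ E                    # shape (len(words), d)

vecs = embed(["content", "analysis"])     # the 2-D input the CNN expects
```

This also shows why a sentence becomes a two-dimensional input: a sequence of n words yields an (n x d) matrix, as the text requires for the CNN input layer.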
S16: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the implementation process of the invention, the step of inputting the feature vector into a language model and carrying out classification and analysis processing of text information based on the language model comprises the following steps: constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer; training the initial language model to obtain a trained language model; and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
Specifically, the language model structure includes: the input layer, which receives the feature vector; the convolution layer, which performs convolution processing on the input feature vector using the ReLU activation function, where the convolution layer comprises several convolution kernels and the kernels slide in only one direction; the pooling layer, which performs max-pooling processing on the convolved feature vector, carrying out the max-pooling operation in only one direction; the fully connected layer, which cascades the pooled feature vectors, with dropout regularization added to prevent overfitting; and the output layer, which uses a softmax activation function to compress the vector to the dimensionality of the number of categories, obtaining the probability of the text belonging to each category and thereby the text semantic information; the TextCNN model has the advantages of a relatively simple structure and fast training, and can achieve good results.
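A minimal forward-pass sketch of such a TextCNN follows: each kernel spans the full embedding width and slides in only one direction, 1-max pooling acts along that same direction, and softmax produces class probabilities. All weights here are random illustrative values, not a trained model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def textcnn_forward(X, kernels, W_fc, b_fc):
    """Forward pass of a TextCNN-style network.

    X: (seq_len, d) word-vector matrix. Each kernel K has shape (h, d),
    so convolution slides only along the sequence axis; pooling is
    1-max over that axis, one scalar per kernel.
    """
    pooled = []
    for K in kernels:
        h = K.shape[0]
        conv = np.array([relu((X[i:i + h] * K).sum())
                         for i in range(X.shape[0] - h + 1)])
        pooled.append(conv.max())          # 1-max pooling
    feat = np.array(pooled)                # concatenated (FC input)
    return softmax(W_fc @ feat + b_fc)     # class probabilities

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 8))               # 10 words, d = 8
kernels = [rng.normal(size=(h, 8)) for h in (2, 3, 4)]
probs = textcnn_forward(X, kernels, rng.normal(size=(3, 3)), np.zeros(3))
```

Dropout is omitted here because it is only active during training; at inference the forward pass is exactly this pipeline.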
Specifically, the training process of the language model includes: collecting text information data, and labeling the text information data to obtain labeled text information data; performing text vectorization processing on the labeled text information data to obtain labeled feature vectors; importing the labeled feature vectors into an initial language model; dividing the labeled feature vectors in the initial language model into a training set, a verification set and a test set; training the initial language model with the training set, verifying the accuracy of the model with the verification set after each round of training, and adjusting the model parameters according to the verification-set results to obtain a trained language model; testing the trained language model with the test set to verify its generalization capability and accuracy; and optimizing the model based on the test-set results, wherein optimizing the model comprises: for the partitioned data set, performing forward propagation layer by layer from the input layer, weighting each layer and passing the result through a nonlinear function, namely the activation function, to obtain a preliminary output; calculating an error value with the loss function and back-propagating according to the error value; and updating the parameters of the model with the optimizer at the end of each batch.
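The training loop described above (data split, forward propagation through an activation, loss, back-propagation, optimizer update) can be sketched on a logistic-regression stand-in rather than the full TextCNN; the synthetic data and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic labels

# train / verification / test split
X_tr, X_val, X_te = X[:200], X[200:250], X[250:]
y_tr, y_val, y_te = y[:200], y[200:250], y[250:]

w, b, lr = np.zeros(5), 0.0, 0.1
for epoch in range(50):
    p = 1 / (1 + np.exp(-(X_tr @ w + b)))     # forward propagation
    grad = p - y_tr                           # gradient of cross-entropy
    w -= lr * X_tr.T @ grad / len(y_tr)       # optimizer update (plain SGD)
    b -= lr * grad.mean()

def accuracy(Xs, ys):
    pred = 1 / (1 + np.exp(-(Xs @ w + b))) > 0.5
    return (pred == ys).mean()

print(accuracy(X_val, y_val), accuracy(X_te, y_te))
```

In the real pipeline the verification-set accuracy would drive hyperparameter adjustment between rounds, and the test set would be held out until the end, exactly as the text prescribes.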
In the embodiment of the invention, NLP and AI technologies are adopted so that better recognition accuracy is obtained for the recognized text content and errors caused by content recognition are greatly reduced; based on a deep learning method, intelligent analysis and application of massive data content within a short time is realized through the three stages of corpus preprocessing, model design and model training, which improves data quality, reduces manual entry work and business processes, improves work efficiency, reduces labor input, and provides rich information input for various business systems, thereby providing big data support for the intelligent operation of enterprises.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of an intelligent content recognition and analysis device based on NLP and AI in an embodiment of the invention.
As shown in fig. 2, an intelligent content recognition and analysis device based on NLP and AI comprises:
Optimization processing module 21: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
In the implementation process of the invention, the method for obtaining the original file image and optimizing the original file image by using the image enhancement processing technology to obtain the optimized file image comprises the following steps: acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image; performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing; performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing; and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
Specifically, an original file image is obtained and denoised using the median filtering method, in which the value of a point in the image is replaced by the median of the point values in a neighborhood of that point, so that surrounding pixel values approach the true value and isolated noise points are eliminated, yielding the denoised file image; enhanced image edge processing is performed on the denoised file image based on the Hilbert transform method: the object field of the file image is first modulated in the frequency domain, a Fourier transform is applied, the imaginary part of the transformed frequency domain is taken in absolute value, the 1/n power is computed after summing along the frequency, and edge enhancement is carried out along the vertical and horizontal axes respectively to obtain the file image after enhanced image edge processing; bending correction processing is performed on the edge-enhanced file image based on the offset field method: an offset field is first formed by a deformation correction network, and the offset field applies a corresponding displacement to each pixel of the file image to obtain the file image after bending correction processing; handwriting erasure processing is performed on the bend-corrected file image based on the handwriting erasure clustering method: by repeatedly taking the data nearest to the mean from the image centroid, similar objects are automatically grouped into the same cluster, and this is executed in a loop until clustering is complete, after which the parts of the image requiring handwriting erasure are corrected to obtain the file image after optimization processing; processing the image with image enhancement technology enhances the useful information of the image and improves its quality, so that the image can be further processed.
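The median filtering step can be sketched as follows; the kernel size and reflection padding are illustrative choices, not requirements of the invention:

```python
import numpy as np

def median_filter(img, k=3):
    """Median-filter sketch: replace each pixel with the median of its
    k x k neighborhood (edges handled by reflection padding), which
    suppresses isolated noise points as described above."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

img = np.full((5, 5), 100, dtype=np.uint8)
img[2, 2] = 255                            # isolated noise point
filtered = median_filter(img)
print(filtered[2, 2])                      # noise point replaced by 100
```

Unlike mean filtering, the median is insensitive to a single outlier in the window, which is why isolated noise points vanish while edges are largely preserved.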
Character cutting module 22: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
in the implementation process of the invention, the method for cutting the characters based on the images processes the file images after the optimization processing to obtain independent character images, and comprises the following steps: performing binarization processing on the file image after optimization processing to obtain a binarized image; carrying out refinement treatment on the binary image based on a refinement algorithm to obtain a character skeleton; and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
Specifically, the optimized file image is binarized: the image is grayed, and a proper threshold value is then selected for the gray-level image with 256 brightness levels, where the threshold can be chosen as half of the gray-level range or the average value of the image, yielding a binarized image that still reflects the overall and local characteristics of the image and is convenient for subsequent processing by the thinning algorithm; the binarized image is thinned based on the thinning algorithm: a neighborhood template is first set, and each pixel point of the binarized image is judged against the following conditions: the first condition ensures that the number of black points in the neighborhood of the central pixel lies in [2,6]; the second condition ensures that deleting the pixel point does not affect the connectivity of the image; the third condition is that deleting the pixel point does not produce a horizontal line fracture; the fourth condition is that deleting the pixel point does not produce a vertical line fracture; the pixel point is deleted when all four conditions are met, and the deletion operation is repeated until no pixel point satisfying the conditions remains; after the character skeleton is obtained with the thinning algorithm, crossing points are located according to the character skeleton, an optimal interval is set, the character images are superposed according to their positional relation on the original image, a dividing line is obtained using thickening processing, the distance of each crossing point from the dividing line is judged, and a fault tolerance is defined for this length: crossing points within the fault tolerance are necessary crossing points, unnecessary crossing points are eliminated, and the cutting operation is performed to obtain independent character images; cutting the image prepares for subsequent character recognition, so that the subsequent recognition processing can be performed more quickly.
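A sketch of the thinning conditions, assuming a Zhang-Suen-style formulation: the first condition is as stated (black-neighbor count in [2,6]), while a single 0-to-1 transition count around the pixel stands in for the patent's connectivity and fracture conditions; the exact conditions of the invention may differ:

```python
import numpy as np

def neighbors(img, r, c):
    """8-neighborhood in clockwise order starting above the pixel."""
    return [img[r-1, c], img[r-1, c+1], img[r, c+1], img[r+1, c+1],
            img[r+1, c], img[r+1, c-1], img[r, c-1], img[r-1, c-1]]

def deletable(img, r, c):
    """Check the thinning conditions for one foreground pixel:
    (1) black-neighbor count lies in [2, 6];
    (2) exactly one 0->1 transition around the pixel, so deletion
        neither breaks connectivity nor fractures a line."""
    if img[r, c] != 1:
        return False
    n = neighbors(img, r, c)
    black = sum(n)
    transitions = sum(1 for a, b in zip(n, n[1:] + n[:1])
                      if a == 0 and b == 1)
    return 2 <= black <= 6 and transitions == 1

img = np.zeros((5, 5), dtype=int)
img[1:4, 1:4] = 1                 # 3x3 foreground blob
print(deletable(img, 1, 1))       # corner pixel: removable
print(deletable(img, 2, 2))       # interior pixel: kept (8 neighbors)
```

A full thinning pass would sweep the image repeatedly, deleting all pixels flagged deletable, until a pass deletes nothing and only the one-pixel-wide skeleton remains.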
Character recognition module 23: performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
in the implementation process of the invention, the character recognition algorithm based on the binary image carries out recognition processing on the independent character image to obtain text information, and the method comprises the following steps: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises lattice features, feature lines and grid features; inputting the characteristic information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters; and carrying out fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Specifically, the independent character image is binarized: the image is first grayed, and a proper threshold value is then selected for the gray-level image with 256 brightness levels, where the threshold can be chosen as half of the gray-level range or the average value of the image, so that a binarized independent character image which still reflects the overall and local characteristics of the image is obtained, and the subsequent feature extraction processing can continue after the image is binarized; dot matrix features, feature lines and grid features of the binarized independent character image are extracted based on the feature extraction method: the binarized independent character image is placed in a coordinate system, a dot value is determined according to the presence or absence of a pixel at each coordinate, and every eight pixel values are spliced into a byte to obtain the dot matrix features; the number of line segments in even rows and columns is counted to form a two-dimensional feature vector, and the dot matrices are recombined to obtain the feature lines; the dot matrix structure is partitioned, and the number of foreground pixels in each partition is counted as a statistical feature to obtain the grid features, wherein the dot matrix features reflect the overall characteristics of the character image, the feature lines and grid features reflect its local characteristics, and these features are complementary; samples are trained with a BP neural network using a log-sigmoid (logsig) transfer function, with a maximum number of iterations set, wherein the structure of the BP neural network comprises an input layer, a hidden layer and an output layer, and the feature information of the binarized independent character image is input into the trained BP neural network for character classification processing to obtain classified characters; the classified characters are subjected to fine classification recognition 
processing based on a hierarchical recognition algorithm: classification is first carried out through clustering processing, the classified character images are input into the corresponding nodes of a tree structure to complete coarse classification, probabilistic averaging functions are used for integration, and the sub-nodes of the coarse classification results are further classified to obtain recognition results and thus text information; better recognition accuracy can be obtained through the hierarchical recognition algorithm.
Preprocessing module 24: carrying out data preprocessing on the text information to obtain preprocessed text information;
in the implementation process of the invention, the data preprocessing of the text information to obtain preprocessed text information comprises the following steps: carrying out corpus cleaning treatment on the text information to obtain text information after corpus cleaning treatment; word segmentation is carried out on the text information after data cleaning, and the text information after word segmentation is obtained; and performing part-of-speech tagging on the text information subjected to word segmentation processing to obtain the text information subjected to part-of-speech tagging.
Specifically, corpus cleaning is carried out on the text information: character strings in the text information are matched against rule-conforming patterns using regular matching rules, and special characters, repeated data and stop words are removed to obtain the text information after corpus cleaning; word segmentation processing is carried out on the text information after data cleaning: the reliability of word formation is reflected by the frequency with which characters occur adjacently, the frequency of character combinations that commonly appear adjacently in the corpus is counted, and when the combination frequency is higher than a critical value, the combination can be considered to form a word, yielding the text information after word segmentation processing; part-of-speech tagging is carried out on the text information after word segmentation: part-of-speech tagging is treated as a sequence tagging task in which, given a sequence of units, a tag is assigned to each unit in the sequence, the probability distribution over possible tag sequences is computed, the best tag sequence is selected, and the most probable part of speech of each word is determined and tagged, yielding the text information after part-of-speech tagging; preprocessing the text information makes it cleaner, more accurate and more reliable.
Feature extraction module 25: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
in the implementation process of the invention, the text vectorization processing is carried out on the preprocessed text information based on the feature extraction method to obtain the feature vector, which comprises the following steps: carrying out numerical processing on the preprocessed text information by adopting an Embedding algorithm to obtain word vectors; and inputting the word vector into a CNN model for feature extraction to obtain a feature vector.
Specifically, an Embedding algorithm is adopted to perform numerical processing on the preprocessed text information: each word of the preprocessed text information is taken as a feature column through a one-hot function and mapped into a word vector space to obtain an original word vector, and dimension reduction is then performed on the original word vector to obtain a word vector that can be fed to the input layer of a CNN model; the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer receives the word vector; the convolution layer performs convolution processing on the input word vector using filters to obtain a convolved word vector, with the ReLU function as the activation function; the pooling layer performs pooling processing on the convolved word vectors, with dropout regularization added, to obtain feature vectors; the output layer outputs the extracted feature vector; the one-dimensional text information is thereby converted into a two-dimensional input vector to meet the input requirement of the model.
Classification and parsing module 26: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the implementation process of the invention, the step of inputting the feature vector into a language model and carrying out classification and analysis processing of text information based on the language model comprises the following steps: constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer; training the initial language model to obtain a trained language model; and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
Specifically, the language model structure includes: the input layer, which receives the feature vector; the convolution layer, which performs convolution processing on the input feature vector using the ReLU activation function, where the convolution layer comprises several convolution kernels and the kernels slide in only one direction; the pooling layer, which performs max-pooling processing on the convolved feature vector, carrying out the max-pooling operation in only one direction; the fully connected layer, which cascades the pooled feature vectors, with dropout regularization added to prevent overfitting; and the output layer, which uses a softmax activation function to compress the vector to the dimensionality of the number of categories, obtaining the probability of the text belonging to each category and thereby the text semantic information; the TextCNN model has the advantages of a relatively simple structure and fast training, and can achieve good results.
Specifically, the training process of the language model includes: collecting text information data, and labeling the text information data to obtain labeled text information data; performing text vectorization processing on the labeled text information data to obtain labeled feature vectors; importing the labeled feature vectors into an initial language model; dividing the labeled feature vectors in the initial language model into a training set, a verification set and a test set; training the initial language model with the training set, verifying the accuracy of the model with the verification set after each round of training, and adjusting the model parameters according to the verification-set results to obtain a trained language model; testing the trained language model with the test set to verify its generalization capability and accuracy; and optimizing the model based on the test-set results, wherein optimizing the model comprises: for the partitioned data set, performing forward propagation layer by layer from the input layer, weighting each layer and passing the result through a nonlinear function, namely the activation function, to obtain a preliminary output; calculating an error value with the loss function and back-propagating according to the error value; and updating the parameters of the model with the optimizer at the end of each batch.
In the embodiment of the invention, NLP and AI technologies are adopted so that better recognition accuracy is obtained for the recognized text content and errors caused by content recognition are greatly reduced; based on a deep learning method, intelligent analysis and application of massive data content within a short time is realized through the three stages of corpus preprocessing, model design and model training, which improves data quality, reduces manual entry work and business processes, improves work efficiency, reduces labor input, and provides rich information input for various business systems, thereby providing big data support for the intelligent operation of enterprises.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, which may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk, optical disk, and the like.
In addition, the intelligent content identification and analysis method and device based on NLP and AI provided by the embodiments of the invention have been described in detail above; specific examples are used herein to explain the principles and embodiments of the invention, and the above description of the embodiments is provided only to help understand the method and core concepts of the invention; meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application according to the ideas of the invention, and in view of the above, the content of this description should not be construed as limiting the invention.
Claims (8)
1. An intelligent content identification and analysis method based on NLP and AI, which is characterized by comprising the following steps:
acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
the character recognition algorithm based on the binary image performs recognition processing on the independent character image to obtain text information, and the method comprises the following steps: performing binarization processing on the independent character image to obtain a binarized independent character image, extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot matrix features, feature lines and grid features, placing the binarized independent character image in a coordinate system, splicing pixel values of the binarized independent character image in the coordinate system to obtain dot matrix features, recombining two-dimensional feature vectors generated by the dot matrix features to obtain feature lines, performing structure partitioning processing on the dot matrix features to obtain grid features, inputting feature information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters, performing fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm, classifying the classified characters based on a clustering method, inputting the classified character images into nodes corresponding to a tree structure to perform coarse classification, integrating the coarse classification results based on probabilistic averaging functions, and classifying the integrated character images by using sub-nodes of the coarse classification results to obtain text information;
Carrying out data preprocessing on the text information to obtain preprocessed text information;
performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain a feature vector, performing text vectorization processing on the preprocessed text information based on the feature extraction method to obtain the feature vector, wherein the method comprises the following steps: performing numerical processing on the preprocessed text information by adopting an Embedding algorithm, mapping each word in the preprocessed text information into a word vector space based on a one-hot function, obtaining an original word vector, performing dimension reduction processing on the original word vector, obtaining a word vector, inputting the word vector into a CNN model, and performing feature extraction to obtain a feature vector;
inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
2. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the obtaining the original document image and optimizing the original document image using an image enhancement processing technique to obtain an optimized document image comprises:
Acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image;
performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing;
performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing;
and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
3. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the image-based character cutting method processes the optimized file image to obtain an independent character image, comprising:
performing binarization processing on the file image after optimization processing to obtain a binarized image;
performing thinning processing on the binarized image based on a thinning algorithm to obtain a character skeleton;
and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
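The binarize-then-cut flow of claim 3 can be illustrated with a fixed-threshold binarization followed by cutting at blank columns. The column-projection cut below is a simple stand-in for the adaptive, skeleton-guided cutting in the claim, and the pixel data is invented.

```python
def binarize(img, thresh=128):
    """Fixed-threshold binarization: 1 = ink (dark pixel), 0 = background."""
    return [[1 if px < thresh else 0 for px in row] for row in img]

def cut_columns(binary):
    """Return (start, end) column spans that contain ink."""
    w = len(binary[0])
    ink = [any(row[x] for row in binary) for x in range(w)]
    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, w))
    return spans

# Two one-column "characters" separated by a blank column.
gray = [[0, 255, 0],
        [0, 255, 0]]
print(cut_columns(binarize(gray)))  # one span per independent character
```

A skeleton-based method would refine these spans, e.g. splitting touching characters where the skeleton pinches to a single stroke.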
4. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the performing data preprocessing on the text information to obtain preprocessed text information comprises:
performing corpus cleaning processing on the text information to obtain text information after corpus cleaning processing;
performing word segmentation processing on the text information after corpus cleaning processing to obtain text information after word segmentation processing;
and performing part-of-speech tagging on the text information subjected to word segmentation processing to obtain the text information subjected to part-of-speech tagging.
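The claim-4 chain of cleaning, segmentation and part-of-speech tagging can be sketched as below. A real system would use a proper tokenizer and tagger; the regex cleaning, whitespace segmentation, and the tiny lexicon here are assumptions made only for the example.

```python
import re

def clean(text):
    """Corpus cleaning: strip HTML-style markup and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop markup tags
    return re.sub(r"[^\w\s]", " ", text)   # drop punctuation

def segment(text):
    """Toy word segmentation by whitespace."""
    return text.split()

pos_dict = {"recognition": "n", "intelligent": "adj"}  # hypothetical lexicon

def tag(words):
    """Toy part-of-speech tagging via dictionary lookup."""
    return [(w, pos_dict.get(w.lower(), "unk")) for w in words]

tokens = segment(clean("<b>Intelligent</b> recognition!"))
print(tag(tokens))
```

For Chinese text, where words are not whitespace-delimited, the segmentation step would instead use a dictionary- or model-based segmenter.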
5. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the structure of CNN model comprises:
the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer inputs the word vector; the convolution layer performs convolution processing on the input word vector by using a filter to obtain a convolved word vector, wherein the activation function used is the ReLU function; the pooling layer performs pooling processing on the convolved word vector and adds dropout regularization to obtain the feature vector; and the output layer outputs the extracted feature vector.
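The convolution, ReLU and pooling steps of the claim-5 structure can be sketched on a toy one-dimensional sequence. The filter weights and inputs are illustrative, not trained values, and dropout is omitted because it is a no-op at inference time.

```python
def conv1d(seq, kernel):
    """Valid-mode 1-D convolution (really cross-correlation) with one filter."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    """ReLU activation: clamp negative responses to zero."""
    return [max(0.0, x) for x in xs]

def max_pool(xs):
    """Global max pooling over the feature map."""
    return max(xs)

seq = [0.5, -1.0, 2.0, 0.0]   # stand-in for one channel of word vectors
kernel = [1.0, -1.0]          # one convolution filter
feature = max_pool(relu(conv1d(seq, kernel)))
print(feature)
```

Max pooling makes the extracted feature invariant to where in the sequence the filter responded most strongly, which is why it is the usual choice for text CNNs.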
6. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the inputting the feature vector into a language model and classifying and analyzing text information based on the language model comprises:
constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer;
training the initial language model to obtain a trained language model;
and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
7. The intelligent content recognition and analysis method based on NLP and AI of claim 6, wherein the training process of the language model comprises:
collecting text information data, and labeling the text information data to obtain labeled text information data;
performing text vectorization processing on the text information data with the labels to obtain feature vectors with the labels;
importing the feature vector with the label into an initial language model;
dividing the feature vector with the label in the initial language model into a training set, a verification set and a test set;
training the initial language model by using the training set, verifying the accuracy of the model by using the verification set after each round of training, and adjusting model parameters according to the verification set results to obtain a trained language model;
testing the trained language model by using a test set, and verifying generalization capability and accuracy of the trained language model;
and optimizing the model based on the result of the test set.
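The data split in the claim-7 training process can be sketched as below. An 80/10/10 split is assumed here for illustration; the claim itself does not fix the ratios.

```python
import random

def split(data, train=0.8, val=0.1, seed=42):
    """Shuffle labelled samples and divide them into train/validation/test sets."""
    data = data[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(n * train), int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

# Hypothetical (feature vector, label) pairs.
samples = [([i, i + 1], i % 2) for i in range(10)]
tr, va, te = split(samples)
print(len(tr), len(va), len(te))
```

Fixing the shuffle seed keeps the split reproducible across training runs, which matters when validation results drive parameter adjustment as in the claim.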
8. An intelligent content recognition and analysis device based on NLP and AI, the device comprising:
and the optimization processing module is used for: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
and a character cutting module: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
and a character recognition module: performing recognition processing on the independent character image based on a character recognition algorithm for binarized images to obtain text information, comprising: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot-matrix features, feature lines and grid features; placing the binarized independent character image in a coordinate system, and splicing pixel values of the binarized independent character image in the coordinate system to obtain the dot-matrix features; recombining two-dimensional feature vectors generated from the dot-matrix features to obtain the feature lines; performing structure partitioning processing on the dot-matrix features to obtain the grid features; inputting the feature information of the binarized independent character image into a BP neural network for character classification processing to obtain classified characters; and performing fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm, which comprises: grouping the classified characters based on a clustering method, inputting the character images into the corresponding nodes of a tree structure for coarse classification, integrating the coarsely classified character images based on a probabilistic averaging function, and finely classifying the integrated character images by using the sub-nodes of the coarse classification result to obtain the text information;
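Two of the feature types named in the character recognition module can be sketched concretely: the dot-matrix feature, read here as the binarized character's pixel lattice spliced in coordinate order, and the grid feature, read as ink counts per cell after partitioning that lattice. Both readings are assumptions for illustration; the claim does not give exact formulas.

```python
def dot_matrix(binary):
    """Splice pixel values row by row into one dot-matrix feature vector."""
    return [px for row in binary for px in row]

def grid_feature(binary, cells=2):
    """Partition the image into cells x cells blocks and count ink per block."""
    h, w = len(binary), len(binary[0])
    counts = []
    for cy in range(cells):
        for cx in range(cells):
            counts.append(sum(
                binary[y][x]
                for y in range(cy * h // cells, (cy + 1) * h // cells)
                for x in range(cx * w // cells, (cx + 1) * w // cells)))
    return counts

# Toy 2x2 binarized character (1 = ink).
char = [[1, 0],
        [0, 1]]
print(dot_matrix(char), grid_feature(char))
```

Either vector could then feed the BP neural network's input layer; the grid feature trades spatial detail for robustness to small stroke displacements.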
and a preprocessing module: performing data preprocessing on the text information to obtain preprocessed text information;
and the feature extraction module is used for: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain a feature vector, wherein the performing text vectorization processing on the preprocessed text information based on the feature extraction method to obtain the feature vector comprises: performing numerical processing on the preprocessed text information by adopting an Embedding algorithm; mapping each word in the preprocessed text information into a word vector space based on a one-hot function to obtain an original word vector; performing dimension reduction processing on the original word vector to obtain a word vector; and inputting the word vector into a CNN model for feature extraction to obtain the feature vector;
classification and analysis module: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310726304.5A CN116912845B (en) | 2023-06-16 | 2023-06-16 | Intelligent content identification and analysis method and device based on NLP and AI |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116912845A CN116912845A (en) | 2023-10-20 |
CN116912845B true CN116912845B (en) | 2024-03-19 |
Family
ID=88359159
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310726304.5A Active CN116912845B (en) | 2023-06-16 | 2023-06-16 | Intelligent content identification and analysis method and device based on NLP and AI |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116912845B (en) |
Citations (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101329731A (en) * | 2008-06-06 | 2008-12-24 | 南开大学 | Automatic recognition method of mathematical formulas in images |
CN101329734A (en) * | 2008-07-31 | 2008-12-24 | 重庆大学 | License plate character recognition method based on K-L transform and LS-SVM |
CN102024144A (en) * | 2010-11-23 | 2011-04-20 | 上海海事大学 | Container number identification method |
CN102509091A (en) * | 2011-11-29 | 2012-06-20 | 北京航空航天大学 | Airplane tail number recognition method |
US8699796B1 (en) * | 2008-11-11 | 2014-04-15 | Trend Micro Incorporated | Identifying sensitive expressions in images for languages with large alphabets |
CA2920795A1 (en) * | 2014-02-07 | 2015-08-13 | Cellular South, Inc Dba C Spire Wire Wireless | Video to data |
CN108009228A (en) * | 2017-11-27 | 2018-05-08 | 咪咕互动娱乐有限公司 | A kind of method to set up of content tab, device and storage medium |
WO2018086519A1 (en) * | 2016-11-08 | 2018-05-17 | 北京国双科技有限公司 | Method and device for identifying specific text information |
CN108564079A (en) * | 2018-05-08 | 2018-09-21 | 东华大学 | A kind of portable character recognition device and method |
CN109255113A (en) * | 2018-09-04 | 2019-01-22 | 郑州信大壹密科技有限公司 | Intelligent critique system |
CN109359293A (en) * | 2018-09-13 | 2019-02-19 | 内蒙古大学 | Mongolian name entity recognition method neural network based and its identifying system |
CN109376658A (en) * | 2018-10-26 | 2019-02-22 | 信雅达系统工程股份有限公司 | A kind of OCR method based on deep learning |
CN109460735A (en) * | 2018-11-09 | 2019-03-12 | 中国科学院自动化研究所 | Document binary processing method, system, device based on figure semi-supervised learning |
CN110059694A (en) * | 2019-04-19 | 2019-07-26 | 山东大学 | The intelligent identification Method of lteral data under power industry complex scene |
CN110297907A (en) * | 2019-06-28 | 2019-10-01 | 谭浩 | Generate method, computer readable storage medium and the terminal device of interview report |
CN110298376A (en) * | 2019-05-16 | 2019-10-01 | 西安电子科技大学 | A kind of bank money image classification method based on improvement B-CNN |
WO2019195736A2 (en) * | 2018-04-05 | 2019-10-10 | Chevron U.S.A. Inc. | Classification of piping and instrumental diagram information using machine-learning |
CN110363194A (en) * | 2019-06-17 | 2019-10-22 | 深圳壹账通智能科技有限公司 | Intelligently reading method, apparatus, equipment and storage medium based on NLP |
CN110457466A (en) * | 2019-06-28 | 2019-11-15 | 谭浩 | Generate method, computer readable storage medium and the terminal device of interview report |
CN110889402A (en) * | 2019-11-04 | 2020-03-17 | 广州丰石科技有限公司 | Business license content identification method and system based on deep learning |
CN111046946A (en) * | 2019-12-10 | 2020-04-21 | 昆明理工大学 | Burma language image text recognition method based on CRNN |
WO2020140386A1 (en) * | 2019-01-02 | 2020-07-09 | 平安科技(深圳)有限公司 | Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium |
WO2020147393A1 (en) * | 2019-01-17 | 2020-07-23 | 平安科技(深圳)有限公司 | Convolutional neural network-based text classification method, and related device |
CN112085024A (en) * | 2020-09-21 | 2020-12-15 | 江苏理工学院 | Tank surface character recognition method |
WO2020253043A1 (en) * | 2019-06-20 | 2020-12-24 | 平安科技(深圳)有限公司 | Intelligent text classification method and apparatus, and computer-readable storage medium |
CN112329779A (en) * | 2020-11-02 | 2021-02-05 | 平安科技(深圳)有限公司 | Method and related device for improving certificate identification accuracy based on mask |
WO2021025290A1 (en) * | 2019-08-06 | 2021-02-11 | Samsung Electronics Co., Ltd. | Method and electronic device for converting handwriting input to text |
EP3812950A1 (en) * | 2019-10-23 | 2021-04-28 | Tata Consultancy Services Limited | Method and system for creating an intelligent cartoon comic strip based on dynamic content |
CN112883980A (en) * | 2021-04-28 | 2021-06-01 | 明品云(北京)数据科技有限公司 | Data processing method and system |
WO2021137166A1 (en) * | 2019-12-30 | 2021-07-08 | L&T Technology Services Limited | Domain based text extraction |
CN113158808A (en) * | 2021-03-24 | 2021-07-23 | 华南理工大学 | Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction |
CN113191358A (en) * | 2021-05-31 | 2021-07-30 | 上海交通大学 | Metal part surface text detection method and system |
CN113361666A (en) * | 2021-06-15 | 2021-09-07 | 浪潮金融信息技术有限公司 | Handwritten character recognition method, system and medium |
CN113569833A (en) * | 2021-07-27 | 2021-10-29 | 平安科技(深圳)有限公司 | Text document-based character recognition method, device, equipment and storage medium |
CN113609292A (en) * | 2021-08-09 | 2021-11-05 | 上海交通大学 | Known false news intelligent detection method based on graph structure |
CN114265937A (en) * | 2021-12-24 | 2022-04-01 | 中国电力科学研究院有限公司 | Intelligent classification analysis method and system of scientific and technological information, storage medium and server |
CN114973228A (en) * | 2022-05-31 | 2022-08-30 | 上海交通大学 | Metal part surface text recognition method and system based on contour feature enhancement |
WO2022178919A1 (en) * | 2021-02-23 | 2022-09-01 | 西安交通大学 | Taxpayer industry classification method based on noise label learning |
CN115114908A (en) * | 2022-07-04 | 2022-09-27 | 上海交通大学 | Text intention recognition method and system based on attention mechanism, vehicle and equipment |
CN115471851A (en) * | 2022-10-11 | 2022-12-13 | 小语智能信息科技(云南)有限公司 | Burma language image text recognition method and device fused with double attention mechanism |
CN115546796A (en) * | 2022-09-22 | 2022-12-30 | 中电智元数据科技有限公司 | Non-contact data acquisition method and system based on visual computation |
US11557323B1 (en) * | 2022-03-15 | 2023-01-17 | My Job Matcher, Inc. | Apparatuses and methods for selectively inserting text into a video resume |
CN115731550A (en) * | 2022-11-23 | 2023-03-03 | 大连医科大学附属第二医院 | Deep learning-based automatic drug specification identification method and system and storage medium |
CN115862045A (en) * | 2023-02-16 | 2023-03-28 | 中国人民解放军总医院第一医学中心 | Case automatic identification method, system, equipment and storage medium based on image-text identification technology |
CN115880566A (en) * | 2022-12-16 | 2023-03-31 | 李宜义 | Intelligent marking system based on visual analysis |
CN115953788A (en) * | 2022-12-02 | 2023-04-11 | 兴业银行股份有限公司 | Green financial attribute intelligent identification method and system based on OCR (optical character recognition) and NLP (natural language processing) technologies |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10218954B2 (en) * | 2013-08-15 | 2019-02-26 | Cellular South, Inc. | Video to data |
US10372816B2 (en) * | 2016-12-13 | 2019-08-06 | International Business Machines Corporation | Preprocessing of string inputs in natural language processing |
US11295123B2 (en) * | 2017-09-14 | 2022-04-05 | Chevron U.S.A. Inc. | Classification of character strings using machine-learning |
US11218500B2 (en) * | 2019-07-31 | 2022-01-04 | Secureworks Corp. | Methods and systems for automated parsing and identification of textual data |
US11651157B2 (en) * | 2020-07-29 | 2023-05-16 | Descript, Inc. | Filler word detection through tokenizing and labeling of transcripts |
US11790025B2 (en) * | 2021-03-30 | 2023-10-17 | Streem, Llc | Extracting data and metadata from identification labels using natural language processing |
Non-Patent Citations (2)
Title |
---|
Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification; Wang, Peng et al.; Neurocomputing; 2016-12-31; vol. 174; pp. 806-814 *
Research and implementation of a hierarchical recognition algorithm for offline handwritten Chinese characters; 苟晓攀; China Master's Theses Full-text Database, Information Science and Technology; 2023-01-15; I138-2184 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108664996B (en) | Ancient character recognition method and system based on deep learning | |
CN111126386B (en) | Sequence domain adaptation method based on countermeasure learning in scene text recognition | |
CN110503103B (en) | Character segmentation method in text line based on full convolution neural network | |
CN111738169B (en) | Handwriting formula recognition method based on end-to-end network model | |
US20200134382A1 (en) | Neural network training utilizing specialized loss functions | |
CN112418320A (en) | Enterprise association relation identification method and device and storage medium | |
CN111461121A (en) | Electric meter number identification method based on YOLOv3 network | |
CN113569048A (en) | Method and system for automatically dividing affiliated industries based on enterprise operation range | |
CN116912845B (en) | Intelligent content identification and analysis method and device based on NLP and AI | |
CN116543391A (en) | Text data acquisition system and method combined with image correction | |
CN115100509B (en) | Image identification method and system based on multi-branch block-level attention enhancement network | |
CN111401434A (en) | Image classification method based on unsupervised feature learning | |
US11715288B2 (en) | Optical character recognition using specialized confidence functions | |
Nath et al. | Improving various offline techniques used for handwritten character recognition: a review | |
CN112766082B (en) | Chinese text handwriting identification method and device based on macro-micro characteristics and storage medium | |
CN114758340A (en) | Intelligent identification method, device and equipment for logistics address and storage medium | |
CN112434145A (en) | Picture-viewing poetry method based on image recognition and natural language processing | |
CN113094567A (en) | Malicious complaint identification method and system based on text clustering | |
CN113033567A (en) | Oracle bone rubbing image character extraction method fusing segmentation network and generation network | |
CN112926670A (en) | Garbage classification system and method based on transfer learning | |
CN117371533B (en) | Method and device for generating data tag rule | |
CN117496531B (en) | Construction method of convolution self-encoder capable of reducing Chinese character recognition resource overhead | |
CN112819205B (en) | Method, device and system for predicting working hours | |
CN114548325B (en) | Zero sample relation extraction method and system based on dual contrast learning | |
Alginahi | Computer analysis of composite documents with non-uniform background. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |