CN116912845B - Intelligent content identification and analysis method and device based on NLP and AI - Google Patents
- Publication number: CN116912845B (application CN202310726304.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- processing
- text information
- feature
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V 30/19173 — Character recognition using electronic means; classification techniques
- G06F 16/353 — Information retrieval of unstructured textual data; clustering; classification into predefined classes
- G06F 40/12 — Handling natural language data; use of codes for handling textual entities
- G06F 40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06N 3/045 — Neural networks; combinations of networks
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06N 3/048 — Activation functions
- G06N 3/08 — Learning methods
- G06N 3/084 — Backpropagation, e.g. using gradient descent
- G06V 10/82 — Image or video recognition or understanding using neural networks
- G06V 30/153 — Segmentation of character regions using recognition of characters or words
- G06V 30/1607 — Image preprocessing; correcting image deformation, e.g. trapezoidal deformation caused by perspective
- G06V 30/162 — Image preprocessing; quantising the image signal
- G06V 30/164 — Image preprocessing; noise filtering
- G06V 30/168 — Image preprocessing; smoothing or thinning of the pattern; skeletonisation
- G06V 30/1801 — Extraction of features; detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
- Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses an intelligent content identification and analysis method and device based on NLP and AI. The method comprises the following steps: acquiring an original file image and optimizing it with image enhancement technology to obtain an optimized file image; cutting the optimized file image into independent character images with an image-based character cutting method; recognizing the independent character images with a character recognition algorithm based on binarized images to obtain text information; preprocessing the text information to obtain preprocessed text information; vectorizing the preprocessed text information with a feature extraction method to obtain feature vectors; and inputting the feature vectors into a language model, which classifies and parses the text information. By adopting NLP and AI technology, the invention realizes intelligent content identification and analysis, provides rich information input, and supplies big data support for the intelligent operation of enterprises.
Description
Technical Field
The invention relates to the technical fields of natural language processing (NLP) and artificial intelligence (AI), and in particular to an intelligent content identification and analysis method and device based on NLP and AI.
Background
At present, AI technology is developing rapidly and AI products are widely used in people's daily lives. One technology has played an indispensable role in this development: NLP. NLP is a very important direction in the field of artificial intelligence today; its purpose is to achieve efficient communication between humans and computer programs using natural language. Traditional business systems still adopt a coarse management mode in which a whole electronic file is the unit of management. To change this mode, the invention adopts NLP and AI technologies to realize intelligent content identification and analysis and to provide rich information input for various business systems, thereby supplying big data support for the intelligent operation of enterprises.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing an intelligent content identification and analysis method and device based on NLP and AI, which use NLP and AI technology to realize intelligent content identification and analysis and to provide rich information input for various business systems, thereby supplying big data support for the intelligent operation of enterprises.
In order to solve the technical problems, an embodiment of the present invention provides an intelligent content identification and analysis method based on NLP and AI, the method comprising:
Acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
carrying out data preprocessing on the text information to obtain preprocessed text information;
performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
Optionally, the obtaining the original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image, including:
acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image;
performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing;
Performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing;
and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
Optionally, the image-based character cutting method processes the file image after the optimization processing to obtain an independent character image, and includes:
performing binarization processing on the file image after optimization processing to obtain a binarized image;
thinning the binarized image based on a thinning algorithm to obtain a character skeleton;
and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
Optionally, the binary image-based character recognition algorithm performs recognition processing on the independent character image to obtain text information, and includes:
performing binarization processing on the independent character image to obtain a binarized independent character image;
extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises lattice features, feature lines and grid features;
Inputting the characteristic information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters;
and carrying out fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Optionally, the performing data preprocessing on the text information to obtain preprocessed text information includes:
performing corpus cleaning on the text information to obtain corpus-cleaned text information;
performing word segmentation on the corpus-cleaned text information to obtain word-segmented text information;
and performing part-of-speech tagging on the word-segmented text information to obtain part-of-speech-tagged text information.
Optionally, the text vectorization processing is performed on the preprocessed text information based on the feature extraction method to obtain feature vectors, which includes:
carrying out numerical processing on the preprocessed text information by adopting an Embedding algorithm to obtain word vectors;
and inputting the word vector into a CNN model for feature extraction to obtain a feature vector.
Optionally, the structure of the CNN model includes:
the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer inputs the word vector; the convolution layer carries out convolution processing on the input word vector by using a filter to obtain a convolved word vector, wherein the used activation function is a Relu function; the pooling layer performs pooling treatment on the convolved word vectors, and adds dropout regularization to obtain feature vectors; the output layer outputs the extracted feature vector.
Optionally, the inputting the feature vector into a language model, and performing classification and parsing processing of text information based on the language model includes:
constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer;
training the initial language model to obtain a trained language model;
and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
Optionally, the training process of the language model includes:
collecting text information data, and labeling the text information data to obtain labeled text information data;
performing text vectorization processing on the text information data with the labels to obtain feature vectors with the labels;
importing the feature vector with the label into an initial language model;
dividing the feature vector with the label in the initial language model into a training set, a verification set and a test set;
training the initial language model by using a training set, verifying the accuracy of the model by using a verification set after each round of training, and adjusting model parameters according to the result of the verification set to obtain a trained language model;
testing the trained language model by using a test set, and verifying the generalization capability and accuracy of the trained language model;
and optimizing the model based on the result of the test set.
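The division of labelled feature vectors into training, validation and test sets can be sketched as follows; the 70/15/15 ratios and the random seed are illustrative assumptions, since the text does not specify them:

```python
import numpy as np

def split_dataset(X, y, ratios=(0.7, 0.15, 0.15), seed=42):
    """Shuffle and partition labelled feature vectors into training,
    validation and test sets (ratios and seed are assumed values)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(len(X) * ratios[0])
    n_val = int(len(X) * ratios[1])
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

# 100 labelled feature vectors -> 70 / 15 / 15 split.
X = np.arange(100, dtype=float).reshape(100, 1)
y = (X[:, 0] > 49).astype(int)
train, val, test = split_dataset(X, y)
print(len(train[0]), len(val[0]), len(test[0]))  # -> 70 15 15
```

The validation split drives the per-round parameter adjustment described above, while the held-out test split measures generalization.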
In addition, the embodiment of the invention also provides an intelligent content recognition and analysis device based on NLP and AI, which comprises:
and the optimization processing module is used for: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
and a character cutting module: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
and a character recognition module: performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
and a pretreatment module: carrying out data preprocessing on the text information to obtain preprocessed text information;
and the feature extraction module is used for: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
classification and analysis module: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the embodiment of the invention, NLP and AI technologies give the recognized text content better recognition accuracy and greatly reduce errors caused by content identification. Based on a deep learning method, intelligent analysis and application of massive data content is achieved in a short time through three stages: corpus preprocessing, model design, and model training. This improves data quality, reduces manual entry work and business processes, improves work efficiency, and reduces labor input, while providing rich information input for various business systems and thereby big data support for the intelligent operation of enterprises.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of an intelligent content recognition and analysis method based on NLP and AI;
FIG. 2 is a schematic structural diagram of an intelligent content recognition and analysis device based on NLP and AI.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of an intelligent content recognition and analysis method based on NLP and AI in an embodiment of the invention.
As shown in fig. 1, a method for identifying and analyzing intelligent content based on NLP and AI, the method comprising:
s11: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
in the implementation process of the invention, the method for obtaining the original file image and optimizing the original file image by using the image enhancement processing technology to obtain the optimized file image comprises the following steps: acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image; performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing; performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing; and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
Specifically, an original file image is acquired and denoised with a median filter, which replaces the value at each point of the image with the median of the values in that point's neighbourhood; since surrounding pixel values are close to the true value, isolated noise points are eliminated, yielding a denoised file image. Edge enhancement is then performed on the denoised file image based on a Hilbert transform: the object field of the image is modulated in the frequency domain, a Fourier transform is applied, the imaginary part is taken along with its absolute value, the 1/n power is computed after summing along the frequency axis, and edges are enhanced along the vertical and horizontal axes respectively, producing the edge-enhanced file image. Bending correction is then applied based on an offset-field method: a deformation-correction network produces an offset field that displaces each pixel of the file image accordingly, yielding the bend-corrected file image. Finally, handwriting erasure is performed based on a handwriting-erasure clustering method: the data nearest the mean at each centroid of the image are repeatedly taken so that similar objects are automatically grouped into the same cluster, the loop runs until clustering is complete, and the parts of the image requiring handwriting erasure are then corrected, giving the optimized file image. Processing the image with enhancement technology strengthens the useful information of the image and improves its quality for further processing.
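As a minimal illustration of the median-filtering step, the pure-NumPy sketch below replaces each pixel with the median of its 3×3 neighbourhood, removing an isolated noise point exactly as the text describes. Border handling by reflection is an implementation choice, not from the source:

```python
import numpy as np

def median_filter(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Replace each pixel with the median of its k x k neighbourhood;
    border pixels are handled by reflecting the image edge."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            out[y, x] = np.median(padded[y:y + k, x:x + k])
    return out

# A flat 5x5 image with one isolated noise spike: the filter removes it.
img = np.full((5, 5), 10, dtype=np.uint8)
img[2, 2] = 255
print(median_filter(img)[2, 2])  # -> 10
```

Production code would typically call an optimized routine (e.g. an image library's median blur) rather than this per-pixel loop.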
S12: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
in the implementation process of the invention, the method for cutting the characters based on the images processes the file images after the optimization processing to obtain independent character images, and comprises the following steps: performing binarization processing on the file image after optimization processing to obtain a binarized image; carrying out refinement treatment on the binary image based on a refinement algorithm to obtain a character skeleton; and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
Specifically, the optimized file image is binarized: the image is converted to grayscale, and for a grayscale image with 256 brightness levels a suitable threshold is chosen (for example half of the gray-level range, or the mean of the image), yielding a binarized image that still reflects the global and local characteristics of the image and is convenient for the subsequent thinning algorithm. The thinning algorithm first sets a neighbourhood template and tests each pixel of the binarized image against four conditions: first, the number of black neighbours of the central pixel lies in [2, 6]; second, deleting the pixel does not affect the connectivity of the image; third, deleting the pixel does not break a horizontal line; fourth, deleting the pixel does not break a vertical line. A pixel satisfying all four conditions is deleted, and the deletion is repeated until no pixel satisfies them. After the character skeleton is obtained by the thinning algorithm, crossing points are located on the skeleton and an optimal interval is set; the character images are superimposed according to their positions on the original image; a dividing line is obtained using a thickening process; the distance of each crossing point from the dividing line is judged against a defined tolerance, crossing points within the tolerance being kept as necessary and the rest eliminated; and the cutting operation is performed to obtain independent character images. Cutting the image prepares it for character recognition, so that the subsequent recognition processing can be performed more quickly.
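A minimal sketch of the binarization and cutting stages: thresholding at the image mean (one of the threshold choices the text mentions) and, in place of the skeleton and crossing-point procedure, a simpler column-projection cut that splits on empty columns — a stand-in for illustration, not the patented method:

```python
import numpy as np

def binarize(gray: np.ndarray) -> np.ndarray:
    """Threshold at the image mean; 1 = ink, 0 = background."""
    return (gray < gray.mean()).astype(np.uint8)

def cut_characters(binary: np.ndarray) -> list:
    """Split on empty columns of the vertical projection — a simple
    stand-in for the skeleton/crossing-point cutting described above."""
    cols = binary.sum(axis=0)
    segments, start = [], None
    for x, c in enumerate(cols):
        if c > 0 and start is None:
            start = x
        elif c == 0 and start is not None:
            segments.append(binary[:, start:x])
            start = None
    if start is not None:
        segments.append(binary[:, start:])
    return segments

# Two dark "characters" on a light background, separated by blank columns.
gray = np.full((5, 7), 255, dtype=np.uint8)
gray[:, 0:2] = 0
gray[:, 4:6] = 0
chars = cut_characters(binarize(gray))
print(len(chars))  # -> 2
```

Projection cutting fails on touching or overlapping characters, which is precisely what the skeleton-based crossing-point method in the text is designed to handle.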
S13: performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
in the implementation process of the invention, the character recognition algorithm based on the binary image carries out recognition processing on the independent character image to obtain text information, and the method comprises the following steps: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises lattice features, feature lines and grid features; inputting the characteristic information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters; and carrying out fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Specifically, binarizing the independent character image, graying the independent character image, and then selecting a proper threshold value for the gray level image with 256 brightness levels, wherein the threshold value can be selected from half of a gray level range or an average value of the image, and the like, so that the binarized independent character image which can still reflect the whole and local characteristics of the image is obtained, and the subsequent characteristic extraction processing can be continued after the image is binarized; extracting dot matrix features, feature lines and grid features of the binary independent character image based on a feature extraction method, placing the binary independent character image in a coordinate system, determining a dot value according to the existence or non-existence of pixels on the coordinate system, splicing each eight pixel values into bytes to obtain dot matrix features, counting the number of line segments of even lines and columns to form a two-dimensional feature vector, recombining dot matrixes to obtain feature lines, partitioning the dot matrix structure, counting the number of foreground pixels in each partition to serve as statistical features to obtain grid features, wherein the dot matrix features reflect the overall features of the character image, the feature lines and the grid features reflect the local features of the character image, and the features have complementary relations; training a sample by using a BP neural network and using a log sigmode linear function, and setting the maximum iteration times, wherein the structure of the BP neural network comprises an input layer, a hidden layer and an output layer, and characteristic information of a binarized independent character image is input into the trained BP neural network to perform character classification processing to obtain classified characters; the classified characters are subjected to fine classification recognition 
processing based on a hierarchical recognition algorithm: classification is first carried out through clustering processing, the classified character images are input into the corresponding nodes of a tree structure to complete coarse classification, probabilistic averaging functions are used for integration, and the sub-nodes of the coarse classification results are further classified to obtain recognition results and thus text information; better recognition accuracy can be obtained through the hierarchical recognition algorithm.
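As a rough illustration of the binarization threshold and the grid features described above, the following Python sketch binarizes a toy character image and counts foreground pixels per partition; the grid size and the mean-value threshold are illustrative assumptions, not the patent's exact parameters:

```python
import numpy as np

def binarize(gray, threshold=None):
    """Binarize a grayscale character image (0-255).

    By default the threshold is the image mean, matching the
    "half of the gray range or image average" heuristic above.
    """
    if threshold is None:
        threshold = gray.mean()
    return (gray > threshold).astype(np.uint8)

def grid_features(binary, rows=4, cols=4):
    """Partition the binary image into a rows x cols grid and count
    foreground pixels per cell (the 'grid features' described above)."""
    h, w = binary.shape
    feats = []
    for i in range(rows):
        for j in range(cols):
            cell = binary[i * h // rows:(i + 1) * h // rows,
                          j * w // cols:(j + 1) * w // cols]
            feats.append(int(cell.sum()))
    return feats

# toy 8x8 "character": a vertical bar
img = np.zeros((8, 8), dtype=np.uint8)
img[2:6, 3:5] = 255
b = binarize(img)
print(grid_features(b, 2, 2))   # one count per quadrant
```

Grid features of this kind are robust local statistics that complement the global dot matrix features, which is the complementary relationship the text mentions.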
S14: carrying out data preprocessing on the text information to obtain preprocessed text information;
in the implementation process of the invention, the data preprocessing of the text information to obtain preprocessed text information comprises the following steps: carrying out corpus cleaning treatment on the text information to obtain text information after corpus cleaning treatment; word segmentation is carried out on the text information after data cleaning, and the text information after word segmentation is obtained; and performing part-of-speech tagging on the text information subjected to word segmentation processing to obtain the text information subjected to part-of-speech tagging.
Specifically, corpus cleaning is carried out on the text information: character strings in the text information are matched against rule-conforming patterns using regular matching rules, and special characters, repeated data and stop words are removed to obtain the text information after corpus cleaning; word segmentation processing is carried out on the text information after data cleaning: the reliability of word formation is reflected by the frequency with which characters occur adjacently, the frequency of character combinations that commonly appear adjacently in the corpus is counted, and when the combination frequency is higher than a critical value, the combination can be considered to form a word, yielding the text information after word segmentation processing; part-of-speech tagging is carried out on the text information after word segmentation: part-of-speech tagging is treated as a sequence tagging task in which, given a sequence of units, a tag is assigned to each unit in the sequence, the probability distribution over possible tag sequences is computed, the best tag sequence is selected, and the most probable part of speech of each word is determined and tagged, yielding the text information after part-of-speech tagging; preprocessing the text information makes it cleaner, more accurate and more reliable.
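The corpus-cleaning and frequency-based segmentation steps above can be sketched as follows; the regular expression, the stop-word list and the critical value are illustrative assumptions, not the patent's actual rules:

```python
import re
from collections import Counter

def clean(text, stop_words=frozenset({"the", "a"})):
    """Corpus cleaning sketch: keep only letter/CJK runs via a regular
    expression, then drop stop words and repeated tokens."""
    tokens = re.findall(r"[A-Za-z\u4e00-\u9fff]+", text.lower())
    seen, out = set(), []
    for t in tokens:
        if t in stop_words or t in seen:
            continue
        seen.add(t)
        out.append(t)
    return out

def adjacent_pairs(corpus_tokens, threshold=2):
    """Frequency-based segmentation sketch: count adjacent token pairs
    and keep combinations whose adjacent co-occurrence count reaches
    the critical value, treating them as candidate words."""
    pairs = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    return [p for p, n in pairs.items() if n >= threshold]

print(clean("The cat saw the cat!"))
print(adjacent_pairs(["big", "data", "big", "data"]))
```

A production system would use a much larger corpus and a tuned critical value; the point here is only the counting mechanism the text describes.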
S15: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
in the implementation process of the invention, the text vectorization processing is carried out on the preprocessed text information based on the feature extraction method to obtain the feature vector, which comprises the following steps: carrying out numerical processing on the preprocessed text information by adopting an Embedding algorithm to obtain word vectors; and inputting the word vector into a CNN model for feature extraction to obtain a feature vector.
Specifically, an Embedding algorithm is adopted to perform numerical processing on the preprocessed text information: each word of the preprocessed text information is taken as a feature column through a one-hot function and mapped into a word vector space to obtain an original word vector, and dimension reduction is then performed on the original word vector to obtain a word vector that can be fed to the input layer of a CNN model; the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer receives the word vector; the convolution layer performs convolution processing on the input word vector using filters to obtain a convolved word vector, with the ReLU function as the activation function; the pooling layer performs pooling processing on the convolved word vectors, with dropout regularization added, to obtain feature vectors; the output layer outputs the extracted feature vector; the one-dimensional text information is thereby converted into a two-dimensional input vector to meet the input requirement of the model.
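The Embedding step described above (one-hot mapping followed by dimension reduction) amounts to a row lookup in an embedding matrix; the toy vocabulary, the reduced dimension and the random matrix below are assumptions for illustration only:

```python
import numpy as np

# One-hot each word over the vocabulary, then multiply by a
# (vocab_size x d) matrix: mathematically this is exactly E[index].
vocab = {"intelligent": 0, "content": 1, "analysis": 2}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 4))      # reduced dimension d = 4

def embed(words):
    """Map words to dense vectors via one_hot @ E (== row lookup)."""
    one_hot = np.eye(len(vocab))[[vocab[w] for w in words]]
    return one_hot @ E                    # shape (len(words), d)

vecs = embed(["content", "analysis"])     # the 2-D input the CNN expects
```

This also shows why a sentence becomes a two-dimensional input: a sequence of n words yields an (n x d) matrix, as the text requires for the CNN input layer.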
S16: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the implementation process of the invention, the step of inputting the feature vector into a language model and carrying out classification and analysis processing of text information based on the language model comprises the following steps: constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer; training the initial language model to obtain a trained language model; and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
Specifically, the language model structure includes: the input layer, which receives the feature vector; the convolution layer, which performs convolution processing on the input feature vector using the ReLU activation function, where the convolution layer comprises several convolution kernels and the kernels slide in only one direction; the pooling layer, which performs max-pooling processing on the convolved feature vector, carrying out the max-pooling operation in only one direction; the fully connected layer, which cascades the pooled feature vectors, with dropout regularization added to prevent overfitting; and the output layer, which uses a softmax activation function to compress the vector to the dimensionality of the number of categories, obtaining the probability of the text belonging to each category and thereby the text semantic information; the TextCNN model has the advantages of a relatively simple structure and fast training, and can achieve good results.
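A minimal forward-pass sketch of such a TextCNN follows: each kernel spans the full embedding width and slides in only one direction, 1-max pooling acts along that same direction, and softmax produces class probabilities. All weights here are random illustrative values, not a trained model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def textcnn_forward(X, kernels, W_fc, b_fc):
    """Forward pass of a TextCNN-style network.

    X: (seq_len, d) word-vector matrix. Each kernel K has shape (h, d),
    so convolution slides only along the sequence axis; pooling is
    1-max over that axis, one scalar per kernel.
    """
    pooled = []
    for K in kernels:
        h = K.shape[0]
        conv = np.array([relu((X[i:i + h] * K).sum())
                         for i in range(X.shape[0] - h + 1)])
        pooled.append(conv.max())          # 1-max pooling
    feat = np.array(pooled)                # concatenated (FC input)
    return softmax(W_fc @ feat + b_fc)     # class probabilities

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 8))               # 10 words, d = 8
kernels = [rng.normal(size=(h, 8)) for h in (2, 3, 4)]
probs = textcnn_forward(X, kernels, rng.normal(size=(3, 3)), np.zeros(3))
```

Dropout is omitted here because it is only active during training; at inference the forward pass is exactly this pipeline.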
Specifically, the training process of the language model includes: collecting text information data, and labeling the text information data to obtain labeled text information data; performing text vectorization processing on the labeled text information data to obtain labeled feature vectors; importing the labeled feature vectors into an initial language model; dividing the labeled feature vectors in the initial language model into a training set, a verification set and a test set; training the initial language model with the training set, verifying the accuracy of the model with the verification set after each round of training, and adjusting the model parameters according to the verification-set results to obtain a trained language model; testing the trained language model with the test set to verify its generalization capability and accuracy; and optimizing the model based on the test-set results, wherein optimizing the model comprises: for the partitioned data set, performing forward propagation layer by layer from the input layer, weighting each layer and passing the result through a nonlinear function, namely the activation function, to obtain a preliminary output; calculating an error value with the loss function and back-propagating according to the error value; and updating the parameters of the model with the optimizer at the end of each batch.
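The training loop described above (data split, forward propagation through an activation, loss, back-propagation, optimizer update) can be sketched on a logistic-regression stand-in rather than the full TextCNN; the synthetic data and hyperparameters are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # synthetic labels

# train / verification / test split
X_tr, X_val, X_te = X[:200], X[200:250], X[250:]
y_tr, y_val, y_te = y[:200], y[200:250], y[250:]

w, b, lr = np.zeros(5), 0.0, 0.1
for epoch in range(50):
    p = 1 / (1 + np.exp(-(X_tr @ w + b)))     # forward propagation
    grad = p - y_tr                           # gradient of cross-entropy
    w -= lr * X_tr.T @ grad / len(y_tr)       # optimizer update (plain SGD)
    b -= lr * grad.mean()

def accuracy(Xs, ys):
    pred = 1 / (1 + np.exp(-(Xs @ w + b))) > 0.5
    return (pred == ys).mean()

print(accuracy(X_val, y_val), accuracy(X_te, y_te))
```

In the real pipeline the verification-set accuracy would drive hyperparameter adjustment between rounds, and the test set would be held out until the end, exactly as the text prescribes.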
In the embodiment of the invention, NLP and AI technologies are adopted so that better recognition accuracy is obtained for the recognized text content and errors caused by content recognition are greatly reduced; based on a deep learning method, intelligent analysis and application of massive data content within a short time is realized through the three stages of corpus preprocessing, model design and model training, which improves data quality, reduces manual entry work and business processes, improves work efficiency, reduces labor input, and provides rich information input for various business systems, thereby providing big data support for the intelligent operation of enterprises.
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of an intelligent content recognition and analysis device based on NLP and AI in an embodiment of the invention.
As shown in fig. 2, an intelligent content recognition and analysis device based on NLP and AI comprises:
Optimization processing module 21: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
In the implementation process of the invention, the method for obtaining the original file image and optimizing the original file image by using the image enhancement processing technology to obtain the optimized file image comprises the following steps: acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image; performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing; performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing; and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
Specifically, an original file image is obtained and denoised using the median filtering method, in which the value of a point in the image is replaced by the median of the point values in a neighborhood of that point, so that surrounding pixel values approach the true value and isolated noise points are eliminated, yielding the denoised file image; enhanced image edge processing is performed on the denoised file image based on the Hilbert transform method: the object field of the file image is first modulated in the frequency domain, a Fourier transform is applied, the imaginary part of the transformed frequency domain is taken in absolute value, the 1/n power is computed after summing along the frequency, and edge enhancement is carried out along the vertical and horizontal axes respectively to obtain the file image after enhanced image edge processing; bending correction processing is performed on the edge-enhanced file image based on the offset field method: an offset field is first formed by a deformation correction network, and the offset field applies a corresponding displacement to each pixel of the file image to obtain the file image after bending correction processing; handwriting erasure processing is performed on the bend-corrected file image based on the handwriting erasure clustering method: by repeatedly taking the data nearest to the mean from the image centroid, similar objects are automatically grouped into the same cluster, and this is executed in a loop until clustering is complete, after which the parts of the image requiring handwriting erasure are corrected to obtain the file image after optimization processing; processing the image with image enhancement technology enhances the useful information of the image and improves its quality, so that the image can be further processed.
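The median filtering step can be sketched as follows; the kernel size and reflection padding are illustrative choices, not requirements of the invention:

```python
import numpy as np

def median_filter(img, k=3):
    """Median-filter sketch: replace each pixel with the median of its
    k x k neighborhood (edges handled by reflection padding), which
    suppresses isolated noise points as described above."""
    pad = k // 2
    padded = np.pad(img, pad, mode="reflect")
    out = np.empty_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.median(padded[i:i + k, j:j + k])
    return out

img = np.full((5, 5), 100, dtype=np.uint8)
img[2, 2] = 255                            # isolated noise point
filtered = median_filter(img)
print(filtered[2, 2])                      # noise point replaced by 100
```

Unlike mean filtering, the median is insensitive to a single outlier in the window, which is why isolated noise points vanish while edges are largely preserved.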
Character cutting module 22: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
in the implementation process of the invention, the method for cutting the characters based on the images processes the file images after the optimization processing to obtain independent character images, and comprises the following steps: performing binarization processing on the file image after optimization processing to obtain a binarized image; carrying out refinement treatment on the binary image based on a refinement algorithm to obtain a character skeleton; and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
Specifically, the optimized file image is binarized: the image is grayed, and a proper threshold value is then selected for the gray-level image with 256 brightness levels, where the threshold can be chosen as half of the gray-level range or the average value of the image, yielding a binarized image that still reflects the overall and local characteristics of the image and is convenient for subsequent processing by the thinning algorithm; the binarized image is thinned based on the thinning algorithm: a neighborhood template is first set, and each pixel point of the binarized image is judged against the following conditions: the first condition ensures that the number of black points in the neighborhood of the central pixel lies in [2,6]; the second condition ensures that deleting the pixel point does not affect the connectivity of the image; the third condition is that deleting the pixel point does not produce a horizontal line fracture; the fourth condition is that deleting the pixel point does not produce a vertical line fracture; the pixel point is deleted when all four conditions are met, and the deletion operation is repeated until no pixel point satisfying the conditions remains; after the character skeleton is obtained with the thinning algorithm, crossing points are located according to the character skeleton, an optimal interval is set, the character images are superposed according to their positional relation on the original image, a dividing line is obtained using thickening processing, the distance of each crossing point from the dividing line is judged, and a fault tolerance is defined for this length: crossing points within the fault tolerance are necessary crossing points, unnecessary crossing points are eliminated, and the cutting operation is performed to obtain independent character images; cutting the image prepares for subsequent character recognition, so that the subsequent recognition processing can be performed more quickly.
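A sketch of the thinning conditions, assuming a Zhang-Suen-style formulation: the first condition is as stated (black-neighbor count in [2,6]), while a single 0-to-1 transition count around the pixel stands in for the patent's connectivity and fracture conditions; the exact conditions of the invention may differ:

```python
import numpy as np

def neighbors(img, r, c):
    """8-neighborhood in clockwise order starting above the pixel."""
    return [img[r-1, c], img[r-1, c+1], img[r, c+1], img[r+1, c+1],
            img[r+1, c], img[r+1, c-1], img[r, c-1], img[r-1, c-1]]

def deletable(img, r, c):
    """Check the thinning conditions for one foreground pixel:
    (1) black-neighbor count lies in [2, 6];
    (2) exactly one 0->1 transition around the pixel, so deletion
        neither breaks connectivity nor fractures a line."""
    if img[r, c] != 1:
        return False
    n = neighbors(img, r, c)
    black = sum(n)
    transitions = sum(1 for a, b in zip(n, n[1:] + n[:1])
                      if a == 0 and b == 1)
    return 2 <= black <= 6 and transitions == 1

img = np.zeros((5, 5), dtype=int)
img[1:4, 1:4] = 1                 # 3x3 foreground blob
print(deletable(img, 1, 1))       # corner pixel: removable
print(deletable(img, 2, 2))       # interior pixel: kept (8 neighbors)
```

A full thinning pass would sweep the image repeatedly, deleting all pixels flagged deletable, until a pass deletes nothing and only the one-pixel-wide skeleton remains.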
Character recognition module 23: performing recognition processing on the independent character images based on a character recognition algorithm of the binary images to obtain text information;
in the implementation process of the invention, the character recognition algorithm based on the binary image carries out recognition processing on the independent character image to obtain text information, and the method comprises the following steps: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises lattice features, feature lines and grid features; inputting the characteristic information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters; and carrying out fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Specifically, the independent character image is binarized: the image is first grayed, and a proper threshold value is then selected for the gray-level image with 256 brightness levels, where the threshold can be chosen as half of the gray-level range or the average value of the image, so that a binarized independent character image which still reflects the overall and local characteristics of the image is obtained, and the subsequent feature extraction processing can continue after the image is binarized; dot matrix features, feature lines and grid features of the binarized independent character image are extracted based on the feature extraction method: the binarized independent character image is placed in a coordinate system, a dot value is determined according to the presence or absence of a pixel at each coordinate, and every eight pixel values are spliced into a byte to obtain the dot matrix features; the number of line segments in even rows and columns is counted to form a two-dimensional feature vector, and the dot matrices are recombined to obtain the feature lines; the dot matrix structure is partitioned, and the number of foreground pixels in each partition is counted as a statistical feature to obtain the grid features, wherein the dot matrix features reflect the overall characteristics of the character image, the feature lines and grid features reflect its local characteristics, and these features are complementary; samples are trained with a BP neural network using a log-sigmoid (logsig) transfer function, with a maximum number of iterations set, wherein the structure of the BP neural network comprises an input layer, a hidden layer and an output layer, and the feature information of the binarized independent character image is input into the trained BP neural network for character classification processing to obtain classified characters; the classified characters are subjected to fine classification recognition 
processing based on a hierarchical recognition algorithm: classification is first carried out through clustering processing, the classified character images are input into the corresponding nodes of a tree structure to complete coarse classification, probabilistic averaging functions are used for integration, and the sub-nodes of the coarse classification results are further classified to obtain recognition results and thus text information; better recognition accuracy can be obtained through the hierarchical recognition algorithm.
Preprocessing module 24: carrying out data preprocessing on the text information to obtain preprocessed text information;
in the implementation process of the invention, the data preprocessing of the text information to obtain preprocessed text information comprises the following steps: carrying out corpus cleaning treatment on the text information to obtain text information after corpus cleaning treatment; word segmentation is carried out on the text information after data cleaning, and the text information after word segmentation is obtained; and performing part-of-speech tagging on the text information subjected to word segmentation processing to obtain the text information subjected to part-of-speech tagging.
Specifically, corpus cleaning is carried out on the text information: character strings in the text information are matched against rule-conforming patterns using regular matching rules, and special characters, repeated data and stop words are removed to obtain the text information after corpus cleaning; word segmentation processing is carried out on the text information after data cleaning: the reliability of word formation is reflected by the frequency with which characters occur adjacently, the frequency of character combinations that commonly appear adjacently in the corpus is counted, and when the combination frequency is higher than a critical value, the combination can be considered to form a word, yielding the text information after word segmentation processing; part-of-speech tagging is carried out on the text information after word segmentation: part-of-speech tagging is treated as a sequence tagging task in which, given a sequence of units, a tag is assigned to each unit in the sequence, the probability distribution over possible tag sequences is computed, the best tag sequence is selected, and the most probable part of speech of each word is determined and tagged, yielding the text information after part-of-speech tagging; preprocessing the text information makes it cleaner, more accurate and more reliable.
Feature extraction module 25: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
in the implementation process of the invention, the text vectorization processing is carried out on the preprocessed text information based on the feature extraction method to obtain the feature vector, which comprises the following steps: carrying out numerical processing on the preprocessed text information by adopting an Embedding algorithm to obtain word vectors; and inputting the word vector into a CNN model for feature extraction to obtain a feature vector.
Specifically, an Embedding algorithm is adopted to perform numerical processing on the preprocessed text information: each word of the preprocessed text information is taken as a feature column through a one-hot function and mapped into a word vector space to obtain an original word vector, and dimension reduction is then performed on the original word vector to obtain a word vector that can be fed to the input layer of a CNN model; the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer receives the word vector; the convolution layer performs convolution processing on the input word vector using filters to obtain a convolved word vector, with the ReLU function as the activation function; the pooling layer performs pooling processing on the convolved word vectors, with dropout regularization added, to obtain feature vectors; the output layer outputs the extracted feature vector; the one-dimensional text information is thereby converted into a two-dimensional input vector to meet the input requirement of the model.
Classification and parsing module 26: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the implementation process of the invention, the step of inputting the feature vector into a language model and carrying out classification and analysis processing of text information based on the language model comprises the following steps: constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer; training the initial language model to obtain a trained language model; and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
Specifically, the language model structure includes: the input layer, which receives the feature vector; the convolution layer, which performs convolution processing on the input feature vector using the ReLU activation function, where the convolution layer comprises several convolution kernels and the kernels slide in only one direction; the pooling layer, which performs max-pooling processing on the convolved feature vector, carrying out the max-pooling operation in only one direction; the fully connected layer, which cascades the pooled feature vectors, with dropout regularization added to prevent overfitting; and the output layer, which uses a softmax activation function to compress the vector to the dimensionality of the number of categories, obtaining the probability of the text belonging to each category and thereby the text semantic information; the TextCNN model has the advantages of a relatively simple structure and fast training, and can achieve good results.
Specifically, the training process of the language model includes: collecting text information data, and labeling the text information data to obtain labeled text information data; performing text vectorization processing on the labeled text information data to obtain labeled feature vectors; importing the labeled feature vectors into an initial language model; dividing the labeled feature vectors in the initial language model into a training set, a verification set and a test set; training the initial language model with the training set, verifying the accuracy of the model with the verification set after each round of training, and adjusting the model parameters according to the verification-set results to obtain a trained language model; testing the trained language model with the test set to verify its generalization capability and accuracy; and optimizing the model based on the test-set results, wherein optimizing the model comprises: for the partitioned data set, performing forward propagation layer by layer from the input layer, weighting each layer and passing the result through a nonlinear function, namely the activation function, to obtain a preliminary output; calculating an error value with the loss function and back-propagating according to the error value; and updating the parameters of the model with the optimizer at the end of each batch.
In the embodiment of the invention, NLP and AI technologies are adopted so that better recognition accuracy is obtained for the recognized text content and errors caused by content recognition are greatly reduced; based on a deep learning method, intelligent analysis and application of massive data content within a short time is realized through the three stages of corpus preprocessing, model design and model training, which improves data quality, reduces manual entry work and business processes, improves work efficiency, reduces labor input, and provides rich information input for various business systems, thereby providing big data support for the intelligent operation of enterprises.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, which may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk, optical disk, and the like.
In addition, the intelligent content identification and analysis method and device based on NLP and AI provided by the embodiments of the invention have been described in detail above; specific examples are used herein to explain the principles and embodiments of the invention, and the above description of the embodiments is provided only to help understand the method and core concepts of the invention; meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application according to the ideas of the invention, and in view of the above, the content of this description should not be construed as limiting the invention.
Claims (8)
1. An intelligent content identification and analysis method based on NLP and AI, which is characterized by comprising the following steps:
acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
the character recognition algorithm based on the binary image performs recognition processing on the independent character image to obtain text information, and the method comprises the following steps: performing binarization processing on the independent character image to obtain a binarized independent character image, extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot matrix features, feature lines and grid features, placing the binarized independent character image in a coordinate system, splicing pixel values of the binarized independent character image in the coordinate system to obtain dot matrix features, recombining two-dimensional feature vectors generated by the dot matrix features to obtain feature lines, performing structure partitioning processing on the dot matrix features to obtain grid features, inputting feature information of the binarized independent character image into a BP neural network to perform character classification processing to obtain classified characters, performing fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm, classifying the classified characters based on a clustering method, inputting the classified character images into nodes corresponding to a tree structure to perform coarse classification, integrating the coarse classification results based on probabilistic averaging functions, and classifying the integrated character images by using sub-nodes of the coarse classification results to obtain text information;
Carrying out data preprocessing on the text information to obtain preprocessed text information;
performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain a feature vector, performing text vectorization processing on the preprocessed text information based on the feature extraction method to obtain the feature vector, wherein the method comprises the following steps: performing numerical processing on the preprocessed text information by adopting an Embedding algorithm, mapping each word in the preprocessed text information into a word vector space based on a one-hot function, obtaining an original word vector, performing dimension reduction processing on the original word vector, obtaining a word vector, inputting the word vector into a CNN model, and performing feature extraction to obtain a feature vector;
inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
2. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the obtaining the original document image and optimizing the original document image using an image enhancement processing technique to obtain an optimized document image comprises:
Acquiring an original file image, and denoising the original file image by using a median filtering method to acquire a denoised file image;
performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing;
performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing;
and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
3. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the image-based character cutting method processes the optimized file image to obtain an independent character image, comprising:
performing binarization processing on the file image after optimization processing to obtain a binarized image;
performing thinning processing on the binarized image based on a thinning algorithm to obtain a character skeleton;
and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
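The binarize-then-cut flow of claim 3 can be illustrated with a fixed-threshold binarization followed by cutting at blank columns. The column-projection cut below is a simple stand-in for the adaptive, skeleton-guided cutting in the claim, and the pixel data is invented.

```python
def binarize(img, thresh=128):
    """Fixed-threshold binarization: 1 = ink (dark pixel), 0 = background."""
    return [[1 if px < thresh else 0 for px in row] for row in img]

def cut_columns(binary):
    """Return (start, end) column spans that contain ink."""
    w = len(binary[0])
    ink = [any(row[x] for row in binary) for x in range(w)]
    spans, start = [], None
    for x, has_ink in enumerate(ink):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, w))
    return spans

# Two one-column "characters" separated by a blank column.
gray = [[0, 255, 0],
        [0, 255, 0]]
print(cut_columns(binarize(gray)))  # one span per independent character
```

A skeleton-based method would refine these spans, e.g. splitting touching characters where the skeleton pinches to a single stroke.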
4. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the performing data preprocessing on the text information to obtain preprocessed text information comprises:
performing corpus cleaning processing on the text information to obtain text information after corpus cleaning processing;
performing word segmentation processing on the text information after corpus cleaning processing to obtain text information after word segmentation processing;
and performing part-of-speech tagging on the text information subjected to word segmentation processing to obtain the text information subjected to part-of-speech tagging.
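The claim-4 chain of cleaning, segmentation and part-of-speech tagging can be sketched as below. A real system would use a proper tokenizer and tagger; the regex cleaning, whitespace segmentation, and the tiny lexicon here are assumptions made only for the example.

```python
import re

def clean(text):
    """Corpus cleaning: strip HTML-style markup and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)   # drop markup tags
    return re.sub(r"[^\w\s]", " ", text)   # drop punctuation

def segment(text):
    """Toy word segmentation by whitespace."""
    return text.split()

pos_dict = {"recognition": "n", "intelligent": "adj"}  # hypothetical lexicon

def tag(words):
    """Toy part-of-speech tagging via dictionary lookup."""
    return [(w, pos_dict.get(w.lower(), "unk")) for w in words]

tokens = segment(clean("<b>Intelligent</b> recognition!"))
print(tag(tokens))
```

For Chinese text, where words are not whitespace-delimited, the segmentation step would instead use a dictionary- or model-based segmenter.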
5. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the structure of CNN model comprises:
the CNN model structure comprises an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer inputs the word vector; the convolution layer performs convolution processing on the input word vector by using a filter to obtain a convolved word vector, wherein the activation function used is the ReLU function; the pooling layer performs pooling processing on the convolved word vector and adds dropout regularization to obtain the feature vector; and the output layer outputs the extracted feature vector.
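The convolution, ReLU and pooling steps of the claim-5 structure can be sketched on a toy one-dimensional sequence. The filter weights and inputs are illustrative, not trained values, and dropout is omitted because it is a no-op at inference time.

```python
def conv1d(seq, kernel):
    """Valid-mode 1-D convolution (really cross-correlation) with one filter."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

def relu(xs):
    """ReLU activation: clamp negative responses to zero."""
    return [max(0.0, x) for x in xs]

def max_pool(xs):
    """Global max pooling over the feature map."""
    return max(xs)

seq = [0.5, -1.0, 2.0, 0.0]   # stand-in for one channel of word vectors
kernel = [1.0, -1.0]          # one convolution filter
feature = max_pool(relu(conv1d(seq, kernel)))
print(feature)
```

Max pooling makes the extracted feature invariant to where in the sequence the filter responded most strongly, which is why it is the usual choice for text CNNs.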
6. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the inputting the feature vector into a language model and classifying and analyzing text information based on the language model comprises:
constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer;
training the initial language model to obtain a trained language model;
and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
7. The intelligent content recognition and analysis method based on NLP and AI of claim 6, wherein the training process of the language model comprises:
collecting text information data, and labeling the text information data to obtain labeled text information data;
performing text vectorization processing on the text information data with the labels to obtain feature vectors with the labels;
importing the feature vector with the label into an initial language model;
dividing the feature vector with the label in the initial language model into a training set, a verification set and a test set;
training the initial language model by using the training set, verifying the accuracy of the model by using the verification set after each round of training, and adjusting model parameters according to the verification set results to obtain a trained language model;
testing the trained language model by using a test set, and verifying generalization capability and accuracy of the trained language model;
and optimizing the model based on the result of the test set.
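The data split in the claim-7 training process can be sketched as below. An 80/10/10 split is assumed here for illustration; the claim itself does not fix the ratios.

```python
import random

def split(data, train=0.8, val=0.1, seed=42):
    """Shuffle labelled samples and divide them into train/validation/test sets."""
    data = data[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(data)
    n = len(data)
    n_train, n_val = int(n * train), int(n * val)
    return (data[:n_train],
            data[n_train:n_train + n_val],
            data[n_train + n_val:])

# Hypothetical (feature vector, label) pairs.
samples = [([i, i + 1], i % 2) for i in range(10)]
tr, va, te = split(samples)
print(len(tr), len(va), len(te))
```

Fixing the shuffle seed keeps the split reproducible across training runs, which matters when validation results drive parameter adjustment as in the claim.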
8. An intelligent content recognition and analysis device based on NLP and AI, the device comprising:
and the optimization processing module is used for: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
and a character cutting module: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
and a character recognition module: performing recognition processing on the independent character image based on a character recognition algorithm for binarized images to obtain text information, comprising: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot-matrix features, feature lines and grid features; placing the binarized independent character image in a coordinate system, and splicing pixel values of the binarized independent character image in the coordinate system to obtain the dot-matrix features; recombining two-dimensional feature vectors generated from the dot-matrix features to obtain the feature lines; performing structure partitioning processing on the dot-matrix features to obtain the grid features; inputting the feature information of the binarized independent character image into a BP neural network for character classification processing to obtain classified characters; and performing fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm, which comprises: grouping the classified characters based on a clustering method, inputting the character images into the corresponding nodes of a tree structure for coarse classification, integrating the coarsely classified character images based on a probabilistic averaging function, and finely classifying the integrated character images by using the sub-nodes of the coarse classification result to obtain the text information;
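Two of the feature types named in the character recognition module can be sketched concretely: the dot-matrix feature, read here as the binarized character's pixel lattice spliced in coordinate order, and the grid feature, read as ink counts per cell after partitioning that lattice. Both readings are assumptions for illustration; the claim does not give exact formulas.

```python
def dot_matrix(binary):
    """Splice pixel values row by row into one dot-matrix feature vector."""
    return [px for row in binary for px in row]

def grid_feature(binary, cells=2):
    """Partition the image into cells x cells blocks and count ink per block."""
    h, w = len(binary), len(binary[0])
    counts = []
    for cy in range(cells):
        for cx in range(cells):
            counts.append(sum(
                binary[y][x]
                for y in range(cy * h // cells, (cy + 1) * h // cells)
                for x in range(cx * w // cells, (cx + 1) * w // cells)))
    return counts

# Toy 2x2 binarized character (1 = ink).
char = [[1, 0],
        [0, 1]]
print(dot_matrix(char), grid_feature(char))
```

Either vector could then feed the BP neural network's input layer; the grid feature trades spatial detail for robustness to small stroke displacements.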
and a preprocessing module: performing data preprocessing on the text information to obtain preprocessed text information;
and the feature extraction module is used for: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain a feature vector, wherein the performing text vectorization processing on the preprocessed text information based on the feature extraction method to obtain the feature vector comprises: performing numerical processing on the preprocessed text information by adopting an Embedding algorithm; mapping each word in the preprocessed text information into a word vector space based on a one-hot function to obtain an original word vector; performing dimension reduction processing on the original word vector to obtain a word vector; and inputting the word vector into a CNN model for feature extraction to obtain the feature vector;
classification and analysis module: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310726304.5A CN116912845B (en) | 2023-06-16 | 2023-06-16 | Intelligent content identification and analysis method and device based on NLP and AI |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116912845A CN116912845A (en) | 2023-10-20 |
CN116912845B true CN116912845B (en) | 2024-03-19 |
Family
ID=88359159
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310726304.5A Active CN116912845B (en) | 2023-06-16 | 2023-06-16 | Intelligent content identification and analysis method and device based on NLP and AI |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116912845B (en) |
Citations (46)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101329731A (en) * | 2008-06-06 | 2008-12-24 | 南开大学 | Automatic recognition method of mathematical formulas in images |
CN101329734A (en) * | 2008-07-31 | 2008-12-24 | 重庆大学 | License plate character recognition method based on K-L transform and LS-SVM |
CN102024144A (en) * | 2010-11-23 | 2011-04-20 | 上海海事大学 | Container number identification method |
CN102509091A (en) * | 2011-11-29 | 2012-06-20 | 北京航空航天大学 | Airplane tail number recognition method |
US8699796B1 (en) * | 2008-11-11 | 2014-04-15 | Trend Micro Incorporated | Identifying sensitive expressions in images for languages with large alphabets |
CA2920795A1 (en) * | 2014-02-07 | 2015-08-13 | Cellular South, Inc Dba C Spire Wire Wireless | Video to data |
CN108009228A (en) * | 2017-11-27 | 2018-05-08 | 咪咕互动娱乐有限公司 | A kind of method to set up of content tab, device and storage medium |
WO2018086519A1 (en) * | 2016-11-08 | 2018-05-17 | 北京国双科技有限公司 | Method and device for identifying specific text information |
CN108564079A (en) * | 2018-05-08 | 2018-09-21 | 东华大学 | A kind of portable character recognition device and method |
CN109255113A (en) * | 2018-09-04 | 2019-01-22 | 郑州信大壹密科技有限公司 | Intelligent critique system |
CN109359293A (en) * | 2018-09-13 | 2019-02-19 | 内蒙古大学 | Mongolian name entity recognition method neural network based and its identifying system |
CN109376658A (en) * | 2018-10-26 | 2019-02-22 | 信雅达系统工程股份有限公司 | A kind of OCR method based on deep learning |
CN109460735A (en) * | 2018-11-09 | 2019-03-12 | 中国科学院自动化研究所 | Document binary processing method, system, device based on figure semi-supervised learning |
CN110059694A (en) * | 2019-04-19 | 2019-07-26 | 山东大学 | The intelligent identification Method of lteral data under power industry complex scene |
CN110297907A (en) * | 2019-06-28 | 2019-10-01 | 谭浩 | Generate method, computer readable storage medium and the terminal device of interview report |
CN110298376A (en) * | 2019-05-16 | 2019-10-01 | 西安电子科技大学 | A kind of bank money image classification method based on improvement B-CNN |
WO2019195736A2 (en) * | 2018-04-05 | 2019-10-10 | Chevron U.S.A. Inc. | Classification of piping and instrumental diagram information using machine-learning |
CN110363194A (en) * | 2019-06-17 | 2019-10-22 | 深圳壹账通智能科技有限公司 | Intelligently reading method, apparatus, equipment and storage medium based on NLP |
CN110457466A (en) * | 2019-06-28 | 2019-11-15 | 谭浩 | Generate method, computer readable storage medium and the terminal device of interview report |
CN110889402A (en) * | 2019-11-04 | 2020-03-17 | 广州丰石科技有限公司 | Business license content identification method and system based on deep learning |
CN111046946A (en) * | 2019-12-10 | 2020-04-21 | 昆明理工大学 | Burma language image text recognition method based on CRNN |
WO2020140386A1 (en) * | 2019-01-02 | 2020-07-09 | 平安科技(深圳)有限公司 | Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium |
WO2020147393A1 (en) * | 2019-01-17 | 2020-07-23 | 平安科技(深圳)有限公司 | Convolutional neural network-based text classification method, and related device |
CN112085024A (en) * | 2020-09-21 | 2020-12-15 | 江苏理工学院 | Tank surface character recognition method |
WO2020253043A1 (en) * | 2019-06-20 | 2020-12-24 | 平安科技(深圳)有限公司 | Intelligent text classification method and apparatus, and computer-readable storage medium |
CN112329779A (en) * | 2020-11-02 | 2021-02-05 | 平安科技(深圳)有限公司 | Method and related device for improving certificate identification accuracy based on mask |
WO2021025290A1 (en) * | 2019-08-06 | 2021-02-11 | Samsung Electronics Co., Ltd. | Method and electronic device for converting handwriting input to text |
EP3812950A1 (en) * | 2019-10-23 | 2021-04-28 | Tata Consultancy Services Limited | Method and system for creating an intelligent cartoon comic strip based on dynamic content |
CN112883980A (en) * | 2021-04-28 | 2021-06-01 | 明品云(北京)数据科技有限公司 | Data processing method and system |
WO2021137166A1 (en) * | 2019-12-30 | 2021-07-08 | L&T Technology Services Limited | Domain based text extraction |
CN113158808A (en) * | 2021-03-24 | 2021-07-23 | 华南理工大学 | Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction |
CN113191358A (en) * | 2021-05-31 | 2021-07-30 | 上海交通大学 | Metal part surface text detection method and system |
CN113361666A (en) * | 2021-06-15 | 2021-09-07 | 浪潮金融信息技术有限公司 | Handwritten character recognition method, system and medium |
CN113569833A (en) * | 2021-07-27 | 2021-10-29 | 平安科技(深圳)有限公司 | Text document-based character recognition method, device, equipment and storage medium |
CN113609292A (en) * | 2021-08-09 | 2021-11-05 | 上海交通大学 | Known false news intelligent detection method based on graph structure |
CN114265937A (en) * | 2021-12-24 | 2022-04-01 | 中国电力科学研究院有限公司 | Intelligent classification analysis method and system of scientific and technological information, storage medium and server |
CN114973228A (en) * | 2022-05-31 | 2022-08-30 | 上海交通大学 | Metal part surface text recognition method and system based on contour feature enhancement |
WO2022178919A1 (en) * | 2021-02-23 | 2022-09-01 | 西安交通大学 | Taxpayer industry classification method based on noise label learning |
CN115114908A (en) * | 2022-07-04 | 2022-09-27 | 上海交通大学 | Text intention recognition method and system based on attention mechanism, vehicle and equipment |
CN115471851A (en) * | 2022-10-11 | 2022-12-13 | 小语智能信息科技(云南)有限公司 | Burma language image text recognition method and device fused with double attention mechanism |
CN115546796A (en) * | 2022-09-22 | 2022-12-30 | 中电智元数据科技有限公司 | Non-contact data acquisition method and system based on visual computation |
US11557323B1 (en) * | 2022-03-15 | 2023-01-17 | My Job Matcher, Inc. | Apparatuses and methods for selectively inserting text into a video resume |
CN115731550A (en) * | 2022-11-23 | 2023-03-03 | 大连医科大学附属第二医院 | Deep learning-based automatic drug specification identification method and system and storage medium |
CN115862045A (en) * | 2023-02-16 | 2023-03-28 | 中国人民解放军总医院第一医学中心 | Case automatic identification method, system, equipment and storage medium based on image-text identification technology |
CN115880566A (en) * | 2022-12-16 | 2023-03-31 | 李宜义 | Intelligent marking system based on visual analysis |
CN115953788A (en) * | 2022-12-02 | 2023-04-11 | 兴业银行股份有限公司 | Green financial attribute intelligent identification method and system based on OCR (optical character recognition) and NLP (natural language processing) technologies |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10218954B2 (en) * | 2013-08-15 | 2019-02-26 | Cellular South, Inc. | Video to data |
US10372816B2 (en) * | 2016-12-13 | 2019-08-06 | International Business Machines Corporation | Preprocessing of string inputs in natural language processing |
US11295123B2 (en) * | 2017-09-14 | 2022-04-05 | Chevron U.S.A. Inc. | Classification of character strings using machine-learning |
US11218500B2 (en) * | 2019-07-31 | 2022-01-04 | Secureworks Corp. | Methods and systems for automated parsing and identification of textual data |
US11651157B2 (en) * | 2020-07-29 | 2023-05-16 | Descript, Inc. | Filler word detection through tokenizing and labeling of transcripts |
US11790025B2 (en) * | 2021-03-30 | 2023-10-17 | Streem, Llc | Extracting data and metadata from identification labels using natural language processing |
Non-Patent Citations (2)
Title |
---|
Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification; Wang, Peng et al.; Neurocomputing; 2016-12-31; vol. 174; pp. 806-814 *
Research and implementation of a hierarchical recognition algorithm for offline handwritten Chinese characters; 苟晓攀; China Master's Theses Full-text Database, Information Science and Technology; 2023-01-15; I138-2184 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108664996B (en) | Ancient character recognition method and system based on deep learning | |
CN111126386B (en) | Sequence domain adaptation method based on countermeasure learning in scene text recognition | |
CN110503103B (en) | Character segmentation method in text line based on full convolution neural network | |
CN111738169B (en) | Handwriting formula recognition method based on end-to-end network model | |
US20200134382A1 (en) | Neural network training utilizing specialized loss functions | |
CN112418320A (en) | Enterprise association relation identification method and device and storage medium | |
CN111461121A (en) | Electric meter number identification method based on YOLOv3 network | |
CN113569048A (en) | Method and system for automatically dividing affiliated industries based on enterprise operation range | |
CN116912845B (en) | Intelligent content identification and analysis method and device based on NLP and AI | |
CN116543391A (en) | Text data acquisition system and method combined with image correction | |
CN115100509B (en) | Image identification method and system based on multi-branch block-level attention enhancement network | |
CN111401434A (en) | Image classification method based on unsupervised feature learning | |
US11715288B2 (en) | Optical character recognition using specialized confidence functions | |
Nath et al. | Improving various offline techniques used for handwritten character recognition: a review | |
CN112766082B (en) | Chinese text handwriting identification method and device based on macro-micro characteristics and storage medium | |
CN114758340A (en) | Intelligent identification method, device and equipment for logistics address and storage medium | |
CN112434145A (en) | Picture-viewing poetry method based on image recognition and natural language processing | |
CN113094567A (en) | Malicious complaint identification method and system based on text clustering | |
CN113033567A (en) | Oracle bone rubbing image character extraction method fusing segmentation network and generation network | |
CN112926670A (en) | Garbage classification system and method based on transfer learning | |
CN117371533B (en) | Method and device for generating data tag rule | |
CN117496531B (en) | Construction method of convolution self-encoder capable of reducing Chinese character recognition resource overhead | |
CN112819205B (en) | Method, device and system for predicting working hours | |
CN114548325B (en) | Zero sample relation extraction method and system based on dual contrast learning | |
Alginahi | Computer analysis of composite documents with non-uniform background. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |