CN116912845B - Intelligent content identification and analysis method and device based on NLP and AI - Google Patents

Intelligent content identification and analysis method and device based on NLP and AI

Info

Publication number
CN116912845B
Authority
CN
China
Prior art keywords
image
processing
text information
feature
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310726304.5A
Other languages
Chinese (zh)
Other versions
CN116912845A (en)
Inventor
杜家兵
王晶
宋才华
吴丽贤
皇甫汉聪
关兆雄
陈旭宇
庞伟林
庞维欣
李仰杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Original Assignee
Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority to CN202310726304.5A
Publication of CN116912845A
Application granted
Publication of CN116912845B
Legal status: Active


Classifications

    • G06V30/19173 Classification techniques (character recognition using electronic means)
    • G06F16/353 Clustering; classification into predefined classes (information retrieval of unstructured textual data)
    • G06F40/12 Use of codes for handling textual entities (text processing)
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking (natural language analysis)
    • G06N3/045 Combinations of networks (neural network architectures)
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G06V30/1607 Correcting image deformation, e.g. trapezoidal deformation caused by perspective
    • G06V30/162 Quantising the image signal
    • G06V30/164 Noise filtering
    • G06V30/168 Smoothing or thinning of the pattern; skeletonisation
    • G06V30/1801 Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an intelligent content identification and analysis method and device based on NLP and AI. The method comprises: acquiring an original file image, and optimizing the file image by using image enhancement processing technology to obtain an optimized file image; processing the optimized file image based on an image character cutting method to obtain independent character images; recognizing the independent character images based on a binary-image character recognition algorithm to obtain text information; performing data preprocessing on the text information to obtain preprocessed text information; performing text vectorization on the preprocessed text information based on a feature extraction method to obtain feature vectors; and inputting the feature vectors into a language model, which classifies and parses the text information. By adopting NLP and AI technologies, the invention realizes intelligent content identification and analysis, provides rich information input, and offers big data support for the intelligent operation of enterprises.

Description

Intelligent content identification and analysis method and device based on NLP and AI
Technical Field
The invention relates to the technical field of natural language processing and artificial intelligence, in particular to an intelligent content identification and analysis method and device based on NLP and AI.
Background
AI technology is currently developing rapidly and AI products are widely used in people's daily lives. In the development of AI, one technology plays an indispensable role: NLP. NLP is a very important direction in the field of artificial intelligence today; its purpose is to achieve efficient communication between humans and computer programs using natural language. At present, traditional service systems adopt a coarse management mode in which the whole electronic file is the unit of management. To change this mode, NLP and AI technologies are adopted to realize intelligent content identification and analysis and to provide rich information input for various business systems, thereby providing big data support for the intelligent operation of enterprises.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an intelligent content identification and analysis method and device based on NLP and AI, which adopt NLP and AI technologies to realize intelligent content identification and analysis and provide rich information input for various business systems, thereby providing big data support for the intelligent operation of enterprises.
In order to solve the technical problems, an embodiment of the present invention provides an intelligent content identification and analysis method based on NLP and AI, the method comprising:
Acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
performing recognition processing on the independent character images based on a binary-image character recognition algorithm to obtain text information;
carrying out data preprocessing on the text information to obtain preprocessed text information;
performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
Optionally, the acquiring an original file image and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image includes:
acquiring an original file image, and denoising the original file image by using a median filtering method to obtain a denoised file image;
performing image edge enhancement on the denoised file image based on a Hilbert transform method to obtain an edge-enhanced file image;
performing curvature correction on the edge-enhanced file image based on an offset field method to obtain a curvature-corrected file image;
and performing handwriting erasure on the curvature-corrected file image based on a clustering-based handwriting erasure method to obtain the optimized file image.
Optionally, the processing the optimized file image based on an image character cutting method to obtain independent character images includes:
performing binarization processing on the optimized file image to obtain a binarized image;
thinning the binarized image based on a thinning algorithm to obtain a character skeleton;
and adaptively cutting the optimized file image by using an adaptive cutting method based on the character skeleton to obtain independent character images.
Optionally, the performing recognition processing on the independent character images based on a binary-image character recognition algorithm to obtain text information includes:
performing binarization processing on the independent character image to obtain a binarized independent character image;
extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot matrix features, feature lines and grid features;
inputting the feature information of the binarized independent character image into a BP neural network for character classification processing to obtain classified characters;
and performing fine classification recognition on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Optionally, the performing data preprocessing on the text information to obtain preprocessed text information includes:
performing corpus cleaning on the text information to obtain corpus-cleaned text information;
performing word segmentation on the corpus-cleaned text information to obtain word-segmented text information;
and performing part-of-speech tagging on the word-segmented text information to obtain part-of-speech-tagged text information.
Optionally, the performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors includes:
converting the preprocessed text information into numerical form by using an embedding algorithm to obtain word vectors;
and inputting the word vectors into a CNN model for feature extraction to obtain feature vectors.
Optionally, the structure of the CNN model includes:
an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer receives the word vectors; the convolution layer convolves the input word vectors with filters to obtain convolved word vectors, the activation function used being the ReLU function; the pooling layer pools the convolved word vectors and applies dropout regularization to obtain feature vectors; and the output layer outputs the extracted feature vectors.
Optionally, the inputting the feature vectors into a language model and performing classification and parsing of the text information based on the language model includes:
constructing a TextCNN model as an initial language model, wherein the network structure of the TextCNN model comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer;
training the initial language model to obtain a trained language model;
and inputting the feature vectors into the trained language model to classify and parse the text information.
Optionally, the training process of the language model includes:
collecting text information data, and labeling the text information data to obtain labeled text information data;
performing text vectorization on the labeled text information data to obtain labeled feature vectors;
importing the labeled feature vectors into the initial language model;
dividing the labeled feature vectors into a training set, a validation set and a test set;
training the initial language model with the training set, verifying the accuracy of the model with the validation set after each training round, and adjusting the model parameters according to the validation results to obtain a trained language model;
testing the trained language model with the test set to verify its generalization capability and accuracy;
and optimizing the model based on the test-set results.
In addition, an embodiment of the invention further provides an intelligent content recognition and analysis device based on NLP and AI, the device comprising:
an optimization processing module, configured to acquire an original file image and perform optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
a character cutting module, configured to process the optimized file image based on an image character cutting method to obtain independent character images;
a character recognition module, configured to perform recognition processing on the independent character images based on a binary-image character recognition algorithm to obtain text information;
a preprocessing module, configured to perform data preprocessing on the text information to obtain preprocessed text information;
a feature extraction module, configured to perform text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
and a classification and parsing module, configured to input the feature vectors into a language model and perform classification and parsing of the text information based on the language model.
In the embodiments of the invention, NLP and AI technologies are adopted so that text content recognition achieves better accuracy and errors caused by content recognition are greatly reduced. Based on a deep learning method, intelligent analysis and application of massive data content is realized in a short time through three stages: corpus preprocessing, model design and model training. Data quality is improved, manual entry work and business processes are reduced, work efficiency is increased, labor input is reduced, and rich information input is provided for various business systems, thereby providing big data support for the intelligent operation of enterprises.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings which are required in the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the description below are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow diagram of an intelligent content recognition and analysis method based on NLP and AI;
Fig. 2 is a schematic structural diagram of an intelligent content recognition and analysis device based on NLP and AI.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, fig. 1 is a flow chart of an intelligent content recognition and analysis method based on NLP and AI in an embodiment of the invention.
As shown in fig. 1, an intelligent content identification and analysis method based on NLP and AI comprises:
S11: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
In the implementation of the invention, the acquiring an original file image and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image includes: acquiring an original file image, and denoising the original file image by using a median filtering method to obtain a denoised file image; performing image edge enhancement on the denoised file image based on a Hilbert transform method to obtain an edge-enhanced file image; performing curvature correction on the edge-enhanced file image based on an offset field method to obtain a curvature-corrected file image; and performing handwriting erasure on the curvature-corrected file image based on a clustering-based handwriting erasure method to obtain the optimized file image.
Specifically, an original file image is acquired and denoised by a median filtering method: the value of each point in the image is replaced by the median of the values in a neighborhood of that point, so that surrounding pixel values approach the true value and isolated noise points are eliminated, yielding a denoised file image. Edge enhancement is then applied to the denoised file image based on a Hilbert transform method: the object field of the file image is first modulated in the frequency domain and Fourier-transformed; the imaginary part of the transformed result is taken, its absolute value computed, the 1/n power of the sum along the frequency axis calculated, and edge enhancement performed along the vertical and horizontal axes respectively, giving an edge-enhanced file image. Curvature correction is then applied based on an offset field method: an offset field is produced by a deformation correction network and used to displace each pixel of the file image correspondingly, giving a curvature-corrected file image. Handwriting erasure is then applied based on a clustering method: data closest to the mean of the image centroid are repeatedly grouped so that similar objects are automatically assigned to the same cluster, the loop is executed until clustering is complete, and the parts of the image requiring handwriting erasure are then corrected, giving the optimized file image. Processing the image with image enhancement techniques strengthens the useful information in the image and improves its quality for further processing.
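As an illustration of the denoising and edge-emphasis steps above, the following is a minimal Python sketch assuming OpenCV and SciPy are available; the function name denoise_and_enhance, the kernel size and the blending weight are illustrative choices and not part of the patent.

```python
# Hypothetical sketch of the denoising and edge-emphasis steps described above.
import cv2
import numpy as np
from scipy.signal import hilbert

def denoise_and_enhance(path: str) -> np.ndarray:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)    # original document (file) image
    denoised = cv2.medianBlur(gray, 3)               # median filter removes isolated noise points

    # Edge emphasis via the Hilbert transform: take the imaginary part of the
    # analytic signal along each axis and combine the absolute responses.
    f = denoised.astype(np.float64)
    edges_h = np.abs(np.imag(hilbert(f, axis=1)))    # horizontal-axis response
    edges_v = np.abs(np.imag(hilbert(f, axis=0)))    # vertical-axis response
    edge_map = cv2.normalize(edges_h + edges_v, None, 0, 255, cv2.NORM_MINMAX)

    # Blend the edge map back into the denoised image to sharpen character strokes.
    enhanced = cv2.addWeighted(f, 1.0, edge_map, 0.3, 0)
    return np.clip(enhanced, 0, 255).astype(np.uint8)
```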
S12: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
In the implementation of the invention, the processing the optimized file image based on an image character cutting method to obtain independent character images includes: performing binarization processing on the optimized file image to obtain a binarized image; thinning the binarized image based on a thinning algorithm to obtain a character skeleton; and adaptively cutting the optimized file image by using an adaptive cutting method based on the character skeleton to obtain independent character images.
Specifically, the optimized file image is binarized: the image is converted to grayscale and, for the 256-level grayscale image, a suitable threshold is selected (for example, half of the gray-level range or the mean value of the image), giving a binarized image that still reflects the overall and local characteristics of the image and is convenient for subsequent thinning. The binarized image is then thinned: a neighborhood template is set, and for each pixel it is judged whether the following conditions are met: first, the number of black neighbors of the central pixel lies in [2,6]; second, deleting the pixel does not affect the connectivity of the image; third, deleting the pixel does not break a horizontal line; and fourth, deleting the pixel does not break a vertical line. A pixel is deleted when all four conditions are met, and the deletion is repeated until no pixel satisfies the conditions, yielding the character skeleton. After the character skeleton is obtained by the thinning algorithm, intersection points are located on the skeleton, an optimal interval is set, the character image is superimposed on the original image according to its positional relationship, and thickening is used to obtain a dividing line; the distance from each intersection point to the dividing line is judged, a tolerance is defined for this distance, intersection points within the tolerance are regarded as necessary and unnecessary intersection points are removed, and the cutting operation is performed to obtain independent character images. Cutting the image prepares for convenient subsequent character recognition, so that the recognition processing can be performed more quickly.
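A minimal sketch of the binarization and thinning steps is shown below, assuming scikit-image for skeletonization and dark ink on a light background; the column-projection cut at the end is a simplified stand-in for the patent's intersection-point-based adaptive cutting.

```python
# Hypothetical sketch: binarization, skeletonization, and a simplified character cut.
import numpy as np
from skimage.morphology import skeletonize

def binarize_and_cut(gray: np.ndarray) -> list[np.ndarray]:
    threshold = gray.mean()                        # e.g. the image mean, as described above
    binary = (gray < threshold).astype(np.uint8)   # foreground (ink) = 1
    skeleton = skeletonize(binary.astype(bool))    # character skeleton via thinning

    # Simplified segmentation: split at columns containing no skeleton pixels.
    col_has_ink = skeleton.any(axis=0)
    chars, start = [], None
    for x, has_ink in enumerate(col_has_ink):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            chars.append(gray[:, start:x])         # one independent character image
            start = None
    if start is not None:
        chars.append(gray[:, start:])
    return chars
```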
S13: performing recognition processing on the independent character images based on a binary-image character recognition algorithm to obtain text information;
In the implementation of the invention, the performing recognition processing on the independent character images based on a binary-image character recognition algorithm to obtain text information includes: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot matrix features, feature lines and grid features; inputting the feature information of the binarized independent character image into a BP neural network for character classification processing to obtain classified characters; and performing fine classification recognition on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Specifically, the independent character image is binarized: it is converted to grayscale and, for the 256-level grayscale image, a suitable threshold is selected (for example, half of the gray-level range or the mean value of the image), giving a binarized independent character image that still reflects the overall and local characteristics of the image; feature extraction can then proceed. Dot matrix features, feature lines and grid features are extracted from the binarized independent character image: the image is placed in a coordinate system and the value of each point is determined by the presence or absence of a pixel, and every eight pixel values are packed into a byte to obtain the dot matrix features; the numbers of line segments in even rows and columns are counted to form a two-dimensional feature vector, and the dot matrix is recombined to obtain the feature lines; the dot matrix structure is partitioned and the number of foreground pixels in each partition is counted as a statistical feature to obtain the grid features. The dot matrix features reflect the overall characteristics of the character image, while the feature lines and grid features reflect its local characteristics, so the features are complementary. A BP neural network with an input layer, a hidden layer and an output layer is trained on samples using a log-sigmoid transfer function with a maximum number of iterations set; the feature information of the binarized independent character image is input into the trained BP neural network for character classification, giving classified characters. The classified characters are then finely recognized with a hierarchical recognition algorithm: they are first grouped by clustering, the classified character images are input into the corresponding nodes of a tree structure to complete coarse classification, a probabilistic averaging function is used for integration, and the child nodes of the coarse classification results are further classified to obtain the recognition results and the text information; the hierarchical recognition algorithm achieves better recognition accuracy.
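The grid-feature extraction and the neural-network classification stage could look roughly like the sketch below; it assumes scikit-learn, uses an MLPClassifier with a logistic (log-sigmoid) hidden activation as a stand-in for the BP network, and the grid size and layer width are illustrative.

```python
# Hypothetical sketch: grid features plus an MLP classifier standing in for the BP network.
import numpy as np
from sklearn.neural_network import MLPClassifier

def grid_features(char_img: np.ndarray, grid: int = 8) -> np.ndarray:
    """Partition the binarized character into grid x grid cells and count ink pixels per cell."""
    binary = (char_img < char_img.mean()).astype(np.uint8)
    h, w = binary.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            cell = binary[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            cells.append(cell.sum())
    return np.asarray(cells, dtype=np.float32)

# X: stacked feature vectors, y: character labels (illustrative placeholders)
# clf = MLPClassifier(hidden_layer_sizes=(128,), activation="logistic", max_iter=500)
# clf.fit(X, y)
# coarse_labels = clf.predict(X_new)   # fine, hierarchical recognition would follow
```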
S14: carrying out data preprocessing on the text information to obtain preprocessed text information;
In the implementation of the invention, the performing data preprocessing on the text information to obtain preprocessed text information includes: performing corpus cleaning on the text information to obtain corpus-cleaned text information; performing word segmentation on the corpus-cleaned text information to obtain word-segmented text information; and performing part-of-speech tagging on the word-segmented text information to obtain part-of-speech-tagged text information.
Specifically, corpus cleaning is performed on the text information: regular-expression matching rules are used to match character strings in the text against the rules, and special characters, duplicate data and stop words are removed, giving corpus-cleaned text information. Word segmentation is then performed: the reliability of a word formation is reflected by how often characters occur adjacently, the frequency of character combinations that commonly appear adjacently in the corpus is counted, and when the combination frequency exceeds a threshold the combination is considered to form a word, giving word-segmented text information. Part-of-speech tagging is then performed, treated as sequence labeling: given a sequence of units, a tag is assigned to each unit, the probability distribution over possible tag sequences is computed, the best tag sequence is selected, and the most probable part of speech of each word is determined and tagged, giving part-of-speech-tagged text information. Preprocessing makes the text information cleaner, more accurate and more reliable.
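A minimal preprocessing sketch follows, under the assumption that the jieba library is used for Chinese word segmentation and part-of-speech tagging; the regular expression and the stop-word list are illustrative only.

```python
# Hypothetical preprocessing sketch: corpus cleaning, segmentation and POS tagging.
import re
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "和"}          # illustrative stop words only

def preprocess(text: str) -> list[tuple[str, str]]:
    # Corpus cleaning: keep Chinese characters, letters and digits; drop the rest.
    cleaned = re.sub(r"[^\u4e00-\u9fa5A-Za-z0-9]", "", text)
    # Word segmentation and part-of-speech tagging in one pass.
    return [(p.word, p.flag) for p in pseg.cut(cleaned) if p.word not in STOP_WORDS]

print(preprocess("文本信息的智能识别与分析！"))
```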
S15: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
In the implementation of the invention, the performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors includes: converting the preprocessed text information into numerical form by using an embedding algorithm to obtain word vectors; and inputting the word vectors into a CNN model for feature extraction to obtain feature vectors.
Specifically, an embedding algorithm is used to convert the preprocessed text information into numerical form: each word of the preprocessed text is treated as a feature column through a one-hot function and mapped into a word-vector space to obtain raw word vectors, whose dimensionality is then reduced to obtain word vectors that can be fed to the input layer of the CNN model. The CNN model consists of an input layer, a convolution layer, a pooling layer and an output layer: the input layer receives the word vectors; the convolution layer convolves the input word vectors with filters, using the ReLU activation function, to obtain convolved word vectors; the pooling layer pools the convolved word vectors and applies dropout regularization to obtain feature vectors; and the output layer outputs the extracted feature vectors. In this way, one-dimensional text information is converted into a two-dimensional input vector that meets the input requirements of the model.
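A possible Keras sketch of the embedding plus CNN feature extractor described above; the vocabulary size, sequence length, embedding dimension and filter count are illustrative assumptions rather than values given in the patent.

```python
# Hypothetical embedding + CNN feature extractor.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, SEQ_LEN, EMBED_DIM = 20000, 128, 100

inputs = tf.keras.Input(shape=(SEQ_LEN,))                 # token ids for one document
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)       # numerical word vectors
x = layers.Conv1D(128, kernel_size=3, activation="relu")(x)  # convolution layer with ReLU
x = layers.GlobalMaxPooling1D()(x)                        # pooling layer
x = layers.Dropout(0.5)(x)                                # dropout regularization
feature_extractor = tf.keras.Model(inputs, x)

# feature_vector = feature_extractor(token_ids)           # shape (batch, 128)
```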
S16: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the implementation of the invention, the inputting the feature vectors into a language model and performing classification and parsing of the text information based on the language model includes: constructing a TextCNN model as the initial language model, wherein the network structure of the TextCNN model comprises an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer; training the initial language model to obtain a trained language model; and inputting the feature vectors into the trained language model to classify and parse the text information.
Specifically, the language model structure is as follows: the input layer receives the feature vectors; the convolution layer convolves the input feature vectors, using the ReLU activation function, and contains several convolution kernels that slide in only one direction; the pooling layer applies max-pooling to the convolved feature vectors, likewise in only one direction; the fully connected layer concatenates the pooled feature vectors, with dropout regularization added to prevent overfitting; and the output layer uses a softmax activation function to compress the vector to the number of categories and obtain the probability of the text belonging to each category, giving the text semantic information. The TextCNN model has the advantages of a relatively simple structure and fast training while achieving good results.
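The TextCNN structure described above could be sketched in Keras as follows; the kernel sizes, filter counts and number of classes are illustrative assumptions.

```python
# Hypothetical TextCNN classifier matching the structure described above.
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN, EMBED_DIM, NUM_CLASSES = 128, 100, 10

inputs = tf.keras.Input(shape=(SEQ_LEN, EMBED_DIM))            # feature vectors from the previous stage
branches = []
for k in (3, 4, 5):                                            # several kernel sizes, sliding in one direction
    x = layers.Conv1D(100, kernel_size=k, activation="relu")(inputs)
    branches.append(layers.GlobalMaxPooling1D()(x))            # max-pooling per branch
merged = layers.Concatenate()(branches)                        # input to the fully connected stage
merged = layers.Dropout(0.5)(merged)                           # dropout against overfitting
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(merged)  # class probabilities

textcnn = tf.keras.Model(inputs, outputs)
textcnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```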
Specifically, the training process of the language model includes: collecting text information data and labeling it to obtain labeled text information data; performing text vectorization on the labeled text information data to obtain labeled feature vectors; importing the labeled feature vectors into the initial language model; dividing the labeled feature vectors into a training set, a validation set and a test set; training the initial language model with the training set, verifying the accuracy of the model with the validation set after each training round, and adjusting the model parameters according to the validation results to obtain a trained language model; testing the trained language model with the test set to verify its generalization capability and accuracy; and optimizing the model based on the test-set results. Optimizing the model includes, for the divided data set, computing and weighting each layer in turn from the input layer during forward propagation and obtaining a preliminary output through a nonlinear activation function; computing an error value with the loss function and back-propagating according to that error; and, at the end of each batch, updating the model parameters with the optimizer.
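A hedged sketch of this training procedure, reusing the textcnn model from the previous sketch; X and y stand for the labeled feature vectors and labels, and the split ratios, epoch count and batch size are illustrative.

```python
# Hypothetical train / validation / test workflow for the language model.
from sklearn.model_selection import train_test_split

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

history = textcnn.fit(X_train, y_train,
                      validation_data=(X_val, y_val),    # accuracy checked after each epoch
                      epochs=10, batch_size=32)           # optimizer updates parameters per batch

test_loss, test_acc = textcnn.evaluate(X_test, y_test)    # generalization check on the test set
```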
In the embodiments of the invention, NLP and AI technologies are adopted so that text content recognition achieves better accuracy and errors caused by content recognition are greatly reduced. Based on a deep learning method, intelligent analysis and application of massive data content is realized in a short time through three stages: corpus preprocessing, model design and model training. Data quality is improved, manual entry work and business processes are reduced, work efficiency is increased, labor input is reduced, and rich information input is provided for various business systems, thereby providing big data support for the intelligent operation of enterprises.
Example 2
Referring to fig. 2, fig. 2 is a schematic structural diagram of an intelligent content recognition and analysis device based on NLP and AI in an embodiment of the invention.
As shown in fig. 2, an intelligent content recognition and analysis device based on NLP and AI comprises:
Optimization processing module 21: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
In the implementation of the invention, the acquiring an original file image and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image includes: acquiring an original file image, and denoising the original file image by using a median filtering method to obtain a denoised file image; performing image edge enhancement on the denoised file image based on a Hilbert transform method to obtain an edge-enhanced file image; performing curvature correction on the edge-enhanced file image based on an offset field method to obtain a curvature-corrected file image; and performing handwriting erasure on the curvature-corrected file image based on a clustering-based handwriting erasure method to obtain the optimized file image.
Specifically, an original file image is acquired and denoised by a median filtering method: the value of each point in the image is replaced by the median of the values in a neighborhood of that point, so that surrounding pixel values approach the true value and isolated noise points are eliminated, yielding a denoised file image. Edge enhancement is then applied to the denoised file image based on a Hilbert transform method: the object field of the file image is first modulated in the frequency domain and Fourier-transformed; the imaginary part of the transformed result is taken, its absolute value computed, the 1/n power of the sum along the frequency axis calculated, and edge enhancement performed along the vertical and horizontal axes respectively, giving an edge-enhanced file image. Curvature correction is then applied based on an offset field method: an offset field is produced by a deformation correction network and used to displace each pixel of the file image correspondingly, giving a curvature-corrected file image. Handwriting erasure is then applied based on a clustering method: data closest to the mean of the image centroid are repeatedly grouped so that similar objects are automatically assigned to the same cluster, the loop is executed until clustering is complete, and the parts of the image requiring handwriting erasure are then corrected, giving the optimized file image. Processing the image with image enhancement techniques strengthens the useful information in the image and improves its quality for further processing.
Character cutting module 22: processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
In the implementation of the invention, the processing the optimized file image based on an image character cutting method to obtain independent character images includes: performing binarization processing on the optimized file image to obtain a binarized image; thinning the binarized image based on a thinning algorithm to obtain a character skeleton; and adaptively cutting the optimized file image by using an adaptive cutting method based on the character skeleton to obtain independent character images.
Specifically, the optimized file image is binarized: the image is converted to grayscale and, for the 256-level grayscale image, a suitable threshold is selected (for example, half of the gray-level range or the mean value of the image), giving a binarized image that still reflects the overall and local characteristics of the image and is convenient for subsequent thinning. The binarized image is then thinned: a neighborhood template is set, and for each pixel it is judged whether the following conditions are met: first, the number of black neighbors of the central pixel lies in [2,6]; second, deleting the pixel does not affect the connectivity of the image; third, deleting the pixel does not break a horizontal line; and fourth, deleting the pixel does not break a vertical line. A pixel is deleted when all four conditions are met, and the deletion is repeated until no pixel satisfies the conditions, yielding the character skeleton. After the character skeleton is obtained by the thinning algorithm, intersection points are located on the skeleton, an optimal interval is set, the character image is superimposed on the original image according to its positional relationship, and thickening is used to obtain a dividing line; the distance from each intersection point to the dividing line is judged, a tolerance is defined for this distance, intersection points within the tolerance are regarded as necessary and unnecessary intersection points are removed, and the cutting operation is performed to obtain independent character images. Cutting the image prepares for convenient subsequent character recognition, so that the recognition processing can be performed more quickly.
Character recognition module 23: performing recognition processing on the independent character images based on a binary-image character recognition algorithm to obtain text information;
In the implementation of the invention, the performing recognition processing on the independent character images based on a binary-image character recognition algorithm to obtain text information includes: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot matrix features, feature lines and grid features; inputting the feature information of the binarized independent character image into a BP neural network for character classification processing to obtain classified characters; and performing fine classification recognition on the classified characters based on a hierarchical recognition algorithm to obtain text information.
Specifically, the independent character image is binarized: it is converted to grayscale and, for the 256-level grayscale image, a suitable threshold is selected (for example, half of the gray-level range or the mean value of the image), giving a binarized independent character image that still reflects the overall and local characteristics of the image; feature extraction can then proceed. Dot matrix features, feature lines and grid features are extracted from the binarized independent character image: the image is placed in a coordinate system and the value of each point is determined by the presence or absence of a pixel, and every eight pixel values are packed into a byte to obtain the dot matrix features; the numbers of line segments in even rows and columns are counted to form a two-dimensional feature vector, and the dot matrix is recombined to obtain the feature lines; the dot matrix structure is partitioned and the number of foreground pixels in each partition is counted as a statistical feature to obtain the grid features. The dot matrix features reflect the overall characteristics of the character image, while the feature lines and grid features reflect its local characteristics, so the features are complementary. A BP neural network with an input layer, a hidden layer and an output layer is trained on samples using a log-sigmoid transfer function with a maximum number of iterations set; the feature information of the binarized independent character image is input into the trained BP neural network for character classification, giving classified characters. The classified characters are then finely recognized with a hierarchical recognition algorithm: they are first grouped by clustering, the classified character images are input into the corresponding nodes of a tree structure to complete coarse classification, a probabilistic averaging function is used for integration, and the child nodes of the coarse classification results are further classified to obtain the recognition results and the text information; the hierarchical recognition algorithm achieves better recognition accuracy.
Preprocessing module 24: carrying out data preprocessing on the text information to obtain preprocessed text information;
In the implementation of the invention, the performing data preprocessing on the text information to obtain preprocessed text information includes: performing corpus cleaning on the text information to obtain corpus-cleaned text information; performing word segmentation on the corpus-cleaned text information to obtain word-segmented text information; and performing part-of-speech tagging on the word-segmented text information to obtain part-of-speech-tagged text information.
Specifically, corpus cleaning is performed on the text information: regular-expression matching rules are used to match character strings in the text against the rules, and special characters, duplicate data and stop words are removed, giving corpus-cleaned text information. Word segmentation is then performed: the reliability of a word formation is reflected by how often characters occur adjacently, the frequency of character combinations that commonly appear adjacently in the corpus is counted, and when the combination frequency exceeds a threshold the combination is considered to form a word, giving word-segmented text information. Part-of-speech tagging is then performed, treated as sequence labeling: given a sequence of units, a tag is assigned to each unit, the probability distribution over possible tag sequences is computed, the best tag sequence is selected, and the most probable part of speech of each word is determined and tagged, giving part-of-speech-tagged text information. Preprocessing makes the text information cleaner, more accurate and more reliable.
Feature extraction module 25: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors;
In the implementation of the invention, the performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain feature vectors includes: converting the preprocessed text information into numerical form by using an embedding algorithm to obtain word vectors; and inputting the word vectors into a CNN model for feature extraction to obtain feature vectors.
Specifically, an embedding algorithm is used to convert the preprocessed text information into numerical form: each word of the preprocessed text is treated as a feature column through a one-hot function and mapped into a word-vector space to obtain raw word vectors, whose dimensionality is then reduced to obtain word vectors that can be fed to the input layer of the CNN model. The CNN model consists of an input layer, a convolution layer, a pooling layer and an output layer: the input layer receives the word vectors; the convolution layer convolves the input word vectors with filters, using the ReLU activation function, to obtain convolved word vectors; the pooling layer pools the convolved word vectors and applies dropout regularization to obtain feature vectors; and the output layer outputs the extracted feature vectors. In this way, one-dimensional text information is converted into a two-dimensional input vector that meets the input requirements of the model.
Classification and parsing module 26: inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
In the implementation process of the invention, the step of inputting the feature vector into a language model and carrying out classification and analysis processing of text information based on the language model comprises the following steps: constructing a TEXTCNN model as an initial language model, wherein the network structure of the TEXTCNN model comprises: an input layer, a convolution layer, a pooling layer, a full connection layer and an output layer; training the initial language model to obtain a trained language model; and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
Specifically, the language model structure includes: the input layer inputs the feature vector, the convolution layer carries out convolution processing on the input feature vector, wherein the used activation function is a Relu function, the convolution layer comprises a plurality of convolution kernels, and the convolution kernels in the convolution layer only slide in one direction; the Pooling layer carries out maximum Pooling treatment on the convolved feature vector, and the Pooling layer only carries out max-Pooling operation in one direction; the full-connection layer carries out cascading treatment on the pooled feature vectors, wherein dropout regularization is added, and overfitting is prevented; the output layer uses a softmax activation function to compress the dimensions of the number of categories of the vector, obtain the probability of classifying the text into different categories, and obtain text semantic information; the textCNN model has the advantages of simpler model and high training speed, and can achieve good effect.
Specifically, the training process of the language model includes: collecting text information data, and labeling the text information data to obtain labeled text information data; performing text vectorization processing on the text information data with the labels to obtain feature vectors with the labels; importing the feature vector with the label into an initial language model; dividing the feature vector with the label in the initial language model into a training set, a verification set and a test set; training the initial language model by using a training set, verifying the accuracy of the model by using a verification set after each round of training, and adjusting model parameters according to the result of the verification set to obtain a trained language model; testing the trained language model by using a test set, and verifying generalization capability and accuracy of the trained language model; optimizing the model based on the result of the test set, wherein the optimizing the model comprises sequentially calculating and weighting each layer from the input layer in the forward propagation process aiming at the divided data set, and obtaining a preliminary output result through a nonlinear function, namely an activation function; calculating an error value by using the loss function, and carrying out back propagation according to the error value; at the end of each batch processing, the parameters of the model are updated using the optimizer.
In the embodiment of the invention, NLP and AI technologies are adopted so that recognized text content achieves better recognition accuracy and errors caused by content recognition are greatly reduced. Based on a deep-learning method, intelligent analysis and application of massive data content within a short time is realized through three stages: corpus preprocessing, model design and model training. This improves data quality, reduces manual entry work and business processes, improves work efficiency, reduces labor input, and provides rich information input for various business systems, thereby providing big-data support for the intelligent operation of enterprises.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware; the program may be stored in a computer-readable storage medium, and the storage medium may include: read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and the like.
In addition, the intelligent content identification and analysis method and device based on NLP and AI provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the above description of the embodiments is intended only to help in understanding the method and its core ideas. Meanwhile, those skilled in the art may make changes to the specific implementations and the scope of application in accordance with the ideas of the present invention. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (8)

1. An intelligent content identification and analysis method based on NLP and AI, which is characterized by comprising the following steps:
acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
processing the file image after optimization processing based on an image character cutting method to obtain independent character images;
performing recognition processing on the independent character image based on a character recognition algorithm for binary images to obtain text information, comprising the following steps: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot-matrix features, feature lines and grid features: placing the binarized independent character image in a coordinate system, splicing the pixel values of the binarized independent character image in the coordinate system to obtain the dot-matrix features, recombining the two-dimensional feature vectors generated from the dot-matrix features to obtain the feature lines, and performing structure partitioning processing on the dot-matrix features to obtain the grid features; inputting the feature information of the binarized independent character image into a BP neural network for character classification processing to obtain classified characters; and performing fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm: clustering the classified characters based on a clustering method, inputting the clustered character images into the corresponding nodes of a tree structure for coarse classification, integrating the coarsely classified character images based on probabilistic averaging functions, and classifying the integrated character images using the sub-nodes of the coarse classification results to obtain the text information;
carrying out data preprocessing on the text information to obtain preprocessed text information;
performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain a feature vector, wherein the text vectorization processing comprises the following steps: performing numerical processing on the preprocessed text information by adopting an Embedding algorithm, mapping each word of the preprocessed text information into a word-vector space based on a one-hot function to obtain an original word vector, performing dimension reduction processing on the original word vector to obtain a word vector, and inputting the word vector into a CNN model for feature extraction to obtain the feature vector;
inputting the feature vector into a language model, and classifying and analyzing text information based on the language model.
2. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the acquiring an original file image and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image comprises:
acquiring an original file image, and denoising the original file image by using a median filtering method to obtain a denoised file image;
performing enhanced image edge processing on the de-noised file image based on a Hilbert transform method to obtain the file image after the enhanced image edge processing;
performing bending correction processing on the file image subjected to the enhanced image edge processing based on an offset field method to obtain a file image subjected to the bending correction processing;
and carrying out handwriting erasure processing on the file image subjected to the bending correction processing based on the handwriting erasure clustering method to obtain the file image subjected to the optimization processing.
3. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the processing of the optimized file image based on the image character cutting method to obtain independent character images comprises:
performing binarization processing on the file image after optimization processing to obtain a binarized image;
carrying out refinement processing on the binarized image based on a refinement algorithm to obtain a character skeleton;
and performing self-adaptive cutting on the optimized file image by using a self-adaptive cutting method based on the character skeleton to obtain an independent character image.
4. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the performing data preprocessing on the text information to obtain preprocessed text information comprises:
carrying out corpus cleaning processing on the text information to obtain text information after corpus cleaning processing;
performing word segmentation processing on the text information after corpus cleaning to obtain text information after word segmentation processing;
and performing part-of-speech tagging on the text information subjected to word segmentation processing to obtain the text information subjected to part-of-speech tagging.
5. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the structure of the CNN model comprises:
an input layer, a convolution layer, a pooling layer and an output layer, wherein the input layer inputs the word vector; the convolution layer performs convolution processing on the input word vector with filters, the activation function used being a ReLU function, to obtain a convolved word vector; the pooling layer performs pooling processing on the convolved word vector and adds dropout regularization to obtain the feature vector; and the output layer outputs the extracted feature vector.
6. The intelligent content recognition and analysis method based on NLP and AI of claim 1, wherein the inputting the feature vector into a language model and classifying and analyzing text information based on the language model comprises:
constructing a TextCNN model as an initial language model, wherein the network structure of the TextCNN model comprises: an input layer, a convolution layer, a pooling layer, a fully connected layer and an output layer;
training the initial language model to obtain a trained language model;
and inputting the feature vectors into a trained language model, and classifying and analyzing the text information.
7. The intelligent content recognition and analysis method based on NLP and AI of claim 6, wherein the training process of the language model comprises:
collecting text information data, and labeling the text information data to obtain labeled text information data;
performing text vectorization processing on the text information data with the labels to obtain feature vectors with the labels;
importing the feature vector with the label into an initial language model;
dividing the feature vector with the label in the initial language model into a training set, a verification set and a test set;
training the initial language model by using a training set, verifying the accuracy of the model by using a verification set after each round of training, and adjusting model parameters according to the result of the verification set to obtain a trained language model;
testing the trained language model by using a test set, and verifying generalization capability and accuracy of the trained language model;
and optimizing the model based on the result of the test set.
8. An intelligent content recognition and analysis device based on NLP and AI, the device comprising:
an optimization processing module, configured for: acquiring an original file image, and performing optimization processing on the original file image by using an image enhancement processing technology to obtain an optimized file image;
a character cutting module, configured for: processing the optimized file image based on an image character cutting method to obtain independent character images;
a character recognition module, configured for: performing recognition processing on the independent character image based on a character recognition algorithm for binary images to obtain text information, comprising: performing binarization processing on the independent character image to obtain a binarized independent character image; extracting feature information of the binarized independent character image based on a feature extraction method, wherein the feature information comprises dot-matrix features, feature lines and grid features: placing the binarized independent character image in a coordinate system, splicing the pixel values of the binarized independent character image in the coordinate system to obtain the dot-matrix features, recombining the two-dimensional feature vectors generated from the dot-matrix features to obtain the feature lines, and performing structure partitioning processing on the dot-matrix features to obtain the grid features; inputting the feature information of the binarized independent character image into a BP neural network for character classification processing to obtain classified characters; and performing fine classification recognition processing on the classified characters based on a hierarchical recognition algorithm: clustering the classified characters based on a clustering method, inputting the clustered character images into the corresponding nodes of a tree structure for coarse classification, integrating the coarsely classified character images based on probabilistic averaging functions, and classifying the integrated character images using the sub-nodes of the coarse classification results to obtain the text information;
a preprocessing module, configured for: carrying out data preprocessing on the text information to obtain preprocessed text information;
a feature extraction module, configured for: performing text vectorization processing on the preprocessed text information based on a feature extraction method to obtain a feature vector, wherein the text vectorization processing comprises: performing numerical processing on the preprocessed text information by adopting an Embedding algorithm, mapping each word of the preprocessed text information into a word-vector space based on a one-hot function to obtain an original word vector, performing dimension reduction processing on the original word vector to obtain a word vector, and inputting the word vector into a CNN model for feature extraction to obtain the feature vector;
and a classification and analysis module, configured for: inputting the feature vector into a language model, and classifying and analyzing the text information based on the language model.
CN202310726304.5A 2023-06-16 2023-06-16 Intelligent content identification and analysis method and device based on NLP and AI Active CN116912845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310726304.5A CN116912845B (en) 2023-06-16 2023-06-16 Intelligent content identification and analysis method and device based on NLP and AI

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310726304.5A CN116912845B (en) 2023-06-16 2023-06-16 Intelligent content identification and analysis method and device based on NLP and AI

Publications (2)

Publication Number Publication Date
CN116912845A CN116912845A (en) 2023-10-20
CN116912845B true CN116912845B (en) 2024-03-19

Family

ID=88359159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310726304.5A Active CN116912845B (en) 2023-06-16 2023-06-16 Intelligent content identification and analysis method and device based on NLP and AI

Country Status (1)

Country Link
CN (1) CN116912845B (en)

Citations (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329731A (en) * 2008-06-06 2008-12-24 南开大学 Automatic recognition method of mathematical formula in image
CN101329734A (en) * 2008-07-31 2008-12-24 重庆大学 License plate character recognition method based on K-L transform and LS-SVM
CN102024144A (en) * 2010-11-23 2011-04-20 上海海事大学 Container number identification method
CN102509091A (en) * 2011-11-29 2012-06-20 北京航空航天大学 Airplane tail number recognition method
US8699796B1 (en) * 2008-11-11 2014-04-15 Trend Micro Incorporated Identifying sensitive expressions in images for languages with large alphabets
CA2920795A1 (en) * 2014-02-07 2015-08-13 Cellular South, Inc Dba C Spire Wire Wireless Video to data
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
WO2018086519A1 (en) * 2016-11-08 2018-05-17 北京国双科技有限公司 Method and device for identifying specific text information
CN108564079A (en) * 2018-05-08 2018-09-21 东华大学 A kind of portable character recognition device and method
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system
CN109376658A (en) * 2018-10-26 2019-02-22 信雅达系统工程股份有限公司 A kind of OCR method based on deep learning
CN109460735A (en) * 2018-11-09 2019-03-12 中国科学院自动化研究所 Document binary processing method, system, device based on figure semi-supervised learning
CN110059694A (en) * 2019-04-19 2019-07-26 山东大学 The intelligent identification Method of lteral data under power industry complex scene
CN110297907A (en) * 2019-06-28 2019-10-01 谭浩 Generate method, computer readable storage medium and the terminal device of interview report
CN110298376A (en) * 2019-05-16 2019-10-01 西安电子科技大学 A kind of bank money image classification method based on improvement B-CNN
WO2019195736A2 (en) * 2018-04-05 2019-10-10 Chevron U.S.A. Inc. Classification of piping and instrumental diagram information using machine-learning
CN110363194A (en) * 2019-06-17 2019-10-22 深圳壹账通智能科技有限公司 Intelligently reading method, apparatus, equipment and storage medium based on NLP
CN110457466A (en) * 2019-06-28 2019-11-15 谭浩 Generate method, computer readable storage medium and the terminal device of interview report
CN110889402A (en) * 2019-11-04 2020-03-17 广州丰石科技有限公司 Business license content identification method and system based on deep learning
CN111046946A (en) * 2019-12-10 2020-04-21 昆明理工大学 Burma language image text recognition method based on CRNN
WO2020140386A1 (en) * 2019-01-02 2020-07-09 平安科技(深圳)有限公司 Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium
WO2020147393A1 (en) * 2019-01-17 2020-07-23 平安科技(深圳)有限公司 Convolutional neural network-based text classification method, and related device
CN112085024A (en) * 2020-09-21 2020-12-15 江苏理工学院 Tank surface character recognition method
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN112329779A (en) * 2020-11-02 2021-02-05 平安科技(深圳)有限公司 Method and related device for improving certificate identification accuracy based on mask
WO2021025290A1 (en) * 2019-08-06 2021-02-11 삼성전자 주식회사 Method and electronic device for converting handwriting input to text
EP3812950A1 (en) * 2019-10-23 2021-04-28 Tata Consultancy Services Limited Method and system for creating an intelligent cartoon comic strip based on dynamic content
CN112883980A (en) * 2021-04-28 2021-06-01 明品云(北京)数据科技有限公司 Data processing method and system
WO2021137166A1 (en) * 2019-12-30 2021-07-08 L&T Technology Services Limited Domain based text extraction
CN113158808A (en) * 2021-03-24 2021-07-23 华南理工大学 Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction
CN113191358A (en) * 2021-05-31 2021-07-30 上海交通大学 Metal part surface text detection method and system
CN113361666A (en) * 2021-06-15 2021-09-07 浪潮金融信息技术有限公司 Handwritten character recognition method, system and medium
CN113569833A (en) * 2021-07-27 2021-10-29 平安科技(深圳)有限公司 Text document-based character recognition method, device, equipment and storage medium
CN113609292A (en) * 2021-08-09 2021-11-05 上海交通大学 Known false news intelligent detection method based on graph structure
CN114265937A (en) * 2021-12-24 2022-04-01 中国电力科学研究院有限公司 Intelligent classification analysis method and system of scientific and technological information, storage medium and server
CN114973228A (en) * 2022-05-31 2022-08-30 上海交通大学 Metal part surface text recognition method and system based on contour feature enhancement
WO2022178919A1 (en) * 2021-02-23 2022-09-01 西安交通大学 Taxpayer industry classification method based on noise label learning
CN115114908A (en) * 2022-07-04 2022-09-27 上海交通大学 Text intention recognition method and system based on attention mechanism, vehicle and equipment
CN115471851A (en) * 2022-10-11 2022-12-13 小语智能信息科技(云南)有限公司 Burma language image text recognition method and device fused with double attention mechanism
CN115546796A (en) * 2022-09-22 2022-12-30 中电智元数据科技有限公司 Non-contact data acquisition method and system based on visual computation
US11557323B1 (en) * 2022-03-15 2023-01-17 My Job Matcher, Inc. Apparatuses and methods for selectively inserting text into a video resume
CN115731550A (en) * 2022-11-23 2023-03-03 大连医科大学附属第二医院 Deep learning-based automatic drug specification identification method and system and storage medium
CN115862045A (en) * 2023-02-16 2023-03-28 中国人民解放军总医院第一医学中心 Case automatic identification method, system, equipment and storage medium based on image-text identification technology
CN115880566A (en) * 2022-12-16 2023-03-31 李宜义 Intelligent marking system based on visual analysis
CN115953788A (en) * 2022-12-02 2023-04-11 兴业银行股份有限公司 Green financial attribute intelligent identification method and system based on OCR (optical character recognition) and NLP (natural language processing) technologies

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10218954B2 (en) * 2013-08-15 2019-02-26 Cellular South, Inc. Video to data
US10372816B2 (en) * 2016-12-13 2019-08-06 International Business Machines Corporation Preprocessing of string inputs in natural language processing
US11295123B2 (en) * 2017-09-14 2022-04-05 Chevron U.S.A. Inc. Classification of character strings using machine-learning
US11218500B2 (en) * 2019-07-31 2022-01-04 Secureworks Corp. Methods and systems for automated parsing and identification of textual data
US11651157B2 (en) * 2020-07-29 2023-05-16 Descript, Inc. Filler word detection through tokenizing and labeling of transcripts
US11790025B2 (en) * 2021-03-30 2023-10-17 Streem, Llc Extracting data and metadata from identification labels using natural language processing

Patent Citations (46)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101329731A (en) * 2008-06-06 2008-12-24 南开大学 Automatic recognition method of mathematical formula in image
CN101329734A (en) * 2008-07-31 2008-12-24 重庆大学 License plate character recognition method based on K-L transform and LS-SVM
US8699796B1 (en) * 2008-11-11 2014-04-15 Trend Micro Incorporated Identifying sensitive expressions in images for languages with large alphabets
CN102024144A (en) * 2010-11-23 2011-04-20 上海海事大学 Container number identification method
CN102509091A (en) * 2011-11-29 2012-06-20 北京航空航天大学 Airplane tail number recognition method
CA2920795A1 (en) * 2014-02-07 2015-08-13 Cellular South, Inc Dba C Spire Wire Wireless Video to data
WO2018086519A1 (en) * 2016-11-08 2018-05-17 北京国双科技有限公司 Method and device for identifying specific text information
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
WO2019195736A2 (en) * 2018-04-05 2019-10-10 Chevron U.S.A. Inc. Classification of piping and instrumental diagram information using machine-learning
CN108564079A (en) * 2018-05-08 2018-09-21 东华大学 A kind of portable character recognition device and method
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN109359293A (en) * 2018-09-13 2019-02-19 内蒙古大学 Mongolian name entity recognition method neural network based and its identifying system
CN109376658A (en) * 2018-10-26 2019-02-22 信雅达系统工程股份有限公司 A kind of OCR method based on deep learning
CN109460735A (en) * 2018-11-09 2019-03-12 中国科学院自动化研究所 Document binary processing method, system, device based on figure semi-supervised learning
WO2020140386A1 (en) * 2019-01-02 2020-07-09 平安科技(深圳)有限公司 Textcnn-based knowledge extraction method and apparatus, and computer device and storage medium
WO2020147393A1 (en) * 2019-01-17 2020-07-23 平安科技(深圳)有限公司 Convolutional neural network-based text classification method, and related device
CN110059694A (en) * 2019-04-19 2019-07-26 山东大学 The intelligent identification Method of lteral data under power industry complex scene
CN110298376A (en) * 2019-05-16 2019-10-01 西安电子科技大学 A kind of bank money image classification method based on improvement B-CNN
CN110363194A (en) * 2019-06-17 2019-10-22 深圳壹账通智能科技有限公司 Intelligently reading method, apparatus, equipment and storage medium based on NLP
WO2020253043A1 (en) * 2019-06-20 2020-12-24 平安科技(深圳)有限公司 Intelligent text classification method and apparatus, and computer-readable storage medium
CN110457466A (en) * 2019-06-28 2019-11-15 谭浩 Generate method, computer readable storage medium and the terminal device of interview report
CN110297907A (en) * 2019-06-28 2019-10-01 谭浩 Generate method, computer readable storage medium and the terminal device of interview report
WO2021025290A1 (en) * 2019-08-06 2021-02-11 삼성전자 주식회사 Method and electronic device for converting handwriting input to text
EP3812950A1 (en) * 2019-10-23 2021-04-28 Tata Consultancy Services Limited Method and system for creating an intelligent cartoon comic strip based on dynamic content
CN110889402A (en) * 2019-11-04 2020-03-17 广州丰石科技有限公司 Business license content identification method and system based on deep learning
CN111046946A (en) * 2019-12-10 2020-04-21 昆明理工大学 Burma language image text recognition method based on CRNN
WO2021137166A1 (en) * 2019-12-30 2021-07-08 L&T Technology Services Limited Domain based text extraction
CN112085024A (en) * 2020-09-21 2020-12-15 江苏理工学院 Tank surface character recognition method
CN112329779A (en) * 2020-11-02 2021-02-05 平安科技(深圳)有限公司 Method and related device for improving certificate identification accuracy based on mask
WO2022178919A1 (en) * 2021-02-23 2022-09-01 西安交通大学 Taxpayer industry classification method based on noise label learning
CN113158808A (en) * 2021-03-24 2021-07-23 华南理工大学 Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction
CN112883980A (en) * 2021-04-28 2021-06-01 明品云(北京)数据科技有限公司 Data processing method and system
CN113191358A (en) * 2021-05-31 2021-07-30 上海交通大学 Metal part surface text detection method and system
CN113361666A (en) * 2021-06-15 2021-09-07 浪潮金融信息技术有限公司 Handwritten character recognition method, system and medium
CN113569833A (en) * 2021-07-27 2021-10-29 平安科技(深圳)有限公司 Text document-based character recognition method, device, equipment and storage medium
CN113609292A (en) * 2021-08-09 2021-11-05 上海交通大学 Known false news intelligent detection method based on graph structure
CN114265937A (en) * 2021-12-24 2022-04-01 中国电力科学研究院有限公司 Intelligent classification analysis method and system of scientific and technological information, storage medium and server
US11557323B1 (en) * 2022-03-15 2023-01-17 My Job Matcher, Inc. Apparatuses and methods for selectively inserting text into a video resume
CN114973228A (en) * 2022-05-31 2022-08-30 上海交通大学 Metal part surface text recognition method and system based on contour feature enhancement
CN115114908A (en) * 2022-07-04 2022-09-27 上海交通大学 Text intention recognition method and system based on attention mechanism, vehicle and equipment
CN115546796A (en) * 2022-09-22 2022-12-30 中电智元数据科技有限公司 Non-contact data acquisition method and system based on visual computation
CN115471851A (en) * 2022-10-11 2022-12-13 小语智能信息科技(云南)有限公司 Burma language image text recognition method and device fused with double attention mechanism
CN115731550A (en) * 2022-11-23 2023-03-03 大连医科大学附属第二医院 Deep learning-based automatic drug specification identification method and system and storage medium
CN115953788A (en) * 2022-12-02 2023-04-11 兴业银行股份有限公司 Green financial attribute intelligent identification method and system based on OCR (optical character recognition) and NLP (natural language processing) technologies
CN115880566A (en) * 2022-12-16 2023-03-31 李宜义 Intelligent marking system based on visual analysis
CN115862045A (en) * 2023-02-16 2023-03-28 中国人民解放军总医院第一医学中心 Case automatic identification method, system, equipment and storage medium based on image-text identification technology

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification; Wang, Peng et al.; Neurocomputing; 2016-12-31; vol. 174; 806-814 *
Research and implementation of a hierarchical recognition algorithm for offline handwritten Chinese characters; Gou Xiaopan; China Master's Theses Full-text Database, Information Science and Technology; 2023-01-15; I138-2184 *

Also Published As

Publication number Publication date
CN116912845A (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN108664996B (en) Ancient character recognition method and system based on deep learning
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN110503103B (en) Character segmentation method in text line based on full convolution neural network
CN111738169B (en) Handwriting formula recognition method based on end-to-end network model
US20200134382A1 (en) Neural network training utilizing specialized loss functions
CN112418320A (en) Enterprise association relation identification method and device and storage medium
CN111461121A (en) Electric meter number identification method based on YOLOv3 network
CN113569048A (en) Method and system for automatically dividing affiliated industries based on enterprise operation range
CN116912845B (en) Intelligent content identification and analysis method and device based on NLP and AI
CN116543391A (en) Text data acquisition system and method combined with image correction
CN115100509B (en) Image identification method and system based on multi-branch block-level attention enhancement network
CN111401434A (en) Image classification method based on unsupervised feature learning
US11715288B2 (en) Optical character recognition using specialized confidence functions
Nath et al. Improving various offline techniques used for handwritten character recognition: a review
CN112766082B (en) Chinese text handwriting identification method and device based on macro-micro characteristics and storage medium
CN114758340A (en) Intelligent identification method, device and equipment for logistics address and storage medium
CN112434145A (en) Picture-viewing poetry method based on image recognition and natural language processing
CN113094567A (en) Malicious complaint identification method and system based on text clustering
CN113033567A (en) Oracle bone rubbing image character extraction method fusing segmentation network and generation network
CN112926670A (en) Garbage classification system and method based on transfer learning
CN117371533B (en) Method and device for generating data tag rule
CN117496531B (en) Construction method of convolution self-encoder capable of reducing Chinese character recognition resource overhead
CN112819205B (en) Method, device and system for predicting working hours
CN114548325B (en) Zero sample relation extraction method and system based on dual contrast learning
Alginahi Computer analysis of composite documents with non-uniform background.

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant