CN114638219A - Intelligent wrong word recognition method based on machine learning algorithm
- Publication number
- CN114638219A (application CN202210137942.9A)
- Authority
- CN
- China
- Prior art keywords
- words
- wrong
- wrong word
- word
- recognition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses an intelligent wrong word recognition method based on a machine learning algorithm. It belongs to the technical field of intelligent information recognition and addresses the analysis errors that arise when a wrong word recognition technology cannot automatically update its wrong word library. The method comprises the following steps: (1) establishing a wrong word recognition management system on an application platform; (2) the wrong word recognition management system classifies the wrong words found in information applications; (3) the wrong word recognition management system connects to a server to establish a wrong word library, and learns and records the words in that library; (4) the wrong word recognition management system collects all published manuscripts to form a historical manuscript library, applies a neural network algorithm to recognize and learn from the historical manuscript library, and updates the wrong word library with the results; (5) after artificial intelligence recognition, an alarm is raised so that the flagged words can be manually verified and corrected. The invention is used for highly intelligent recognition of wrong words in the information manuscripts of various applications.
Description
Technical Field
The invention belongs to the technical field of intelligent information identification, and particularly relates to an intelligent wrong word identification method based on a machine learning algorithm.
Background
Character recognition is a technology that uses a computer to recognize characters automatically, and it is an important application area of pattern recognition. People need to process large volumes of text, reports and documents in production and daily life. To reduce labor and improve processing efficiency, general character recognition methods began to be explored in the 1950s, and the optical character reader was developed. In the 1960s, practical machines using magnetic ink and special fonts were introduced. In the late 1960s, machines that recognized multiple typefaces and handwritten characters appeared, with recognition accuracy and machine performance that basically met requirements; examples include handwritten numeral recognition machines for letter sorting and printed English letter and numeral recognition machines. In the 1970s, work focused on the basic theory of character recognition and on developing high-performance character recognition machines, and research on character recognition received greater emphasis.
Character recognition generally comprises several parts: collecting the character information, analyzing and processing the information, and classifying and discriminating the information.
Information collection converts the gray levels of the characters on paper into electrical signals and feeds them into a computer. It is carried out by the paper feeding mechanism and the photoelectric conversion device of the character recognition machine; photoelectric conversion devices include flying-spot scanners, cameras, photosensitive elements, laser scanners and the like.
Information analysis and processing removes the noise and interference introduced by print quality, paper quality (uniformity, stains, etc.) or writing instruments from the converted electrical signals, and normalizes them for size, skew, shading, stroke thickness and so on.
Information classification and discrimination classifies the denoised, normalized character information and outputs the recognition result.
At present, the wrongly written word recognition used in various apps and back-end management systems typically relies on fuzzy-query matching against the words in an article. This approach cannot automatically add entries to a wrong word library or intelligently recognize the emotional tendency of an article, so it cannot determine whether a sensitive word is used in the wrong context.
Disclosure of Invention
The object of the invention is:
to provide an intelligent wrong word recognition method based on a machine learning algorithm, in order to solve the prior-art problem that the wrong word recognition technologies in various apps cannot automatically update the wrong word library, which leads to analysis errors in wrong word recognition.
The technical scheme adopted by the invention is as follows:
an intelligent wrong word recognition method based on a machine learning algorithm comprises the following steps:
(1) establishing a wrong word recognition management system on the application platform, the system being used to recognize and manage wrong words in a news media information app;
(2) the wrong word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and common error-prone near-synonyms, wherein numeral usage errors comprise numeral-case errors and numeric symbol errors;
(3) the wrong word recognition management system connects to a server to establish a wrong word library, and learns and records the words in it, wherein the wrong word library comprises a name library, a position library and a wrongly written word library;
(4) the wrong word recognition management system collects all published manuscripts to form a historical manuscript library, applies a neural network algorithm to recognize and learn from the historical manuscript library, and updates the wrong word library with the results;
(5) after a manuscript to be checked has been recognized by artificial intelligence, the wrong word recognition management system raises an alarm on the wrong words and submits the flagged words for manual verification and correction.
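The lookup-and-alarm part of steps (2), (3) and (5) can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the library contents, the `Alert` record and the `scan_manuscript` function are hypothetical names, the entries are English stand-ins, and the neural network learning of step (4) is not modeled.

```python
from dataclasses import dataclass

# Hypothetical wrong word library: wrong form -> (category, suggested correction).
# The entries are illustrative stand-ins, not data from the patent.
WRONG_WORD_LIBRARY = {
    "recieve": ("error-prone word", "receive"),
    "Jon Smith": ("personal name", "John Smith"),
    "Vice Directer": ("position", "Vice Director"),
}

@dataclass
class Alert:
    offset: int       # character position of the match in the manuscript
    found: str        # the wrong form that was matched
    category: str     # classification from step (2)
    suggestion: str   # correction offered to the human reviewer

def scan_manuscript(text, library=WRONG_WORD_LIBRARY):
    """Return one alert per library entry occurrence, ordered by position."""
    alerts = []
    for wrong, (category, suggestion) in library.items():
        start = text.find(wrong)
        while start != -1:
            alerts.append(Alert(start, wrong, category, suggestion))
            start = text.find(wrong, start + 1)
    return sorted(alerts, key=lambda a: a.offset)

# Step (5): the alerts are reported for manual verification, not applied automatically.
alerts = scan_manuscript("Vice Directer Jon Smith will recieve the award.")
for a in alerts:
    print(f"{a.offset}: {a.found!r} [{a.category}] -> suggest {a.suggestion!r}")
```

Keeping the category with each entry lets the reviewer see which of the classes from step (2) triggered the alarm.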
Further, in step (4), an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical manuscript library appear together. When a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded. When a manuscript is checked, the wrong word recognition management system raises an error alarm if it finds a word sequence that overlaps a recorded fixed phrase but differs from it in part.
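One simple way to picture this fixed-phrase check is with n-gram counts. The sketch below is an assumption-laden illustration: the patent does not name the algorithm, so frequency counting stands in for the "artificial intelligence algorithm", whitespace-separated English tokens stand in for the original text units, and `learn_fixed_phrases`, `flag_near_misses`, `n` and `min_count` are hypothetical names and parameters.

```python
from collections import Counter

def ngrams(tokens, n):
    """All consecutive n-token windows of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def learn_fixed_phrases(history, n=3, min_count=3):
    """Record every n-gram used at least min_count times in the history."""
    counts = Counter()
    for doc in history:
        counts.update(ngrams(doc.lower().split(), n))
    return {gram for gram, c in counts.items() if c >= min_count}

def flag_near_misses(text, fixed_phrases, n=3):
    """Flag n-grams that match a fixed phrase in all but one position."""
    alerts = []
    for gram in ngrams(text.lower().split(), n):
        if gram in fixed_phrases:
            continue  # exact use of a fixed phrase is fine
        for phrase in fixed_phrases:
            mismatches = sum(a != b for a, b in zip(gram, phrase))
            if mismatches == 1:
                alerts.append((" ".join(gram), " ".join(phrase)))
    return alerts

history = [
    "the state council issued a notice",
    "a meeting of the state council was held",
    "the state council approved the plan",
]
fixed = learn_fixed_phrases(history, n=3, min_count=3)
alerts = flag_near_misses("yesterday the stale council met", fixed)
print(alerts)
```

Here "the state council" reaches the usage threshold and is recorded, so "the stale council" is flagged because it overlaps the fixed phrase and differs in one token.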
Further, a decision tree algorithm is used for the machine learning analysis of wrong words, comprising the following steps:
a. decision tree generation: a decision tree is generated from a wrong word sample set, where the sample set is a historical data set, assembled according to actual needs and integrated to a certain degree, that is used for data analysis and processing;
b. decision tree pruning: the decision tree generated in the previous stage is checked, corrected and revised; the preliminary rules produced during tree generation are tested against data from a new wrong word sample set, and branches that reduce prediction accuracy are pruned.
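Steps a and b can be sketched in plain Python as follows. The patent does not specify the splitting criterion or the pruning rule, so this sketch assumes an ID3-style information-gain split and reduced-error pruning against a fresh sample set; the features `in_lexicon` and `has_digit`, the labels, and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy of the labels; each row is (feature_dict, label)."""
    counts = Counter(label for _, label in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def majority(rows):
    return Counter(label for _, label in rows).most_common(1)[0][0]

def split(rows, feature):
    groups = {}
    for row in rows:
        groups.setdefault(row[0][feature], []).append(row)
    return groups

def build_tree(rows, features):
    """Step a: grow a tree from the wrong word sample set."""
    if len({label for _, label in rows}) == 1 or not features:
        return {"leaf": majority(rows)}

    def gain(feature):
        parts = split(rows, feature).values()
        return entropy(rows) - sum(len(p) / len(rows) * entropy(p) for p in parts)

    best = max(features, key=gain)
    rest = [f for f in features if f != best]
    return {
        "feature": best,
        "default": majority(rows),  # fallback label, also used when pruning
        "children": {v: build_tree(g, rest) for v, g in split(rows, best).items()},
    }

def predict(tree, x):
    while "leaf" not in tree:
        child = tree["children"].get(x[tree["feature"]])
        if child is None:
            return tree["default"]
        tree = child
    return tree["leaf"]

def accuracy(tree, rows):
    return sum(predict(tree, x) == y for x, y in rows) / len(rows)

def prune(tree, validation):
    """Step b: replace a subtree by a leaf when that does not hurt accuracy
    on the new sample set (the tree is modified in place)."""
    if "leaf" in tree:
        return tree
    for value, child in list(tree["children"].items()):
        subset = [r for r in validation if r[0][tree["feature"]] == value]
        tree["children"][value] = prune(child, subset)
    if not validation:
        return tree  # no evidence either way, keep the subtree
    leaf = {"leaf": tree["default"]}
    return leaf if accuracy(leaf, validation) >= accuracy(tree, validation) else tree

def sample(in_lexicon, has_digit, label):
    return ({"in_lexicon": in_lexicon, "has_digit": has_digit}, label)

train = [sample(True, False, "wrong"), sample(True, True, "wrong"),
         sample(True, False, "wrong"), sample(False, False, "ok"),
         sample(False, False, "ok"), sample(False, True, "wrong")]  # last row is noise
validation = [sample(True, False, "wrong"), sample(False, True, "ok"),
              sample(False, True, "ok"), sample(False, False, "ok")]

tree = build_tree(train, ["in_lexicon", "has_digit"])
before = accuracy(tree, validation)
pruned = prune(tree, validation)
after = accuracy(pruned, validation)
print(before, after)
```

In this toy data the single noisy training row creates a `has_digit` branch that misclassifies the new samples; pruning collapses that branch into a leaf, which is the kind of correction step b describes.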
Further, after the wrong word recognition management system has recognized wrong words and raised an alarm, the recognized wrong words are classified and uploaded to the wrong word library on the server, the wrong word library is updated, and a neural network algorithm performs a new round of learning and updating.
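The update loop described here might look like the following sketch. It assumes that only reviewer-confirmed alarms are uploaded and uses a version counter as a stand-in for triggering the next learning round; the `WrongWordLibrary` class, its fields and the record layout are hypothetical, and no neural network is modeled.

```python
from collections import defaultdict

class WrongWordLibrary:
    """Toy server-side library: category -> {wrong form: correction}."""

    def __init__(self):
        self.entries = defaultdict(dict)
        self.version = 0  # bumped whenever new entries arrive

    def upload(self, reviewed_alerts):
        """Store reviewer-confirmed alerts; return how many entries were new."""
        added = 0
        for alert in reviewed_alerts:
            if not alert["confirmed"]:
                continue  # rejected by the reviewer, do not learn it
            bucket = self.entries[alert["category"]]
            if alert["wrong"] not in bucket:
                bucket[alert["wrong"]] = alert["correction"]
                added += 1
        if added:
            self.version += 1  # signals that a new learning round is due
        return added

library = WrongWordLibrary()
reviewed = [
    {"wrong": "recieve", "correction": "receive",
     "category": "error-prone word", "confirmed": True},
    {"wrong": "Jon Smith", "correction": "John Smith",
     "category": "personal name", "confirmed": False},
]
added = library.upload(reviewed)
print(added, library.version)
```

Uploading the same reviewed batch a second time adds nothing and leaves the version unchanged, so a learning round is only requested when the library actually grows.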
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
1. Building on the prior art, the invention uses artificial intelligence recognition to continuously learn from, update and recognize against the wrong word library, so that library content is added automatically. It also intelligently recognizes the emotional tendency of an article, so that by analyzing whether sensitive words are used in the wrong context it raises more precise wrong word alarms.
2. Unlike fuzzy word matching, the method analyzes the precise context in which a word is used, so a word that is written correctly but used in the wrong context can still be flagged. This greatly improves the intelligence of wrong word recognition and makes the work of people who handle text more convenient.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
An intelligent wrong word recognition method based on a machine learning algorithm comprises the following steps:
(1) establishing a wrong word recognition management system on the application platform, the system being used to recognize and manage wrong words in a news media information app;
(2) the wrong word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and common error-prone near-synonyms, wherein numeral usage errors comprise numeral-case errors and numeric symbol errors;
(3) the wrong word recognition management system connects to a server to establish a wrong word library, and learns and records the words in it, wherein the wrong word library comprises a name library, a position library and a wrongly written word library;
(4) the wrong word recognition management system collects all published manuscripts to form a historical manuscript library, applies a neural network algorithm to recognize and learn from the historical manuscript library, and updates the wrong word library with the results;
(5) after a manuscript to be checked has been recognized by artificial intelligence, the wrong word recognition management system raises an alarm on the wrong words and submits the flagged words for manual verification and correction.
After the wrong word recognition management system has recognized wrong words and raised an alarm, the recognized wrong words are classified and uploaded to the wrong word library on the server, the wrong word library is updated, and a neural network algorithm performs a new round of learning and updating.
Compared with the prior art, the neural network learns higher-level, more abstract features through machine learning and can therefore learn high-level error characteristics automatically, without a manually defined training set. Wrong words are thus recognized intelligently and the error correction rate is greatly improved.
In step (4), an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical manuscript library appear together. When a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded. When a manuscript is checked, the wrong word recognition management system raises an error alarm if it finds a word sequence that overlaps a recorded fixed phrase but differs from it in part.
Unlike fuzzy word matching, the method analyzes the precise context in which a word is used, so a word that is written correctly but used in the wrong context can still be flagged, which greatly improves the intelligence of wrong word recognition.
Preferably, a decision tree algorithm is used for the machine learning analysis of wrong words, comprising the following steps:
a. decision tree generation: a decision tree is generated from a wrong word sample set, where the sample set is a historical data set, assembled according to actual needs and integrated to a certain degree, that is used for data analysis and processing;
b. decision tree pruning: the decision tree generated in the previous stage is checked, corrected and revised; the preliminary rules produced during tree generation are tested against data from a new wrong word sample set, and branches that reduce prediction accuracy are pruned.
Example 2
An intelligent wrong word recognition method based on a machine learning algorithm comprises the following steps:
(1) establishing a wrong word recognition management system on the application platform, the system being used to recognize and manage wrong words in a news media information app;
(2) the wrong word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and common error-prone near-synonyms, wherein numeral usage errors comprise numeral-case errors and numeric symbol errors;
(3) the wrong word recognition management system connects to a server to establish a wrong word library, and learns and records the words in it, wherein the wrong word library comprises a name library, a position library and a wrongly written word library;
(4) the wrong word recognition management system collects all published manuscripts to form a historical manuscript library, applies a neural network algorithm to recognize and learn from the historical manuscript library, and updates the wrong word library with the results;
(5) after a manuscript to be checked has been recognized by artificial intelligence, the wrong word recognition management system raises an alarm on the wrong words and submits the flagged words for manual verification and correction.
After the wrong word recognition management system has recognized wrong words and raised an alarm, the recognized wrong words are classified and uploaded to the wrong word library on the server, the wrong word library is updated, and a neural network algorithm performs a new round of learning and updating.
In step (4), an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical manuscript library appear together. When a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded. When a manuscript is checked, the wrong word recognition management system raises an error alarm if it finds a word sequence that overlaps a recorded fixed phrase but differs from it in part.
Preferably, a naive Bayes algorithm is used for the machine learning analysis of wrong words. Naive Bayes is a classification algorithm; more precisely, it is not a single algorithm but a family of algorithms that share a common principle: given the class, each feature used for classification is assumed to be independent of the value of every other feature.
The Bayesian method combines the prior probability with the information in the samples to obtain the posterior probability, which avoids both the subjective bias of relying on the prior probability alone and the over-fitting that comes from relying on sample information alone. The Bayesian classification algorithm shows higher accuracy on larger data sets, and the algorithm itself is relatively simple.
The naive Bayes method simplifies the Bayes algorithm by assuming that, given the target value, the attributes are conditionally independent of one another. In other words, no attribute variable carries a disproportionately large weight in the decision result, and none carries a disproportionately small one. Although this simplification reduces the classification performance of the Bayesian classifier to some extent, it greatly reduces the complexity of the Bayesian method in practical application scenarios.
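For reference, the relationship between prior and posterior probability invoked here is Bayes' theorem. The symbols are introduced only for illustration and do not appear in the patent: c denotes a class (for example "wrong" or "correct") and x the observed features of a word.

```latex
P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}
```

P(c) is the prior, P(x | c) carries the sample information, and P(c | x) is the posterior used for classification.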
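A minimal naive Bayes classifier over categorical features, written in plain Python, shows how the independence assumption turns the posterior into a product of per-feature terms. This is a sketch under assumptions, not the patented implementation: the two features (whether the word is in the library, whether it nearly matches a fixed phrase), the labels, the Laplace smoothing constant `alpha` and the class name `NaiveBayes` are all hypothetical.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, samples):
        """samples: list of (feature_tuple, label)."""
        self.class_counts = Counter(label for _, label in samples)
        self.value_counts = defaultdict(Counter)  # (label, position) -> value counts
        self.values = defaultdict(set)            # position -> values seen in training
        for features, label in samples:
            for i, v in enumerate(features):
                self.value_counts[(label, i)][v] += 1
                self.values[i].add(v)
        return self

    def log_posteriors(self, features):
        """Unnormalized log posterior per class: log prior + sum of log likelihoods."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            score = math.log(n / total)  # log prior
            for i, v in enumerate(features):
                num = self.value_counts[(label, i)][v] + self.alpha
                den = n + self.alpha * len(self.values[i])
                score += math.log(num / den)  # each feature contributes independently
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_posteriors(features)
        return max(scores, key=scores.get)

# Features: (in_library, near_fixed_phrase); labels are illustrative.
samples = [
    (("yes", "no"), "wrong"), (("yes", "yes"), "wrong"), (("no", "yes"), "wrong"),
    (("no", "no"), "ok"), (("no", "no"), "ok"), (("yes", "no"), "ok"),
]
model = NaiveBayes().fit(samples)
print(model.predict(("yes", "yes")), model.predict(("no", "no")))
```

Working in log space avoids numeric underflow when many features are multiplied, and the smoothing keeps a feature value unseen for one class from forcing that class's probability to zero.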
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.
Claims (4)
1. An intelligent wrong word identification method based on a machine learning algorithm is characterized by comprising the following steps:
(1) establishing a wrong word recognition management system on the application platform, the system being used to recognize and manage wrong words in a news media information app;
(2) the wrong word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and common error-prone near-synonyms, wherein numeral usage errors comprise numeral-case errors and numeric symbol errors;
(3) the wrong word recognition management system connects to a server to establish a wrong word library, and learns and records the words in it, wherein the wrong word library comprises a name library, a position library and a wrongly written word library;
(4) the wrong word recognition management system collects all published manuscripts to form a historical manuscript library, applies a neural network algorithm to recognize and learn from the historical manuscript library, and updates the wrong word library with the results;
(5) after a manuscript to be checked has been recognized by artificial intelligence, the wrong word recognition management system raises an alarm on the wrong words and submits the flagged words for manual verification and correction.
2. The intelligent wrong word recognition method based on a machine learning algorithm according to claim 1, characterized in that in step (4) an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical manuscript library appear together; when a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded; and when a manuscript is checked, the wrong word recognition management system raises an error alarm if it finds a word sequence that overlaps a recorded fixed phrase but differs from it in part.
3. The intelligent wrong word recognition method based on a machine learning algorithm according to claim 1, characterized in that a decision tree algorithm is used for the machine learning analysis of wrong words, comprising the following steps:
a. decision tree generation: a decision tree is generated from a wrong word sample set, where the sample set is a historical data set, assembled according to actual needs and integrated to a certain degree, that is used for data analysis and processing;
b. decision tree pruning: the decision tree generated in the previous stage is checked, corrected and revised; the preliminary rules produced during tree generation are tested against data from a new wrong word sample set, and branches that reduce prediction accuracy are pruned.
4. The intelligent wrong word recognition method based on a machine learning algorithm according to claim 1, characterized in that after the wrong word recognition management system has recognized wrong words and raised an alarm, the recognized wrong words are classified and uploaded to the wrong word library on the server, the wrong word library is updated, and a neural network algorithm performs a new round of learning and updating.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210137942.9A CN114638219A (en) | 2022-02-15 | 2022-02-15 | Intelligent wrong word recognition method based on machine learning algorithm |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210137942.9A CN114638219A (en) | 2022-02-15 | 2022-02-15 | Intelligent wrong word recognition method based on machine learning algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114638219A true CN114638219A (en) | 2022-06-17 |
Family
ID=81945836
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210137942.9A Pending CN114638219A (en) | 2022-02-15 | 2022-02-15 | Intelligent wrong word recognition method based on machine learning algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114638219A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679036A (en) * | 2017-10-12 | 2018-02-09 | 南京网数信息科技有限公司 | A kind of wrong word monitoring method and system |
CN108717408A (en) * | 2018-05-11 | 2018-10-30 | 杭州排列科技有限公司 | A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system |
CN111859093A (en) * | 2020-07-30 | 2020-10-30 | 中国联合网络通信集团有限公司 | Sensitive word processing method and device and readable storage medium |
WO2021184527A1 (en) * | 2020-03-19 | 2021-09-23 | 南京莱斯网信技术研究院有限公司 | Intelligent excavation system for sensitive information in public opinion information |
CN113673227A (en) * | 2021-07-15 | 2021-11-19 | 福建拓尔通软件有限公司 | Method, system, equipment and medium for correcting wrongly written characters of WEB editor |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116167603A (en) * | 2023-02-28 | 2023-05-26 | 科技日报社 | Method and system for monitoring full-media full-flow content |
CN116167603B (en) * | 2023-02-28 | 2023-09-26 | 科技日报社 | Method and system for monitoring full-media full-flow content |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110298032B (en) | Text classification corpus labeling training system | |
CN107491435B (en) | Method and device for automatically identifying user emotion based on computer | |
Kumar et al. | Text classification performance analysis on machine learning | |
CN104834940A (en) | Medical image inspection disease classification method based on support vector machine (SVM) | |
CN111274817A (en) | Intelligent software cost measurement method based on natural language processing technology | |
TWI734085B (en) | Dialogue system using intention detection ensemble learning and method thereof | |
CN112667806B (en) | Text classification screening method using LDA | |
CN113761903A (en) | Text screening method for high-volume high-noise spoken short text | |
CN110765285A (en) | Multimedia information content control method and system based on visual characteristics | |
CN111611774A (en) | Operation and maintenance operation instruction security analysis method, system and storage medium | |
CN113886562A (en) | AI resume screening method, system, equipment and storage medium | |
CN114638219A (en) | Intelligent wrong word recognition method based on machine learning algorithm | |
CN116578708A (en) | Paper data name disambiguation algorithm based on graph neural network | |
CN111754208A (en) | Automatic screening method for recruitment resumes | |
CN114579768A (en) | Maintenance method for realizing intelligent operation and maintenance knowledge base of equipment | |
Cheekati et al. | Telugu handwritten character recognition using deep residual learning | |
CN118245441A (en) | Industrial and commercial digital archive management system capable of being automatically classified | |
CN111191033A (en) | Open set classification method based on classification utility | |
CN117710996A (en) | Data extraction, classification and storage method of unstructured table document based on deep learning | |
CN108595568A (en) | A kind of text sentiment classification method based on very big unrelated multivariate logistic regression | |
Sudha | Semi supervised multi text classifications for telugu documents | |
CN115269855B (en) | Paper fine-grained multi-label labeling method and device based on pre-training encoder | |
CN117076455A (en) | Intelligent identification-based policy structured storage method, medium and system | |
Khan et al. | Analysis of Cursive Text Recognition Systems: A Systematic Literature Review | |
CN114490937A (en) | Comment analysis method and device based on semantic perception |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||