CN114638219A - Intelligent wrong word recognition method based on machine learning algorithm - Google Patents

Intelligent wrong word recognition method based on machine learning algorithm

Info

Publication number
CN114638219A
Authority
CN
China
Prior art keywords
words
wrong
wrong word
word
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210137942.9A
Other languages
Chinese (zh)
Inventor
赖贵全
唐宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Yida Shuan Technology Co ltd
Original Assignee
Chengdu Yida Shuan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Yida Shuan Technology Co ltd
Priority to CN202210137942.9A
Publication of CN114638219A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses an intelligent wrong-word recognition method based on a machine learning algorithm and belongs to the technical field of intelligent information recognition. It addresses the analysis errors that arise in wrong-word recognition when the recognition technology cannot automatically update its wrong-word library. The method comprises the following steps: (1) establishing a wrong-word recognition management system on an application platform; (2) the wrong-word recognition management system classifies the wrongly written characters found in the information application; (3) the wrong-word recognition management system connects to a server to establish a wrong-word library, and learns and records the words in that library; (4) the wrong-word recognition management system collects all published manuscripts to form a historical draft library, applies a neural network algorithm to recognize and learn from the historical draft library, and updates the wrong-word library with the results; (5) after artificial-intelligence recognition, an alarm is raised so that the flagged words can be manually verified and corrected. The invention is used for highly intelligent recognition of information manuscripts in various applications.

Description

Intelligent wrong word recognition method based on machine learning algorithm
Technical Field
The invention belongs to the technical field of intelligent information identification, and particularly relates to an intelligent wrong word identification method based on a machine learning algorithm.
Background
Character recognition is a technology for automatically recognizing characters with a computer and is an important application area of pattern recognition. People must process large volumes of text, reports, and documents in production and daily life. To reduce labor and improve processing efficiency, researchers began exploring general character recognition methods in the 1950s and developed the optical character recognizer. In the 1960s, practical machines using magnetic ink and special fonts were introduced. In the late 1960s, recognition machines for multiple typefaces and for handwritten characters appeared, with recognition accuracy and machine performance that could basically meet requirements, such as the handwritten-numeral recognizers and printed English-and-numeral recognizers used for letter sorting. In the 1970s, work focused on the basic theory of character recognition and on developing high-performance character recognition machines, and character recognition research received particular emphasis.
The character recognition generally includes several parts, such as collection of character information, analysis and processing of information, classification and discrimination of information, and the like.
The information collection is to convert the gray scale of the characters on the paper surface into electric signals and input the electric signals into a computer. The information collection is realized by a paper feeding mechanism and a photoelectric conversion device in the character recognition machine, and the photoelectric conversion device comprises a flying spot scanning device, a camera, a photosensitive element, a laser scanning device and the like.
The information analysis and processing is to eliminate various noises and interferences caused by printing quality, paper quality (uniformity, stain, etc.) or writing tools, etc., and to perform various normalization processes such as size, deflection, shading, thickness, etc., on the converted electric signals.
The information classification and judgment is to classify and judge the normalized text information after the noise is removed so as to output a recognition result.
At present, wrongly-written-character recognition in various apps and back-end management systems relies on fuzzy queries that match words in an article. This approach cannot automatically add entries to the wrong-word library or intelligently recognize the emotional tendency of an article, so it cannot detect errors in the contexts where sensitive words are used.
Disclosure of Invention
The object of the invention is as follows:
An intelligent wrong-word recognition method based on a machine learning algorithm is provided to solve the prior-art problem that wrong-word recognition in various apps produces analysis errors because the wrong-word library cannot be updated automatically.
The technical scheme adopted by the invention is as follows:
An intelligent wrong-word recognition method based on a machine learning algorithm comprises the following steps:
(1) establishing a wrong-word recognition management system on the application platform, the system being used to recognize and manage wrong words in the news media information app;
(2) the wrong-word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and commonly confused words of similar meaning, wherein numeral usage errors comprise case errors and numeric symbol errors;
(3) the wrong-word recognition management system connects to a server to establish a wrong-word library, and learns and records the words in it, wherein the wrong-word library comprises a personal-name lexicon, a position lexicon, and a wrongly-written-word lexicon;
(4) the wrong-word recognition management system collects all published manuscripts to form a historical draft library, applies a neural network algorithm to recognize and learn from the historical draft library, and updates the wrong-word library with the results;
(5) after a manuscript to be checked has been examined by artificial intelligence, the wrong-word recognition management system raises an alarm for each wrong word and submits the flagged words for manual verification and correction.
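Steps (2), (3), and (5) can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the disclosed implementation: the names `Category`, `WrongWordLibrary`, `Alarm`, and `check_manuscript` are hypothetical, the sample entries are invented English stand-ins, and a plain dictionary lookup takes the place of the neural-network learning of step (4).

```python
from dataclasses import dataclass, field
from enum import Enum


class Category(Enum):
    # The five categories named in step (2); the enum itself is hypothetical.
    PUNCTUATION = "punctuation"
    PERSON_NAME = "person name"
    POSITION = "position"
    NUMERAL_USAGE = "numeral usage"
    NEAR_SYNONYM = "near-synonym"


@dataclass
class WrongWordLibrary:
    # Maps a known wrong form to (category, suggested correction).
    entries: dict = field(default_factory=dict)

    def add(self, wrong, category, suggestion):
        self.entries[wrong] = (category, suggestion)


@dataclass
class Alarm:
    position: int        # character offset of the wrong form in the manuscript
    wrong: str
    category: Category
    suggestion: str


def check_manuscript(text, library):
    """Return one alarm per occurrence of a known wrong form, ordered by offset."""
    alarms = []
    for wrong, (category, suggestion) in library.entries.items():
        start = text.find(wrong)
        while start != -1:
            alarms.append(Alarm(start, wrong, category, suggestion))
            start = text.find(wrong, start + 1)
    return sorted(alarms, key=lambda a: a.position)


# Invented sample data: one personal-name entry and one position entry.
library = WrongWordLibrary()
library.add("Jhon Smith", Category.PERSON_NAME, "John Smith")
library.add("Cheif Editor", Category.POSITION, "Chief Editor")

alarms = check_manuscript("Cheif Editor Jhon Smith signed the report.", library)
for alarm in alarms:
    # Step (5): each alarm is handed to a human reviewer rather than auto-corrected.
    print(alarm.position, alarm.wrong, "->", alarm.suggestion, f"[{alarm.category.value}]")
```

The sketch returns alarms instead of rewriting the text, mirroring the requirement that flagged words go to manual verification.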
Further, in step (4), an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical draft library appear together. When a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded. During wrong-word recognition of a manuscript, when the wrong-word recognition management system finds a phrase that partly coincides with a recorded fixed phrase but differs from it in part, it raises an error alarm.
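The fixed-phrase check can be sketched as follows. This is a simplified illustration, assuming whitespace-separated tokens, fixed-length word n-grams, and a "differs in exactly one word" rule for partial overlap; the source specifies none of these, and the function names and thresholds are hypothetical.

```python
from collections import Counter


def learn_fixed_phrases(drafts, n=3, min_count=3):
    """Collect word n-grams that occur at least min_count times across the drafts."""
    counts = Counter()
    for draft in drafts:
        words = draft.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {gram for gram, count in counts.items() if count >= min_count}


def find_near_misses(text, fixed_phrases, n=3):
    """Flag n-grams that differ from a learned fixed phrase in exactly one word."""
    words = text.split()
    alarms = []
    for i in range(len(words) - n + 1):
        gram = tuple(words[i:i + n])
        if gram in fixed_phrases:
            continue  # exact match with a fixed phrase: nothing to report
        for phrase in fixed_phrases:
            mismatches = sum(a != b for a, b in zip(gram, phrase))
            if mismatches == 1:
                alarms.append((i, " ".join(gram), " ".join(phrase)))
    return alarms


# Invented historical drafts in which the same wording recurs three times.
history = ["the State Council issued a notice"] * 3
fixed = learn_fixed_phrases(history)

alarms = find_near_misses("the State Counsel issued a notice", fixed)
for index, found, expected in alarms:
    print(f"word {index}: '{found}' resembles fixed phrase '{expected}'")
```

For Chinese text the tokens would more plausibly be characters or segmenter output than whitespace-separated words; the counting and comparison logic would stay the same.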
Further, machine learning analysis of wrong words is performed with a decision tree algorithm, comprising the following steps:
a. generating the decision tree: a decision tree is generated from a wrong-word sample set, which is a historical data set, aggregated to the degree that practical needs require, used for data analysis and processing;
b. pruning the decision tree: the decision tree generated in the previous stage is checked, corrected, and revised; the preliminary rules produced during tree generation are checked against data from a new wrong-word sample set, and branches that reduce prediction accuracy are pruned.
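The two stages can be illustrated with a small, self-contained sketch. The source names neither the splitting criterion nor the pruning rule, so the choices here are assumptions: ID3-style entropy splitting for stage a, and reduced-error pruning against a held-out sample set for stage b. The feature names (`in_library`, `length`) and all sample rows are invented.

```python
import math
from collections import Counter


def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def build_tree(rows, labels, features):
    """Stage a: split on the feature with the lowest weighted child entropy."""
    if len(set(labels)) == 1 or not features:
        return {"leaf": majority(labels)}

    def split_entropy(feature):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[feature], []).append(label)
        return sum(len(g) / len(labels) * entropy(g) for g in groups.values())

    best = min(features, key=split_entropy)
    node = {"feature": best, "default": majority(labels), "children": {}}
    rest = [f for f in features if f != best]
    for value in {row[best] for row in rows}:
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*subset)
        node["children"][value] = build_tree(list(sub_rows), list(sub_labels), rest)
    return node


def predict(node, row):
    while "leaf" not in node:
        child = node["children"].get(row[node["feature"]])
        if child is None:
            return node["default"]  # unseen feature value: fall back to majority
        node = child
    return node["leaf"]


def accuracy(tree, rows, labels):
    return sum(predict(tree, r) == l for r, l in zip(rows, labels)) / len(labels)


def prune(node, root, rows, labels):
    """Stage b: bottom-up, replace a subtree by its majority leaf unless that
    lowers accuracy on the new (held-out) sample set."""
    if "leaf" in node:
        return
    for child in node["children"].values():
        prune(child, root, rows, labels)
    before = accuracy(root, rows, labels)
    saved = dict(node)
    node.clear()
    node["leaf"] = saved["default"]
    if accuracy(root, rows, labels) < before:
        node.clear()
        node.update(saved)  # the split was useful, so restore it


# Invented training set; the last row is deliberate label noise.
train_rows = [
    {"in_library": "yes", "length": "short"},
    {"in_library": "yes", "length": "long"},
    {"in_library": "yes", "length": "short"},
    {"in_library": "no", "length": "short"},
    {"in_library": "no", "length": "short"},
    {"in_library": "no", "length": "long"},
]
train_labels = ["wrong", "wrong", "wrong", "ok", "ok", "wrong"]

# Invented new sample set used only to check the preliminary rules.
check_rows = [
    {"in_library": "no", "length": "long"},
    {"in_library": "no", "length": "short"},
    {"in_library": "yes", "length": "long"},
    {"in_library": "yes", "length": "short"},
]
check_labels = ["ok", "ok", "wrong", "wrong"]

tree = build_tree(train_rows, train_labels, ["in_library", "length"])
accuracy_before = accuracy(tree, check_rows, check_labels)
prune(tree, tree, check_rows, check_labels)
accuracy_after = accuracy(tree, check_rows, check_labels)
print(accuracy_before, accuracy_after)
```

In this toy data the noisy training row causes a spurious split on `length`; checking against the new sample set removes that branch while the split on `in_library` survives.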
Further, after the wrong-word recognition management system recognizes wrong words and raises an alarm, the recognized wrong words are classified and uploaded to the wrong-word library on the server, the library is updated, and the neural network algorithm performs a new round of learning and updating.
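The update loop described here might look like the following sketch. The class `WrongWordServer`, its in-memory storage, and the `version` counter used to signal a new learning round are all hypothetical; the source does not describe the server interface or how retraining is triggered.

```python
from collections import defaultdict


class WrongWordServer:
    """Hypothetical stand-in for the server-side wrong-word library."""

    def __init__(self):
        self.library = defaultdict(set)  # category -> set of wrong forms
        self.version = 0                 # bumped whenever a new learning round is due

    def upload(self, recognized):
        """Add classified (category, wrong_form) pairs; return how many were new."""
        added = 0
        for category, wrong_form in recognized:
            if wrong_form not in self.library[category]:
                self.library[category].add(wrong_form)
                added += 1
        if added:
            self.version += 1  # signal that the model should be retrained
        return added


server = WrongWordServer()
batch = [("person name", "Jhon Smith"), ("position", "Cheif Editor")]
first = server.upload(batch)   # both entries are new
second = server.upload(batch)  # nothing new, so no retraining is signalled
print(first, second, server.version)
```

Tying retraining to a version bump keeps repeated uploads of the same words from starting redundant learning rounds.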
In summary, by adopting the above technical scheme, the invention has the following beneficial effects:
1. Building on the prior art, the invention uses artificial intelligence recognition to continuously learn, update, and recognize the wrong-word lexicon, so that lexicon content is added automatically, and it intelligently recognizes the emotional tendency of an article; by analyzing errors in the contexts where sensitive words are used, it raises more precise wrong-word alarms.
2. Unlike fuzzy word matching, the method analyzes the exact context in which a word is used, so that a word can be flagged when its context is wrong even if the word itself is written correctly. This greatly raises the intelligence of wrong-word recognition and makes the work of writers and editors easier.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
An intelligent wrong-word recognition method based on a machine learning algorithm comprises the following steps:
(1) establishing a wrong-word recognition management system on the application platform, the system being used to recognize and manage wrong words in the news media information app;
(2) the wrong-word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and commonly confused words of similar meaning, wherein numeral usage errors comprise case errors and numeric symbol errors;
(3) the wrong-word recognition management system connects to a server to establish a wrong-word library, and learns and records the words in it, wherein the wrong-word library comprises a personal-name lexicon, a position lexicon, and a wrongly-written-word lexicon;
(4) the wrong-word recognition management system collects all published manuscripts to form a historical draft library, applies a neural network algorithm to recognize and learn from the historical draft library, and updates the wrong-word library with the results;
(5) after a manuscript to be checked has been examined by artificial intelligence, the wrong-word recognition management system raises an alarm for each wrong word and submits the flagged words for manual verification and correction.
After the wrong-word recognition management system recognizes wrong words and raises an alarm, the recognized wrong words are classified and uploaded to the wrong-word library on the server, the library is updated, and the neural network algorithm performs a new round of learning and updating.
Compared with the prior art, the machine-learned neural network recognizes higher-level, more abstract features, so high-level error characteristics can be learned automatically without manually defining a training set; wrong words are thereby recognized intelligently and the error correction rate is greatly improved.
In step (4), an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical draft library appear together. When a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded. During wrong-word recognition of a manuscript, when the wrong-word recognition management system finds a phrase that partly coincides with a recorded fixed phrase but differs from it in part, it raises an error alarm.
Unlike fuzzy word matching, the method analyzes the exact context in which a word is used, so that a word can be flagged when its context is wrong even if the word itself is written correctly, which greatly raises the intelligence of wrong-word recognition.
Preferably, machine learning analysis of wrong words is performed with a decision tree algorithm, comprising the following steps:
a. generating the decision tree: a decision tree is generated from a wrong-word sample set, which is a historical data set, aggregated to the degree that practical needs require, used for data analysis and processing;
b. pruning the decision tree: the decision tree generated in the previous stage is checked, corrected, and revised; the preliminary rules produced during tree generation are checked against data from a new wrong-word sample set, and branches that reduce prediction accuracy are pruned.
Example 2
An intelligent wrong-word recognition method based on a machine learning algorithm comprises the following steps:
(1) establishing a wrong-word recognition management system on the application platform, the system being used to recognize and manage wrong words in the news media information app;
(2) the wrong-word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and commonly confused words of similar meaning, wherein numeral usage errors comprise case errors and numeric symbol errors;
(3) the wrong-word recognition management system connects to a server to establish a wrong-word library, and learns and records the words in it, wherein the wrong-word library comprises a personal-name lexicon, a position lexicon, and a wrongly-written-word lexicon;
(4) the wrong-word recognition management system collects all published manuscripts to form a historical draft library, applies a neural network algorithm to recognize and learn from the historical draft library, and updates the wrong-word library with the results;
(5) after a manuscript to be checked has been examined by artificial intelligence, the wrong-word recognition management system raises an alarm for each wrong word and submits the flagged words for manual verification and correction.
After the wrong-word recognition management system recognizes wrong words and raises an alarm, the recognized wrong words are classified and uploaded to the wrong-word library on the server, the library is updated, and the neural network algorithm performs a new round of learning and updating.
In step (4), an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical draft library appear together. When a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded. During wrong-word recognition of a manuscript, when the wrong-word recognition management system finds a phrase that partly coincides with a recorded fixed phrase but differs from it in part, it raises an error alarm.
Preferably, machine learning analysis of wrong words is performed with a naive Bayes algorithm. Naive Bayes is a classification method; it is not a single algorithm but a family of algorithms that share a common principle: each feature used for classification is assumed to be independent of the value of every other feature.
The Bayesian method combines prior and posterior probabilities, which avoids the subjective bias of relying on the prior alone and the over-fitting that comes from relying on sample information alone. Bayesian classification shows relatively high accuracy on large data sets, and the algorithm itself is fairly simple.
The naive Bayes method simplifies the Bayesian algorithm by assuming that the attributes are conditionally independent of one another given the target value. In other words, no attribute variable is given a larger or a smaller weight in the decision result than any other. Although this simplification reduces classification quality to some extent, in practical applications it greatly reduces the complexity of the Bayesian method.
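A compact sketch of such a classifier is given below. The source does not specify which naive Bayes variant is used, so the multinomial form with add-one (Laplace) smoothing is an assumption, as are the class name `NaiveBayes`, the use of neighbouring words as features, and the invented English training samples.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with additive smoothing (illustrative only)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocabulary = set()

    def fit(self, samples, labels):
        for features, label in zip(samples, labels):
            self.class_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)
        return self

    def log_posteriors(self, features):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior of the class
            denom = sum(self.feature_counts[label].values()) + self.alpha * vocab_size
            for feature in features:
                if feature not in self.vocabulary:
                    continue  # features never seen in training carry no evidence
                # Independence assumption: per-feature log likelihoods simply add up.
                score += math.log((self.feature_counts[label][feature] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, features):
        scores = self.log_posteriors(features)
        return max(scores, key=scores.get)


# Invented samples: the words of a short phrase, labelled by whether it was an error.
samples = [
    ["state", "counsel", "issued"],
    ["state", "counsel", "announced"],
    ["state", "council", "issued"],
    ["state", "council", "announced"],
    ["legal", "counsel", "advised"],
]
labels = ["wrong", "wrong", "ok", "ok", "ok"]

model = NaiveBayes().fit(samples, labels)
flagged = model.predict(["state", "counsel", "issued"])
accepted = model.predict(["legal", "counsel", "advised"])
print(flagged, accepted)
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied, and the smoothing term keeps a feature that is absent from one class from forcing that class's probability to zero.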
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (4)

1. An intelligent wrong-word recognition method based on a machine learning algorithm, characterized by comprising the following steps:
(1) establishing a wrong-word recognition management system on the application platform, the system being used to recognize and manage wrong words in the news media information app;
(2) the wrong-word recognition management system classifies the wrong words in the information application into the following categories: punctuation marks, personal names, positions, numeral usage, and commonly confused words of similar meaning, wherein numeral usage errors comprise case errors and numeric symbol errors;
(3) the wrong-word recognition management system connects to a server to establish a wrong-word library, and learns and records the words in it, wherein the wrong-word library comprises a personal-name lexicon, a position lexicon, and a wrongly-written-word lexicon;
(4) the wrong-word recognition management system collects all published manuscripts to form a historical draft library, applies a neural network algorithm to recognize and learn from the historical draft library, and updates the wrong-word library with the results;
(5) after a manuscript to be checked has been examined by artificial intelligence, the wrong-word recognition management system raises an alarm for each wrong word and submits the flagged words for manual verification and correction.
2. The intelligent wrong-word recognition method based on a machine learning algorithm according to claim 1, characterized in that, in step (4), an artificial intelligence algorithm analyzes the contexts in which commonly used words in the historical draft library appear together; when a combination has been used often enough to qualify as a fixed phrase, it is learned and recorded; and during wrong-word recognition of a manuscript, when the wrong-word recognition management system finds a phrase that partly coincides with a recorded fixed phrase but differs from it in part, it raises an error alarm.
3. The intelligent wrong-word recognition method based on a machine learning algorithm according to claim 1, characterized in that machine learning analysis of wrong words is performed with a decision tree algorithm, comprising the following steps:
a. generating the decision tree: a decision tree is generated from a wrong-word sample set, which is a historical data set, aggregated to the degree that practical needs require, used for data analysis and processing;
b. pruning the decision tree: the decision tree generated in the previous stage is checked, corrected, and revised; the preliminary rules produced during tree generation are checked against data from a new wrong-word sample set, and branches that reduce prediction accuracy are pruned.
4. The intelligent wrong-word recognition method based on a machine learning algorithm according to claim 1, characterized in that, after the wrong-word recognition management system recognizes wrong words and raises an alarm, the recognized wrong words are classified and uploaded to the wrong-word library on the server, the library is updated, and the neural network algorithm performs a new round of learning and updating.
CN202210137942.9A 2022-02-15 2022-02-15 Intelligent wrong word recognition method based on machine learning algorithm Pending CN114638219A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210137942.9A CN114638219A (en) 2022-02-15 2022-02-15 Intelligent wrong word recognition method based on machine learning algorithm

Publications (1)

Publication Number Publication Date
CN114638219A true CN114638219A (en) 2022-06-17

Family

ID=81945836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210137942.9A Pending CN114638219A (en) 2022-02-15 2022-02-15 Intelligent wrong word recognition method based on machine learning algorithm

Country Status (1)

Country Link
CN (1) CN114638219A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679036A (en) * 2017-10-12 2018-02-09 南京网数信息科技有限公司 A kind of wrong word monitoring method and system
CN108717408A (en) * 2018-05-11 2018-10-30 杭州排列科技有限公司 A kind of sensitive word method for real-time monitoring, electronic equipment, storage medium and system
WO2021184527A1 (en) * 2020-03-19 2021-09-23 南京莱斯网信技术研究院有限公司 Intelligent excavation system for sensitive information in public opinion information
CN111859093A (en) * 2020-07-30 2020-10-30 中国联合网络通信集团有限公司 Sensitive word processing method and device and readable storage medium
CN113673227A (en) * 2021-07-15 2021-11-19 福建拓尔通软件有限公司 Method, system, equipment and medium for correcting wrongly written characters of WEB editor

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116167603A (en) * 2023-02-28 2023-05-26 科技日报社 Method and system for monitoring full-media full-flow content
CN116167603B (en) * 2023-02-28 2023-09-26 科技日报社 Method and system for monitoring full-media full-flow content

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination