CN111984845B - Website wrongly written word recognition method and system - Google Patents

Website wrongly written word recognition method and system

Info

Publication number
CN111984845B
CN111984845B (application CN202010826076.5A)
Authority
CN
China
Prior art keywords
word
sentences
kenlm
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010826076.5A
Other languages
Chinese (zh)
Other versions
CN111984845A (en)
Inventor
邬鹏程
陈可义
邹林杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Baida Wisdom Network Technology Co ltd
Original Assignee
Jiangsu Baida Wisdom Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Baida Wisdom Network Technology Co ltd filed Critical Jiangsu Baida Wisdom Network Technology Co ltd
Priority to CN202010826076.5A priority Critical patent/CN111984845B/en
Publication of CN111984845A publication Critical patent/CN111984845A/en
Application granted granted Critical
Publication of CN111984845B publication Critical patent/CN111984845B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a website wrongly written word recognition method and system, comprising the following steps: for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and crawling-page address rules to obtain the source code of all sub-pages within the crawling depth; extracting text from the grabbed source code; segmenting long sentences in the text into short sentences with regular expressions and word-segmenting the short sentences; and scoring each word with a pre-trained kenlm model, identifying the wrongly written words contained in all sub-pages of the website according to the scoring results. The application is convenient to use: the user only needs to provide the website address to be monitored, and the wrongly written words on the site's sub-pages are identified on demand.

Description

Website wrongly written word recognition method and system
Technical Field
The application belongs to the technical field of artificial-intelligence natural language processing, and particularly relates to a website wrongly written word recognition method and system.
Background
Products currently on the market providing wrongly written word recognition services include those of Baidu, NetEase, and others. They focus on different areas, and each has its own strengths and weaknesses. In some business scenarios, adopting only one service cannot meet the accuracy requirements for the data; introducing all of them brings inconsistent data formats that must be cleaned, and high purchase costs. At present, each major platform only provides interface services, mainly aimed at plain text, and the text length is limited. They cannot perform wrongly written word identification directly on a web page, because the HTML code tags mixed into the page affect the judgment result.
Disclosure of Invention
The application aims to solve the technical problems in the prior art and provides a website wrongly written word recognition method.
In order to achieve the technical purpose, the application adopts the following technical scheme.
On the one hand, the application provides a website wrongly written word recognition method, which comprises the following steps:
for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and crawling-page address rules to obtain the source code of all sub-pages meeting the crawling depth requirement; obtaining text from the grabbed source code; segmenting long sentences in the text into short sentences with regular expressions; and word-segmenting the short sentences;
and scoring the short sentences with the pre-trained kenlm model, and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results.
Further, the method for obtaining text from the grabbed source code comprises the following steps:
the distributed crawler grabs the website source code and stores the result into Kafka;
the data in Kafka are cleaned with Logstash and then stored in Elasticsearch;
and the page tags in the page source code are removed with regular expressions to extract the text.
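As a minimal sketch of this extraction step (assuming plain regular expressions only, without the BeautifulSoup-assisted pipeline described later), the tag removal might look like:

```python
import re

def extract_text(html: str) -> str:
    """Strip script/style blocks and HTML tags from page source,
    keeping only the visible text (simplified sketch of the
    regular-expression extraction step)."""
    # remove <script> and <style> blocks entirely, including content
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    # remove all remaining tags
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    # collapse runs of whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()
```

In the full pipeline described below, BeautifulSoup would extract text first and regular expressions would clean up any remaining tags; this sketch shows only the regex half.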
Further, the training method of the kenlm model comprises the following steps:
dividing the text of the training text according to the paragraphs and punctuations, dividing the page into a plurality of sentences conforming to the length requirements of the selected manufacturer interfaces, and distributing the sentences to the manufacturer interfaces;
respectively acquiring analysis results of sentences in the same page returned by interfaces of various manufacturers, and respectively integrating the analysis results to obtain wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each manufacturer to obtain a training corpus, and training the training corpus to obtain a kenlm model.
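The paragraph-and-punctuation splitting step above can be sketched as follows (the exact punctuation set is an illustrative assumption):

```python
import re

def split_sentences(text: str):
    """Split training text into short sentences on paragraph breaks
    and Chinese/Western sentence-ending punctuation, dropping empty
    fragments (sketch of the corpus-preparation splitting step)."""
    parts = re.split(r"[\n。！？；!?;]+", text)
    return [p.strip() for p in parts if p.strip()]
```

Each resulting short sentence would then be word-segmented and written one per line to form the kenlm training corpus.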
Still further, the training method of the kenlm model also includes retraining the kenlm model at a set time, and the specific retraining method includes:
correcting the errors in the training text based on the analysis results, returned by the manufacturers' interfaces, for the sentences of the same page; adding the corrected text to the training corpus; and retraining on the corrected corpus to obtain a new kenlm model.
Further, the step of correcting the wrongly written words after identifying the wrongly written words contained in all the sub-pages of the website comprises the following steps:
according to the pre-acquired common Chinese word data set, homophone data set, and confusion-word data set, the homophones, near-form characters, and confusion words of each identified wrongly written word are substituted into the original sentence, PPL is computed, and the candidate giving the minimum PPL replaces the original. PPL is computed as follows:
PPL(S) = P(w_1 w_2 ... w_n)^(-1/n)
Taking the logarithm of both sides:
ln PPL(S) = -(1/n) Σ_{i=1}^{n} ln p(w_i)
where S is the sentence, w_i is the i-th word, p(w_i) is the probability of the i-th word, and PPL (perplexity) is a metric for evaluating the quality of a language model.
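A minimal numeric sketch of the perplexity computation, assuming the per-word probabilities p(w_i) have already been produced by a language model (the probability values below are illustrative):

```python
import math

def perplexity(word_probs):
    """PPL(S) = P(w_1 .. w_n)^(-1/n), computed in log space for
    numerical stability from a list of per-word probabilities."""
    n = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / n)
```

For example, a two-word sentence whose words each have probability 0.25 has perplexity 4.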
Further, error types that are identical across the analysis results returned by the manufacturers' interfaces for sentences of the same page are added to the homophone data set, the near-form character data set, and the confusion-word data set respectively, obtaining updated versions of those data sets.
In a second aspect, the application provides a website wrongly written word recognition system, comprising: a main domain name input module, a source code crawling module, a word segmentation module, and a wrongly written word recognition module;
the main domain name input module is used for inputting a main domain name;
the source code crawling module is used for crawling a specific main domain name address by utilizing a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain source codes of all sub pages meeting the crawling depth requirement;
the word segmentation module is used for obtaining text from the grabbed source code, segmenting long sentences in the text into short sentences with regular expressions, and word-segmenting the short sentences;
the wrongly written word recognition module is used for scoring the short sentences with the pre-trained kenlm model and identifying, according to the scoring results, the wrongly written words contained in all sub-pages of the website.
Further, the system also comprises a kenlm model training module and an interface module, where the kenlm model training module is used to train the kenlm model and the interface module is used to connect to each manufacturer's input interface. The kenlm model training module calls the word segmentation module to split the training text by paragraphs and punctuation, divides each page into a number of sentences meeting the length requirements of the selected manufacturer interfaces, and distributes the sentences to the manufacturers' interfaces through the interface module;
the kenlm model training module obtains, through the interface module, the analysis results returned by each manufacturer's interface for the sentences of the same page, and integrates each manufacturer's results to obtain the wrongly written words of the whole page;
and the wrongly written word formats determined by each manufacturer are cleaned to obtain the training corpus, on which the kenlm model is trained.
The present application also provides a computer-readable storage medium storing machine-readable instructions that, when executed by a machine, cause the machine to perform the website wrongly written word recognition method provided in any one of the embodiments of the first aspect.
The beneficial technical effects obtained by the application are as follows:
1. the application is convenient to use: the user only needs to provide the website address to be monitored, and the wrongly written words on the site's sub-pages are identified on demand;
2. it helps the user save cost: paying only the platform detection fee, the user can select a designated manufacturer's product service without purchasing that service outright, avoiding unnecessary purchases, saving development and usage cost, and reducing resource waste;
3. detection accuracy is improved: summarizing the results of multiple manufacturers reduces false alarms and improves result accuracy;
4. the distributed crawler can specify the crawling depth and crawling-page URL rules, improving acquisition speed; the controllable acquisition range reduces unnecessary page grabbing and page parsing, improving the efficiency of the whole task;
5. the results returned by the platform are stored in its own database and used as the training data set for the self-developed wrongly written word recognition algorithm; as the number of tasks grows, the training set gradually expands, which improves the accuracy of the platform's own model training;
6. the multi-engine crawler ensures that pages are captured after their JavaScript has loaded, so both the page source code and the fully loaded page text can be processed;
7. Kafka + ELK guarantees the efficiency and accuracy of data warehousing.
Drawings
FIG. 1 is a full text segmentation process according to an embodiment of the present application;
FIG. 2 is a flow chart of the overall process of the present application.
Detailed Description
The application is further described below with reference to the drawings.
The first embodiment of the method for identifying the wrongly written characters of the website comprises the following steps:
for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and crawling-page address rules to obtain the source code of all sub-pages meeting the crawling depth requirement; obtaining text from the grabbed source code; segmenting long sentences in the text into short sentences with regular expressions; and word-segmenting the short sentences;
and scoring the short sentences with the pre-trained kenlm model, and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results.
The specific method for scoring short sentences with the kenlm model in this embodiment is as follows:
Bi-gram: pad data at both ends of the sentence so that the scores of the first and last words can be computed, and score each word together with its preceding word;
Tri-gram: likewise pad the head and tail, and score each word together with its two preceding words;
take sentence_score as the average of the Bi-gram and Tri-gram values at each corresponding position;
take the median sentence_score_median of sentence_score;
compute the absolute value of the difference between each score in sentence_score and sentence_score_median, obtaining the absolute-difference data margin_median:
margin_median = [|sentence_score[1] - sentence_score_median|,
|sentence_score[2] - sentence_score_median|, ...,
|sentence_score[n] - sentence_score_median|]
take the median median_abs_deviation of the absolute differences;
y_score = 0.6745 × margin_median ÷ median_abs_deviation
Notes:
n-gram: a model in which the occurrence of a word depends on the n-1 words preceding it;
n = 2 gives the Bi-gram (binary) model;
n = 3 gives the Tri-gram (ternary) model;
the probability of sentence S occurring is the product of the probabilities of its n words:
P(S) = P(w_1 w_2 w_3 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_n|w_1 w_2 ... w_{n-1})
By the Markov assumption, the probability of a word depends only on a limited number of the words in front of it, giving the shorthand
P(S) ≈ P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1})
S(w_{i-1} w_i): score of the i-th word in the Bi-gram model;
avgscore_bi_gram: average score of the i-th word and the previous word;
S(w_{i-2} w_{i-1} w_i): score of the i-th word in the Tri-gram model;
avgscore_tri_gram: average score of the i-th word and the two preceding words;
sentence_score: average of avgscore_bi_gram and avgscore_tri_gram;
sentence_score_median: the median of sentence_score;
margin_median: the absolute difference between each score in sentence_score and sentence_score_median;
median_abs_deviation: the median of margin_median.
The score of each word is obtained in this way, and words whose score exceeds the threshold are identified as wrongly written words. Optionally, the identified wrongly written words are stored in a database and displayed.
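The y_score computation above (a modified z-score built on the median absolute deviation of the per-word n-gram scores) can be sketched as follows; the input list of per-word scores is assumed to have already been produced by the kenlm model:

```python
import statistics

def suspect_scores(sentence_score):
    """Modified z-score: deviation of each word's averaged n-gram
    score from the sentence median, scaled by the median absolute
    deviation. 0.6745 is the standard normalizing constant that makes
    the MAD comparable to a standard deviation for normal data."""
    med = statistics.median(sentence_score)
    margin = [abs(s - med) for s in sentence_score]      # margin_median
    mad = statistics.median(margin)                      # median_abs_deviation
    return [0.6745 * m / mad for m in margin]            # y_score per word
```

A word whose y_score exceeds the chosen threshold is flagged as a suspected wrongly written word; in the toy input below the outlier score 10 receives by far the largest y_score.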
In the specific embodiment, the distributed crawler grabs the website source code, stores the result into Kafka, and then stores it into Elasticsearch via Logstash. Kafka is used for peak shaving, reducing data-processing and warehousing pressure under high concurrency; Logstash performs data cleaning and transmission, sending the data in Kafka to Elasticsearch after preliminary cleaning. Tags in the page source code are removed with regular expressions, BeautifulSoup, and the like, and the text is extracted: BeautifulSoup takes the text out of the source code directly, and regular expressions then remove any remaining tags.
Word segmentation of the extracted text uses existing tools such as jieba; existing Chinese word-segmentation tools are not described further in this application.
The terms appearing are explained as follows:
Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of consumers in a website.
Logstash is a free and open server-side data-processing pipeline that can collect data from multiple sources, transform it, and then send it to a "repository" of your choice.
Elasticsearch is a distributed, highly scalable, near-real-time search and data-analysis engine;
Beautiful Soup is a Python library that can extract data from HTML or XML files;
kenlm is a statistical language model tool; it is fast and memory-efficient and can use multi-core processors.
In a second embodiment, based on the first embodiment, the method includes training the kenlm model, specifically including the following steps: splitting the training text by paragraphs and punctuation, dividing each page into a number of sentences meeting the length requirements of the selected manufacturer interfaces, and distributing the sentences to the manufacturers' interfaces;
respectively acquiring analysis results of sentences in the same page returned by interfaces of various manufacturers, and respectively integrating the analysis results to obtain wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each manufacturer to obtain a training corpus, and training the training corpus to obtain a kenlm model.
The training text is obtained as follows: the distributed crawler grabs the website source code, stores the result into Kafka, and stores it into Elasticsearch via Logstash. Kafka is used for peak shaving, reducing data-processing and warehousing pressure under high concurrency; Logstash performs data cleaning and transmission, sending the data in Kafka to Elasticsearch after preliminary cleaning;
tags in the page source code are removed with regular expressions and tools such as BeautifulSoup to extract the text: BeautifulSoup takes the text out of the source code directly, and regular expressions then remove any remaining tags.
According to each manufacturer's interface documentation, the text is split by paragraphs, semantics, and contextual semantics; the page text is divided into sentences meeting the interface length requirements, and the sentences are distributed to the manufacturers' interfaces.
For example, take as the page text a single classical Chinese paragraph, the opening passage of Cao Zhi's 《洛神赋》 ("Ode to the Goddess of the Luo"), which recounts the author's return from the capital across the Luo River and the legend of the river goddess Mifei (宓妃).
Vendor 1 interface limit length is 30:
If the text is divided by paragraphs, this passage forms only one paragraph, and its word count exceeds the interface limit.
Divided by punctuation, the independent sentences have lengths 15, 15, 16, 4, 10, 16, 10, 26, 12, 20, 9, 21, 5, 36 respectively. Sorted as 4, 5, 9, 10, 10, 12, 15, 15, 16, 16, 20, 21, 26, 36 and recombined under the 30-character limit, they yield pairs such as (26, 4), (21, 9), (20, 10), (16, 10). The last sentence, of length 36, exceeds the word-count limit; since it is semantically complete and has contextual dependencies, its text is analyzed separately: the several clauses it contains are split apart, and the interface is then called.
As another example, suppose manufacturer 1 limits an interface request to at most 50 words, manufacturer 2 to at most 100, and the source code of a certain page contains 1000 words of text. Calling manufacturer 1's interface then requires splitting the text into 20 segments of length 50 (this assumes each 50-word segment is a complete sentence; in practice the word count per segment must be smaller, so more than 20 segments result).
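A sketch of packing short sentences into interface-sized requests (greedy packing is an assumption here; the embodiment also describes sorting sentence lengths and pairing them):

```python
def pack_sentences(sentences, limit):
    """Greedily combine consecutive short sentences so each request
    stays within a manufacturer interface's character limit. A single
    sentence longer than the limit is kept whole for separate handling,
    matching the embodiment's special case."""
    batches, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > limit:
            batches.append(current)
            current = s
        else:
            current += s
    if current:
        batches.append(current)
    return batches
```

Each batch would then be sent as one request to the manufacturer's interface.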
The wrongly written word formats determined by each manufacturer must be cleaned because the data formats the manufacturers return are inconsistent, and different methods are needed to convert each manufacturer's returned results into a common format. For example, Baidu returns one structure and another manufacturer returns a different one, while this implementation warehouses a single unified structure. Only by cleaning the data into that format can unified warehousing be guaranteed; the detailed cleaning process depends on each return format, and cleaning the manufacturer-determined wrongly written word formats is a basic and indispensable step.
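A hypothetical sketch of this format-cleaning step. The per-manufacturer field names below are invented placeholders (the actual return schemas are not shown in the text); only the idea of mapping every schema onto one warehousing record is taken from the description:

```python
def normalize_result(vendor, raw):
    """Map one manufacturer's response record onto a common
    {wrong, correct, position} structure. Field names per vendor
    are illustrative assumptions, not real API schemas."""
    if vendor == "baidu":
        return {"wrong": raw["ori_frag"],
                "correct": raw["correct_frag"],
                "position": raw["begin_pos"]}
    # fallback mapping for a second, differently shaped schema
    return {"wrong": raw["error"],
            "correct": raw["suggest"],
            "position": raw["offset"]}
```

Every record entering the warehouse then has the same keys regardless of which interface produced it.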
Recomputing the position of each wrongly written word is necessary because the text extracted from the page source code is usually far longer than the interface length limit, so one page's text must be split into several segments, each calling the same manufacturer's interface. The position of a wrongly written word in the data returned by each call is its position within that segment, not within the whole text, so the position must be recalculated. The cleaning problem above is also involved: some manufacturers ignore punctuation when computing character positions (in "I love you, China", "China" is at positions 3-4), some count punctuation ("China" is at positions 4-5), and some start counting from 1. In the example above, the text is split into 20 segments for manufacturer 1; the interface results for those 20 segments are then integrated to obtain that manufacturer's analysis of all the text of the whole page. Integrating one manufacturer's results here does not mean integrating all manufacturers' results, as shown in the flow chart of Fig. 2.
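The position recalculation can be sketched as follows; the one_based flag models manufacturers that count from 1, and the punctuation-counting difference is deliberately left out of this simplified sketch:

```python
def to_page_position(segment_offset, pos_in_segment, one_based=False):
    """Convert an error position returned for one text segment into
    its position within the full page text. segment_offset is where
    the segment starts in the page text (0-based); one_based marks
    manufacturers whose returned positions start at 1."""
    pos = pos_in_segment - (1 if one_based else 0)
    return segment_offset + pos
```

For a segment starting at page offset 100, a 0-based in-segment position 5 and a 1-based in-segment position 6 both denote page position 105.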
In a specific embodiment, the method further comprises retraining the kenlm model at a set time, and the retraining method specifically comprises the following steps: correcting the errors in the training text based on the analysis results, returned by the manufacturers' interfaces, for the sentences of the same page; adding the corrected text to the training corpus; and retraining on the corrected corpus to obtain a new kenlm model.
In a specific embodiment, each manufacturer's recognition results for the same page are returned through the interface, and the returned results for the sentences of the same page are integrated to obtain the wrongly written words of the whole page; all manufacturers' results are then integrated, and the error types that are identical across the manufacturers' results are added to the homophone, near-form character, and related data sets.
Optionally, the returned results of each manufacturer are subjected to format arrangement and put in storage for subsequent result display.
On the basis of the kenlm model, a common Chinese word data set, a homophone data set, and a confusion-word data set are obtained. Long sentences are split into short sentences with regular expressions, and each short sentence is checked for words from the confusion set; if such a word is present, it is added to a suspected-error dictionary (with type "confusion"). After word segmentation, any word not found in the common-word data set is also added to the suspected-error dictionary. Suspected wrongly written words in the sentence are then detected with the kenlm model according to the following formulas.
Bi-gram: pad data at both ends of the sentence so that the scores of the first and last words can be computed, and score each word together with its preceding word;
Tri-gram: likewise pad the head and tail, and score each word together with its two preceding words;
take sentence_score as the average of the Bi-gram and Tri-gram values at each corresponding position;
take the median sentence_score_median of sentence_score;
compute the absolute value of the difference between each score in sentence_score and sentence_score_median, obtaining the absolute-difference data margin_median:
margin_median = [|sentence_score[1] - sentence_score_median|,
|sentence_score[2] - sentence_score_median|, ...,
|sentence_score[n] - sentence_score_median|]
take the median median_abs_deviation of the absolute differences;
y_score = 0.6745 × margin_median ÷ median_abs_deviation
Through the above method, a common Chinese word data set, a homophone data set, and a confusion-word data set can be obtained while the web pages' wrongly written words are being identified. These data sets are stored in the platform's own database and used as the training data set for the self-developed wrongly written word recognition algorithm; as the number of tasks grows, the training set gradually expands, which improves the accuracy of the platform's own model training, and the data sets can also be sold as professional data sets to recoup the platform's development cost.
The method provided by the application is applicable to all mainstream wrongly written word retrieval tools, and recognition accuracy is improved through the autonomous algorithm. By integrating the mainstream wrongly written word recognition interfaces on the market and cleaning and integrating their respective analysis results, the security of user data is guaranteed and only the comprehensive analysis result is returned to the user.
The application can accurately extract web-page text, automatically split it according to the different word-count limits of each mainstream platform, and dynamically integrate the split results. The application further displays each manufacturer's results together with the recognition results of the method provided herein, realizing a summary display of website analysis results for customized users.
In a third embodiment, after the wrongly written words contained in all sub-pages of the website are identified, a step of correcting them is performed:
according to the pre-acquired, up-to-date common Chinese word data set, homophone data set, and confusion-word data set, the homophones and near-form characters of each identified wrongly written word are substituted into the original sentence, PPL is computed, and the candidate giving the minimum PPL replaces the original. PPL is computed as follows:
PPL(S) = P(w_1 w_2 ... w_n)^(-1/n)
Taking the logarithm of both sides:
ln PPL(S) = -(1/n) Σ_{i=1}^{n} ln p(w_i)
where S is the sentence, w_i is the i-th word, p(w_i) is the probability of the i-th word, and PPL (perplexity) is a metric for evaluating the quality of a language model.
PPL-based error correction builds on the error-prone words recognized by the kenlm model above. The method is to obtain all homophones, near-form characters, and error-prone variants of the wrongly written word, substitute each into the original sentence, and compute the PPL of each substituted sentence; the word giving the minimum PPL is the correction. The probability of each word is obtained from the kenlm model.
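A sketch of the minimum-PPL correction, with the PPL scoring function supplied by the caller (in the described system it would be backed by the kenlm model; here a stand-in dictionary plays that role in the test):

```python
def best_correction(sentence, target, candidates, ppl):
    """Substitute each homophone/near-form candidate for the suspected
    wrongly written word `target` and return the substituted sentence
    with the lowest perplexity, as judged by the `ppl` function."""
    options = [sentence.replace(target, c) for c in candidates]
    return min(options, key=ppl)
```

In practice `ppl` would wrap a trained language model's scoring of the whole sentence; any candidate set and scorer can be plugged in.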
In a fourth embodiment, corresponding to the website wrongly written word recognition method provided above, a website wrongly written word recognition system is provided, comprising: a main domain name input module, a source code crawling module, a word segmentation module, and a wrongly written word recognition module;
the main domain name input module is used for inputting a main domain name;
the source code crawling module is used for crawling a specific main domain name address by utilizing a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain source codes of all sub pages meeting the crawling depth requirement;
the word segmentation module is used for extracting text characters from the crawled source code, segmenting long sentences in the text into short sentences by using regular expressions, and performing word segmentation on the short sentences;
the wrongly written word recognition module is used for scoring the short sentences with a pre-trained kenlm model and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results.
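The splitting step performed by the word segmentation module can be sketched with a regular expression. The exact pattern used in practice is not disclosed, so the one below is an assumption covering common Chinese sentence-final punctuation:

```python
import re

# Assumed set of sentence-ending punctuation marks; extend as needed.
SENT_END = re.compile(r"[。！？；!?;]+")

def split_sentences(text):
    """Cut a long run of page text into short sentences at
    sentence-final punctuation, dropping empty fragments."""
    return [p.strip() for p in SENT_END.split(text) if p.strip()]

print(split_sentences("网站内容很多。错别字识别很重要！欢迎访问"))
# → ['网站内容很多', '错别字识别很重要', '欢迎访问']
```

The resulting short sentences are then word-segmented before scoring; keeping them short also makes it easy to respect the per-request length limits of the vendor interfaces.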
Further, on the basis of the fourth embodiment, the system also includes a kenlm model training module and an interface module, where the kenlm model training module is used to train the kenlm model and the interface module is used to connect to each vendor's input interface. The kenlm model training module calls the word segmentation module to split the training text by paragraphs and punctuation, divides each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributes the sentences to the vendor interfaces through the interface module;
the kenlm model training module then obtains, through the interface module, the analysis results returned by each vendor interface for the sentences of the same page, and integrates these results to obtain the wrongly written words of the whole page;
finally, the wrongly written word formats in the pages determined by the vendors are cleaned to obtain a training corpus, and the kenlm model is trained on this corpus.
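kenlm itself is trained offline (typically with its `lmplz` tool) on a one-sentence-per-line, space-separated corpus. The sketch below illustrates only the corpus-preparation step, plus a toy bigram count as a stand-in for the statistics an n-gram trainer accumulates; the file format and boundary tokens are assumptions, not the patent's exact pipeline:

```python
from collections import Counter

def prepare_corpus(segmented_sentences):
    """Render cleaned, word-segmented sentences in the
    one-sentence-per-line, space-separated format that an
    n-gram trainer such as kenlm's lmplz expects."""
    return "\n".join(" ".join(tokens) for tokens in segmented_sentences)

def bigram_counts(segmented_sentences):
    """Toy stand-in for training: raw bigram counts with <s>/</s>
    sentence-boundary markers, the basic statistic behind an
    n-gram language model."""
    counts = Counter()
    for tokens in segmented_sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

corpus = [["网站", "内容"], ["网站", "错别字"]]
print(prepare_corpus(corpus))
print(bigram_counts(corpus)[("<s>", "网站")])  # → 2
```

A real deployment would write the prepared corpus to disk and invoke the kenlm trainer on it; the counts above merely show what the trainer smooths into conditional probabilities.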
It should be noted that, in this embodiment, specific implementations of each module are in one-to-one correspondence with the above embodiments, and will not be described in detail in this embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are all within the protection of the present application.

Claims (9)

1. The website wrongly written word recognition method is characterized by comprising the following steps of:
for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain the source code of all sub-pages meeting the crawling depth requirement, extracting text characters from the crawled source code, segmenting long sentences in the text into short sentences by using regular expressions, and performing word segmentation on the short sentences;
based on a pre-trained kenlm model, scoring each word with the kenlm model, and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results;
the training method of the kenlm model comprises the following steps:
splitting the training text by paragraphs and punctuation, dividing each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributing the sentences to the vendor interfaces;
respectively acquiring the analysis results returned by each vendor interface for the sentences of the same page, and integrating these results to obtain the wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each vendor to obtain a training corpus, and training on this corpus to obtain the kenlm model.
2. The website wrongly written word recognition method as set forth in claim 1, wherein the method for obtaining text characters from the crawled source code comprises the following steps:
the distributed crawler crawls the website source code and stores the result into Kafka;
the data in Kafka are cleaned with Logstash and then stored in Elasticsearch;
and the text characters are extracted from the page tags in the page source code by using regular expressions.
3. The method for identifying wrongly written words of a website according to claim 1, wherein the training method of the kenlm model further comprises retraining the model at a set interval, the specific retraining method comprising:
correcting errors in the training text based on the analysis results returned by each vendor interface for the sentences of the same page, adding the corrected text to the training corpus, and retraining kenlm on the corrected corpus to obtain a new kenlm model.
4. The method for identifying wrongly written words of a website according to claim 1, wherein after the wrongly written words contained in all sub-pages of the website have been identified, the step of correcting them comprises the following steps:
according to a pre-acquired common Chinese word data set, homophone data set and confusion-word data set, substituting each homophone, near-homograph and confusion word of an identified wrongly written word into the original sentence, calculating the PPL of each candidate sentence, and replacing the original sentence with the candidate having the minimum PPL; PPL is calculated as follows:

PPL(S) = p(w_1, w_2, ..., w_N)^(-1/N) = ( ∏_{i=1}^{N} p(w_i) )^(-1/N)

Taking the logarithm of both sides:

log PPL(S) = -(1/N) ∑_{i=1}^{N} log p(w_i)

where S is a sentence, w_i is the i-th word, p(w_i) is the probability of the i-th word, N is the number of words in S, and PPL (perplexity) is a metric for evaluating the quality of a language model.
5. The website wrongly written word recognition method as set forth in claim 1, wherein errors of the same type found in the analysis results returned by each vendor interface for the sentences of the same page are added to the homophone data set and the near-homograph data set respectively, to obtain updated homophone and near-homograph data sets.
6. The website wrongly written word recognition method as claimed in claim 1, wherein long sentences are segmented into short sentences by using regular expressions, and each short sentence is checked for words from the confusion set; if such a word is present, it is added to a suspected-error dictionary with its type marked as confusion; after word segmentation, each word is checked against the common word data set, and any word not found there is also added to the suspected-error dictionary.
7. The website wrongly written word recognition system is characterized by comprising: a main domain name input module, a source code crawling module, a word segmentation module and a wrongly written word recognition module;
the main domain name input module is used for inputting a main domain name;
the source code crawling module is used for crawling a specific main domain name address by utilizing a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain source codes of all sub pages meeting the crawling depth requirement;
the word segmentation module is used for extracting text characters from the crawled source code, segmenting long sentences in the text into short sentences by using regular expressions, and performing word segmentation on the short sentences;
the wrongly written word recognition module is used for scoring the short sentences with a pre-trained kenlm model and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results;
the training method of the kenlm model comprises the following steps:
splitting the training text by paragraphs and punctuation, dividing each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributing the sentences to the vendor interfaces;
respectively acquiring the analysis results returned by each vendor interface for the sentences of the same page, and integrating these results to obtain the wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each vendor to obtain a training corpus, and training on this corpus to obtain the kenlm model.
8. The website wrongly written word recognition system of claim 7, further comprising a kenlm model training module for training the kenlm model and an interface module for connecting to each vendor's input interface; the kenlm model training module calls the word segmentation module to split the training text by paragraphs and punctuation, divides each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributes the sentences to the vendor interfaces through the interface module;
the kenlm model training module then obtains, through the interface module, the analysis results returned by each vendor interface for the sentences of the same page, and integrates these results to obtain the wrongly written words of the whole page;
and the wrongly written word formats in the pages determined by each vendor are cleaned to obtain a training corpus, on which the kenlm model is trained.
9. A computer readable storage medium having stored thereon machine readable instructions which, when executed by a machine, cause the machine to perform the website wrongly written word recognition method of any one of claims 1 to 6.
CN202010826076.5A 2020-08-17 2020-08-17 Website wrongly written word recognition method and system Active CN111984845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010826076.5A CN111984845B (en) 2020-08-17 2020-08-17 Website wrongly written word recognition method and system

Publications (2)

Publication Number Publication Date
CN111984845A CN111984845A (en) 2020-11-24
CN111984845B true CN111984845B (en) 2023-10-31

Family

ID=73435293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010826076.5A Active CN111984845B (en) 2020-08-17 2020-08-17 Website wrongly written word recognition method and system

Country Status (1)

Country Link
CN (1) CN111984845B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883717A (en) * 2021-04-27 2021-06-01 北京嘉和海森健康科技有限公司 Wrongly written character detection method and device
CN114387602B (en) * 2022-03-24 2022-07-08 北京智源人工智能研究院 Medical OCR data optimization model training method, optimization method and equipment
CN115146636A (en) * 2022-09-05 2022-10-04 华东交通大学 Method, system and storage medium for correcting errors of Chinese wrongly written characters

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332028A (en) * 2011-10-15 2012-01-25 西安交通大学 Webpage-oriented unhealthy Web content identifying method
CN107679036A (en) * 2017-10-12 2018-02-09 南京网数信息科技有限公司 A kind of wrong word monitoring method and system
CN109408331A (en) * 2018-10-15 2019-03-01 四川长虹电器股份有限公司 Log alarming system based on user individual feature
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing
CN110276077A (en) * 2019-06-25 2019-09-24 上海应用技术大学 The method, device and equipment of Chinese error correction
CN110717031A (en) * 2019-10-15 2020-01-21 南京摄星智能科技有限公司 Intelligent conference summary generation method and system
CN110717327A (en) * 2019-09-29 2020-01-21 北京百度网讯科技有限公司 Title generation method and device, electronic equipment and storage medium
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN110929098A (en) * 2019-11-14 2020-03-27 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium
CN111026942A (en) * 2019-11-01 2020-04-17 平安科技(深圳)有限公司 Hot word extraction method, device, terminal and medium based on web crawler
CN111414749A (en) * 2020-03-18 2020-07-14 哈尔滨理工大学 Social text dependency syntactic analysis system based on deep neural network
CN111414735A (en) * 2020-03-11 2020-07-14 北京明略软件系统有限公司 Text data generation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9870355B2 (en) * 2015-07-17 2018-01-16 Ebay Inc. Correction of user input

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Stochastic optimization of program obfuscation; H. Liu et al.; 2017 IEEE/ACM 39th International Conference on Software Engineering; 221-231 *
A homophonic pun recognition model based on multi-dimensional semantic relations; Xu Linhong et al.; Science China: Information Sciences, Vol. 48, No. 11; 1510-1520 *
Research on key technologies of text proofreading for government websites; Yuan Zhi; China Masters' Theses Full-text Database, Information Science and Technology, No. 8 (2018); I138-974 *

Also Published As

Publication number Publication date
CN111984845A (en) 2020-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant