CN111984845B - Website wrongly written word recognition method and system - Google Patents

Website wrongly written word recognition method and system

Info

Publication number
CN111984845B
CN111984845B (application CN202010826076.5A)
Authority
CN
China
Prior art keywords
word
sentences
kenlm
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010826076.5A
Other languages
Chinese (zh)
Other versions
CN111984845A (en)
Inventor
邬鹏程
陈可义
邹林杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Baida Wisdom Network Technology Co ltd
Original Assignee
Jiangsu Baida Wisdom Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Baida Wisdom Network Technology Co ltd filed Critical Jiangsu Baida Wisdom Network Technology Co ltd
Priority to CN202010826076.5A priority Critical patent/CN111984845B/en
Publication of CN111984845A publication Critical patent/CN111984845A/en
Application granted granted Critical
Publication of CN111984845B publication Critical patent/CN111984845B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a website wrongly written word recognition method and system, comprising the following steps: for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and crawling-page address rules to obtain the source code of all sub-pages within the crawling depth; extracting text from the grabbed source code; segmenting long sentences in the text into short sentences with regular expressions and word-segmenting the short sentences; and scoring each word with a pre-trained kenlm model, identifying the wrongly written words contained in all sub-pages of the website according to the scoring results. The application is convenient to use: the user only needs to provide the website address to be monitored, and the wrongly written words on the site's sub-pages are identified on demand.

Description

Website wrongly written word recognition method and system
Technical Field
The application belongs to the technical field of artificial-intelligence natural language processing, and particularly relates to a website wrongly written word recognition method and system.
Background
Products currently on the market providing wrongly written word recognition services include those of Baidu, NetEase, and others. They focus on different areas, and each has its own strengths and weaknesses. In some business scenarios, adopting only one service cannot meet the accuracy requirements for the data; introducing all of them brings inconsistent data formats that must be cleaned, and high purchase costs. At present, each major platform only provides interface services, mainly aimed at plain text, and the text length is limited. They cannot perform wrongly written word identification directly on a web page, because the HTML code tags mixed into the page affect the judgment result.
Disclosure of Invention
The application aims to solve the technical problems in the prior art and provides a website wrongly written word recognition method.
In order to achieve the technical purpose, the application adopts the following technical scheme.
On the one hand, the application provides a website wrongly written word recognition method, which comprises the following steps:
for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and crawling-page address rules to obtain the source code of all sub-pages meeting the crawling depth requirement; obtaining text from the grabbed source code; segmenting long sentences in the text into short sentences with regular expressions; and word-segmenting the short sentences;
and scoring the short sentences with the pre-trained kenlm model, and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results.
Further, the method for obtaining text from the grabbed source code comprises the following steps:
the distributed crawler grabs the website source code and stores the result into Kafka;
the data in Kafka are cleaned with Logstash and then stored in Elasticsearch;
and the page tags in the page source code are removed with regular expressions to extract the text.
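As a minimal sketch of this extraction step (assuming plain regular expressions only, without the BeautifulSoup-assisted pipeline described later), the tag removal might look like:

```python
import re

def extract_text(html: str) -> str:
    """Strip script/style blocks and HTML tags from page source,
    keeping only the visible text (simplified sketch of the
    regular-expression extraction step)."""
    # remove <script> and <style> blocks entirely, including content
    html = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    # remove all remaining tags
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    # collapse runs of whitespace left behind by removed tags
    return re.sub(r"\s+", " ", text).strip()
```

In the full pipeline described below, BeautifulSoup would extract text first and regular expressions would clean up any remaining tags; this sketch shows only the regex half.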
Further, the training method of the kenlm model comprises the following steps:
dividing the text of the training text according to the paragraphs and punctuations, dividing the page into a plurality of sentences conforming to the length requirements of the selected manufacturer interfaces, and distributing the sentences to the manufacturer interfaces;
respectively acquiring analysis results of sentences in the same page returned by interfaces of various manufacturers, and respectively integrating the analysis results to obtain wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each manufacturer to obtain a training corpus, and training the training corpus to obtain a kenlm model.
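The paragraph-and-punctuation splitting step above can be sketched as follows (the exact punctuation set is an illustrative assumption):

```python
import re

def split_sentences(text: str):
    """Split training text into short sentences on paragraph breaks
    and Chinese/Western sentence-ending punctuation, dropping empty
    fragments (sketch of the corpus-preparation splitting step)."""
    parts = re.split(r"[\n。！？；!?;]+", text)
    return [p.strip() for p in parts if p.strip()]
```

Each resulting short sentence would then be word-segmented and written one per line to form the kenlm training corpus.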
Still further, the training method of the kenlm model also includes retraining the kenlm model at a set time, and the specific retraining method includes:
correcting the errors in the training text based on the analysis results, returned by the manufacturers' interfaces, for the sentences of the same page; adding the corrected text to the training corpus; and retraining on the corrected corpus to obtain a new kenlm model.
Further, the step of correcting the wrongly written words after identifying the wrongly written words contained in all the sub-pages of the website comprises the following steps:
according to the pre-acquired common Chinese word data set, homophone data set, and confusion-word data set, the homophones, near-form characters, and confusion words of each identified wrongly written word are substituted into the original sentence, PPL is computed, and the candidate giving the minimum PPL replaces the original. PPL is computed as follows:
PPL(S) = P(w_1 w_2 ... w_n)^(-1/n)
Taking the logarithm of both sides:
ln PPL(S) = -(1/n) Σ_{i=1}^{n} ln p(w_i)
where S is the sentence, w_i is the i-th word, p(w_i) is the probability of the i-th word, and PPL (perplexity) is a metric for evaluating the quality of a language model.
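A minimal numeric sketch of the perplexity computation, assuming the per-word probabilities p(w_i) have already been produced by a language model (the probability values below are illustrative):

```python
import math

def perplexity(word_probs):
    """PPL(S) = P(w_1 .. w_n)^(-1/n), computed in log space for
    numerical stability from a list of per-word probabilities."""
    n = len(word_probs)
    log_p = sum(math.log(p) for p in word_probs)
    return math.exp(-log_p / n)
```

For example, a two-word sentence whose words each have probability 0.25 has perplexity 4.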
Further, error types that are identical across the analysis results returned by the manufacturers' interfaces for sentences of the same page are added to the homophone data set, the near-form character data set, and the confusion-word data set respectively, obtaining updated versions of those data sets.
In a second aspect, the application provides a website wrongly written word recognition system, comprising: a main domain name input module, a source code crawling module, a word segmentation module, and a wrongly written word recognition module;
the main domain name input module is used for inputting a main domain name;
the source code crawling module is used for crawling a specific main domain name address by utilizing a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain source codes of all sub pages meeting the crawling depth requirement;
the word segmentation module is used for obtaining text from the grabbed source code, segmenting long sentences in the text into short sentences with regular expressions, and word-segmenting the short sentences;
the wrongly written word recognition module is used for scoring the short sentences with the pre-trained kenlm model and identifying, according to the scoring results, the wrongly written words contained in all sub-pages of the website.
Further, the system also comprises a kenlm model training module and an interface module, where the kenlm model training module is used to train the kenlm model and the interface module is used to connect to each manufacturer's input interface. The kenlm model training module calls the word segmentation module to split the training text by paragraphs and punctuation, divides each page into a number of sentences meeting the length requirements of the selected manufacturer interfaces, and distributes the sentences to the manufacturers' interfaces through the interface module;
the kenlm model training module obtains, through the interface module, the analysis results returned by each manufacturer's interface for the sentences of the same page, and integrates each manufacturer's results to obtain the wrongly written words of the whole page;
and the wrongly written word formats determined by each manufacturer are cleaned to obtain the training corpus, on which the kenlm model is trained.
The present application also provides a computer-readable storage medium storing machine-readable instructions that, when executed by a machine, cause the machine to perform the website wrongly written word recognition method provided in any one of the embodiments of the first aspect.
The beneficial technical effects obtained by the application are as follows:
1. the application is convenient to use: the user only needs to provide the website address to be monitored, and the wrongly written words on the site's sub-pages are identified on demand;
2. it helps the user save cost: paying only the platform detection fee, the user can select a designated manufacturer's product service without purchasing that service outright, avoiding unnecessary purchases, saving development and usage cost, and reducing resource waste;
3. detection accuracy is improved: summarizing the results of multiple manufacturers reduces false alarms and improves result accuracy;
4. the distributed crawler can specify the crawling depth and crawling-page URL rules, improving acquisition speed; the controllable acquisition range reduces unnecessary page grabbing and page parsing, improving the efficiency of the whole task;
5. the results returned by the platform are stored in its own database and used as the training data set for the self-developed wrongly written word recognition algorithm; as the number of tasks grows, the training set gradually expands, which improves the accuracy of the platform's own model training;
6. the multi-engine crawler ensures that pages are captured after their JavaScript has loaded, so both the page source code and the fully loaded page text can be processed;
7. Kafka + ELK guarantees the efficiency and accuracy of data warehousing.
Drawings
FIG. 1 is a full text segmentation process according to an embodiment of the present application;
FIG. 2 is a flow chart of the overall process of the present application.
Detailed Description
The application is further described below with reference to the drawings.
The first embodiment of the method for identifying the wrongly written characters of the website comprises the following steps:
for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and crawling-page address rules to obtain the source code of all sub-pages meeting the crawling depth requirement; obtaining text from the grabbed source code; segmenting long sentences in the text into short sentences with regular expressions; and word-segmenting the short sentences;
and scoring the short sentences with the pre-trained kenlm model, and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results.
The specific method for scoring short sentences with the kenlm model in this embodiment is as follows:
Bi-gram: pad data at both ends of the sentence so that the scores of the first and last words can be computed, and score each word together with its preceding word;
Tri-gram: likewise pad the head and tail, and score each word together with its two preceding words;
take sentence_score as the average of the Bi-gram and Tri-gram values at each corresponding position;
take the median sentence_score_median of sentence_score;
compute the absolute value of the difference between each score in sentence_score and sentence_score_median, obtaining the absolute-difference data margin_median:
margin_median = [|sentence_score[1] - sentence_score_median|,
|sentence_score[2] - sentence_score_median|, ...,
|sentence_score[n] - sentence_score_median|]
take the median median_abs_deviation of the absolute differences;
y_score = 0.6745 × margin_median ÷ median_abs_deviation
Notes:
n-gram: a model in which the occurrence of a word depends on the n-1 words preceding it;
n = 2 gives the Bi-gram (binary) model;
n = 3 gives the Tri-gram (ternary) model;
the probability of sentence S occurring is the product of the probabilities of its n words:
P(S) = P(w_1 w_2 w_3 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_n|w_1 w_2 ... w_{n-1})
By the Markov assumption, the probability of a word depends only on a limited number of the words in front of it, giving the shorthand
P(S) ≈ P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1})
S(w_{i-1} w_i): score of the i-th word in the Bi-gram model;
avgscore_bi_gram: average score of the i-th word and the previous word;
S(w_{i-2} w_{i-1} w_i): score of the i-th word in the Tri-gram model;
avgscore_tri_gram: average score of the i-th word and the two preceding words;
sentence_score: average of avgscore_bi_gram and avgscore_tri_gram;
sentence_score_median: the median of sentence_score;
margin_median: the absolute difference between each score in sentence_score and sentence_score_median;
median_abs_deviation: the median of margin_median.
The score of each word is obtained in this way, and words whose score exceeds the threshold are identified as wrongly written words. Optionally, the identified wrongly written words are stored in a database and displayed.
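The y_score computation above (a modified z-score built on the median absolute deviation of the per-word n-gram scores) can be sketched as follows; the input list of per-word scores is assumed to have already been produced by the kenlm model:

```python
import statistics

def suspect_scores(sentence_score):
    """Modified z-score: deviation of each word's averaged n-gram
    score from the sentence median, scaled by the median absolute
    deviation. 0.6745 is the standard normalizing constant that makes
    the MAD comparable to a standard deviation for normal data."""
    med = statistics.median(sentence_score)
    margin = [abs(s - med) for s in sentence_score]      # margin_median
    mad = statistics.median(margin)                      # median_abs_deviation
    return [0.6745 * m / mad for m in margin]            # y_score per word
```

A word whose y_score exceeds the chosen threshold is flagged as a suspected wrongly written word; in the toy input below the outlier score 10 receives by far the largest y_score.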
In the specific embodiment, the distributed crawler grabs the website source code, stores the result into Kafka, and then stores it into Elasticsearch via Logstash. Kafka is used for peak shaving, reducing data-processing and warehousing pressure under high concurrency; Logstash performs data cleaning and transmission, sending the data in Kafka to Elasticsearch after preliminary cleaning. Tags in the page source code are removed with regular expressions, BeautifulSoup, and the like, and the text is extracted: BeautifulSoup takes the text out of the source code directly, and regular expressions then remove any remaining tags.
Word segmentation of the extracted text uses existing tools such as jieba; existing Chinese word-segmentation tools are not described further in this application.
The terms appearing are explained as follows:
Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data of consumers in a website.
Logstash is a free and open server-side data-processing pipeline that can collect data from multiple sources, transform it, and then send it to a "repository" of your choice.
Elasticsearch is a distributed, highly scalable, near-real-time search and data-analysis engine;
Beautiful Soup is a Python library that can extract data from HTML or XML files;
kenlm is a statistical language model tool; it is fast and memory-efficient and can use multi-core processors.
In a second embodiment, based on the first embodiment, the method includes training the kenlm model, specifically including the following steps: splitting the training text by paragraphs and punctuation, dividing each page into a number of sentences meeting the length requirements of the selected manufacturer interfaces, and distributing the sentences to the manufacturers' interfaces;
respectively acquiring analysis results of sentences in the same page returned by interfaces of various manufacturers, and respectively integrating the analysis results to obtain wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each manufacturer to obtain a training corpus, and training the training corpus to obtain a kenlm model.
The training text is obtained as follows: the distributed crawler grabs the website source code, stores the result into Kafka, and stores it into Elasticsearch via Logstash. Kafka is used for peak shaving, reducing data-processing and warehousing pressure under high concurrency; Logstash performs data cleaning and transmission, sending the data in Kafka to Elasticsearch after preliminary cleaning;
tags in the page source code are removed with regular expressions and tools such as BeautifulSoup to extract the text: BeautifulSoup takes the text out of the source code directly, and regular expressions then remove any remaining tags.
According to each manufacturer's interface documentation, the text is split by paragraphs, semantics, and contextual semantics; the page text is divided into sentences meeting the interface length requirements, and the sentences are distributed to the manufacturers' interfaces.
For example, take as the page text a single classical Chinese paragraph, the opening passage of Cao Zhi's 《洛神赋》 ("Ode to the Goddess of the Luo"), which recounts the author's return from the capital across the Luo River and the legend of the river goddess Mifei (宓妃).
Vendor 1 interface limit length is 30:
If the text is divided by paragraphs, this passage forms only one paragraph, and its word count exceeds the interface limit.
Divided by punctuation, the independent sentences have lengths 15, 15, 16, 4, 10, 16, 10, 26, 12, 20, 9, 21, 5, 36 respectively. Sorted as 4, 5, 9, 10, 10, 12, 15, 15, 16, 16, 20, 21, 26, 36 and recombined under the 30-character limit, they yield pairs such as (26, 4), (21, 9), (20, 10), (16, 10). The last sentence, of length 36, exceeds the word-count limit; since it is semantically complete and has contextual dependencies, its text is analyzed separately: the several clauses it contains are split apart, and the interface is then called.
As another example, suppose manufacturer 1 limits an interface request to at most 50 words, manufacturer 2 to at most 100, and the source code of a certain page contains 1000 words of text. Calling manufacturer 1's interface then requires splitting the text into 20 segments of length 50 (this assumes each 50-word segment is a complete sentence; in practice the word count per segment must be smaller, so more than 20 segments result).
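A sketch of packing short sentences into interface-sized requests (greedy packing is an assumption here; the embodiment also describes sorting sentence lengths and pairing them):

```python
def pack_sentences(sentences, limit):
    """Greedily combine consecutive short sentences so each request
    stays within a manufacturer interface's character limit. A single
    sentence longer than the limit is kept whole for separate handling,
    matching the embodiment's special case."""
    batches, current = [], ""
    for s in sentences:
        if current and len(current) + len(s) > limit:
            batches.append(current)
            current = s
        else:
            current += s
    if current:
        batches.append(current)
    return batches
```

Each batch would then be sent as one request to the manufacturer's interface.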
The wrongly written word formats determined by each manufacturer must be cleaned because the data formats the manufacturers return are inconsistent, and different methods are needed to convert each manufacturer's returned results into a common format. For example, Baidu returns one structure and another manufacturer returns a different one, while this implementation warehouses a single unified structure. Only by cleaning the data into that format can unified warehousing be guaranteed; the detailed cleaning process depends on each return format, and cleaning the manufacturer-determined wrongly written word formats is a basic and indispensable step.
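A hypothetical sketch of this format-cleaning step. The per-manufacturer field names below are invented placeholders (the actual return schemas are not shown in the text); only the idea of mapping every schema onto one warehousing record is taken from the description:

```python
def normalize_result(vendor, raw):
    """Map one manufacturer's response record onto a common
    {wrong, correct, position} structure. Field names per vendor
    are illustrative assumptions, not real API schemas."""
    if vendor == "baidu":
        return {"wrong": raw["ori_frag"],
                "correct": raw["correct_frag"],
                "position": raw["begin_pos"]}
    # fallback mapping for a second, differently shaped schema
    return {"wrong": raw["error"],
            "correct": raw["suggest"],
            "position": raw["offset"]}
```

Every record entering the warehouse then has the same keys regardless of which interface produced it.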
Recomputing the position of each wrongly written word is necessary because the text extracted from the page source code is usually far longer than the interface length limit, so one page's text must be split into several segments, each calling the same manufacturer's interface. The position of a wrongly written word in the data returned by each call is its position within that segment, not within the whole text, so the position must be recalculated. The cleaning problem above is also involved: some manufacturers ignore punctuation when computing character positions (in "I love you, China", "China" is at positions 3-4), some count punctuation ("China" is at positions 4-5), and some start counting from 1. In the example above, the text is split into 20 segments for manufacturer 1; the interface results for those 20 segments are then integrated to obtain that manufacturer's analysis of all the text of the whole page. Integrating one manufacturer's results here does not mean integrating all manufacturers' results, as shown in the flow chart of Fig. 2.
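The position recalculation can be sketched as follows; the one_based flag models manufacturers that count from 1, and the punctuation-counting difference is deliberately left out of this simplified sketch:

```python
def to_page_position(segment_offset, pos_in_segment, one_based=False):
    """Convert an error position returned for one text segment into
    its position within the full page text. segment_offset is where
    the segment starts in the page text (0-based); one_based marks
    manufacturers whose returned positions start at 1."""
    pos = pos_in_segment - (1 if one_based else 0)
    return segment_offset + pos
```

For a segment starting at page offset 100, a 0-based in-segment position 5 and a 1-based in-segment position 6 both denote page position 105.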
In a specific embodiment, the method further comprises retraining the kenlm model at a set time, and the retraining method specifically comprises the following steps: correcting the errors in the training text based on the analysis results, returned by the manufacturers' interfaces, for the sentences of the same page; adding the corrected text to the training corpus; and retraining on the corrected corpus to obtain a new kenlm model.
In a specific embodiment, each manufacturer's recognition results for the same page are returned through the interface, and the returned results for the sentences of the same page are integrated to obtain the wrongly written words of the whole page; all manufacturers' results are then integrated, and the error types that are identical across the manufacturers' results are added to the homophone, near-form character, and related data sets.
Optionally, the returned results of each manufacturer are subjected to format arrangement and put in storage for subsequent result display.
On the basis of the kenlm model, a common Chinese word data set, a homophone data set, and a confusion-word data set are obtained. Long sentences are split into short sentences with regular expressions, and each short sentence is checked for words from the confusion set; if such a word is present, it is added to a suspected-error dictionary (with type "confusion"). After word segmentation, any word not found in the common-word data set is also added to the suspected-error dictionary. Suspected wrongly written words in the sentence are then detected with the kenlm model according to the following formulas.
Bi-gram: pad data at both ends of the sentence so that the scores of the first and last words can be computed, and score each word together with its preceding word;
Tri-gram: likewise pad the head and tail, and score each word together with its two preceding words;
take sentence_score as the average of the Bi-gram and Tri-gram values at each corresponding position;
take the median sentence_score_median of sentence_score;
compute the absolute value of the difference between each score in sentence_score and sentence_score_median, obtaining the absolute-difference data margin_median:
margin_median = [|sentence_score[1] - sentence_score_median|,
|sentence_score[2] - sentence_score_median|, ...,
|sentence_score[n] - sentence_score_median|]
take the median median_abs_deviation of the absolute differences;
y_score = 0.6745 × margin_median ÷ median_abs_deviation
Through the above method, a common Chinese word data set, a homophone data set, and a confusion-word data set can be obtained while the web pages' wrongly written words are being identified. These data sets are stored in the platform's own database and used as the training data set for the self-developed wrongly written word recognition algorithm; as the number of tasks grows, the training set gradually expands, which improves the accuracy of the platform's own model training, and the data sets can also be sold as professional data sets to recoup the platform's development cost.
The method provided by the application is applicable to all mainstream wrongly written word retrieval tools, and recognition accuracy is improved through the autonomous algorithm. By integrating the mainstream wrongly written word recognition interfaces on the market and cleaning and integrating their respective analysis results, the security of user data is guaranteed and only the comprehensive analysis result is returned to the user.
The application can accurately extract web-page text, automatically split it according to the different word-count limits of each mainstream platform, and dynamically integrate the split results. The application further displays each manufacturer's results together with the recognition results of the method provided herein, realizing a summary display of website analysis results for customized users.
In a third embodiment, after the wrongly written words contained in all sub-pages of the website are identified, a step of correcting them is performed:
according to the pre-acquired, up-to-date common Chinese word data set, homophone data set, and confusion-word data set, the homophones and near-form characters of each identified wrongly written word are substituted into the original sentence, PPL is computed, and the candidate giving the minimum PPL replaces the original. PPL is computed as follows:
PPL(S) = P(w_1 w_2 ... w_n)^(-1/n)
Taking the logarithm of both sides:
ln PPL(S) = -(1/n) Σ_{i=1}^{n} ln p(w_i)
where S is the sentence, w_i is the i-th word, p(w_i) is the probability of the i-th word, and PPL (perplexity) is a metric for evaluating the quality of a language model.
PPL-based error correction builds on the error-prone words recognized by the kenlm model above. The method is to obtain all homophones, near-form characters, and error-prone variants of the wrongly written word, substitute each into the original sentence, and compute the PPL of each substituted sentence; the word giving the minimum PPL is the correction. The probability of each word is obtained from the kenlm model.
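A sketch of the minimum-PPL correction, with the PPL scoring function supplied by the caller (in the described system it would be backed by the kenlm model; here a stand-in dictionary plays that role in the test):

```python
def best_correction(sentence, target, candidates, ppl):
    """Substitute each homophone/near-form candidate for the suspected
    wrongly written word `target` and return the substituted sentence
    with the lowest perplexity, as judged by the `ppl` function."""
    options = [sentence.replace(target, c) for c in candidates]
    return min(options, key=ppl)
```

In practice `ppl` would wrap a trained language model's scoring of the whole sentence; any candidate set and scorer can be plugged in.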
In a fourth embodiment, corresponding to the website wrongly written word recognition method provided above, a website wrongly written word recognition system is provided, comprising: a main domain name input module, a source code crawling module, a word segmentation module, and a wrongly written word recognition module;
the main domain name input module is used for inputting a main domain name;
the source code crawling module is used for crawling a specific main domain name address by utilizing a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain source codes of all sub pages meeting the crawling depth requirement;
the word segmentation module is used for extracting text characters from the crawled source code, segmenting long sentences in the text into short sentences by using regular expressions, and performing word segmentation on the short sentences;
the wrongly written word recognition module is used for scoring the short sentences with a pre-trained kenlm model and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results.
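The splitting step performed by the word segmentation module can be sketched with a regular expression. The exact pattern used in practice is not disclosed, so the one below is an assumption covering common Chinese sentence-final punctuation:

```python
import re

# Assumed set of sentence-ending punctuation marks; extend as needed.
SENT_END = re.compile(r"[。！？；!?;]+")

def split_sentences(text):
    """Cut a long run of page text into short sentences at
    sentence-final punctuation, dropping empty fragments."""
    return [p.strip() for p in SENT_END.split(text) if p.strip()]

print(split_sentences("网站内容很多。错别字识别很重要！欢迎访问"))
# → ['网站内容很多', '错别字识别很重要', '欢迎访问']
```

The resulting short sentences are then word-segmented before scoring; keeping them short also makes it easy to respect the per-request length limits of the vendor interfaces.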
Further, on the basis of the fourth embodiment, the system also includes a kenlm model training module and an interface module, where the kenlm model training module is used to train the kenlm model and the interface module is used to connect to each vendor's input interface. The kenlm model training module calls the word segmentation module to split the training text by paragraphs and punctuation, divides each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributes the sentences to the vendor interfaces through the interface module;
the kenlm model training module then obtains, through the interface module, the analysis results returned by each vendor interface for the sentences of the same page, and integrates these results to obtain the wrongly written words of the whole page;
finally, the wrongly written word formats in the pages determined by the vendors are cleaned to obtain a training corpus, and the kenlm model is trained on this corpus.
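kenlm itself is trained offline (typically with its `lmplz` tool) on a one-sentence-per-line, space-separated corpus. The sketch below illustrates only the corpus-preparation step, plus a toy bigram count as a stand-in for the statistics an n-gram trainer accumulates; the file format and boundary tokens are assumptions, not the patent's exact pipeline:

```python
from collections import Counter

def prepare_corpus(segmented_sentences):
    """Render cleaned, word-segmented sentences in the
    one-sentence-per-line, space-separated format that an
    n-gram trainer such as kenlm's lmplz expects."""
    return "\n".join(" ".join(tokens) for tokens in segmented_sentences)

def bigram_counts(segmented_sentences):
    """Toy stand-in for training: raw bigram counts with <s>/</s>
    sentence-boundary markers, the basic statistic behind an
    n-gram language model."""
    counts = Counter()
    for tokens in segmented_sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        counts.update(zip(padded, padded[1:]))
    return counts

corpus = [["网站", "内容"], ["网站", "错别字"]]
print(prepare_corpus(corpus))
print(bigram_counts(corpus)[("<s>", "网站")])  # → 2
```

A real deployment would write the prepared corpus to disk and invoke the kenlm trainer on it; the counts above merely show what the trainer smooths into conditional probabilities.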
It should be noted that, in this embodiment, specific implementations of each module are in one-to-one correspondence with the above embodiments, and will not be described in detail in this embodiment.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those having ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are all within the protection of the present application.

Claims (9)

1. The website wrongly written word recognition method is characterized by comprising the following steps of:
for a specific main domain name address, crawling with a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain the source code of all sub-pages meeting the crawling depth requirement, extracting text characters from the crawled source code, segmenting long sentences in the text into short sentences by using regular expressions, and performing word segmentation on the short sentences;
based on a pre-trained kenlm model, scoring each word with the kenlm model, and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results;
the training method of the kenlm model comprises the following steps:
splitting the training text by paragraphs and punctuation, dividing each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributing the sentences to the vendor interfaces;
respectively acquiring the analysis results returned by each vendor interface for the sentences of the same page, and integrating these results to obtain the wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each vendor to obtain a training corpus, and training on this corpus to obtain the kenlm model.
2. The website wrongly written word recognition method as set forth in claim 1, wherein the method for obtaining text characters from the crawled source code comprises the following steps:
the distributed crawler crawls the website source code and stores the result into Kafka;
the data in Kafka are cleaned with Logstash and then stored in Elasticsearch;
and the text characters are extracted from the page tags in the page source code by using regular expressions.
3. The method for identifying wrongly written words of a website according to claim 1, wherein the training method of the kenlm model further comprises retraining the model at a set interval, the specific retraining method comprising:
correcting errors in the training text based on the analysis results returned by each vendor interface for the sentences of the same page, adding the corrected text to the training corpus, and retraining kenlm on the corrected corpus to obtain a new kenlm model.
4. The method for identifying wrongly written words of a website according to claim 1, wherein after the wrongly written words contained in all sub-pages of the website have been identified, the step of correcting them comprises the following steps:
according to a pre-acquired common Chinese word data set, homophone data set and confusion-word data set, substituting each homophone, near-homograph and confusion word of an identified wrongly written word into the original sentence, calculating the PPL of each candidate sentence, and replacing the original sentence with the candidate having the minimum PPL; PPL is calculated as follows:

PPL(S) = p(w_1, w_2, ..., w_N)^(-1/N) = ( ∏_{i=1}^{N} p(w_i) )^(-1/N)

Taking the logarithm of both sides:

log PPL(S) = -(1/N) ∑_{i=1}^{N} log p(w_i)

where S is a sentence, w_i is the i-th word, p(w_i) is the probability of the i-th word, N is the number of words in S, and PPL (perplexity) is a metric for evaluating the quality of a language model.
5. The website wrongly written word recognition method as set forth in claim 1, wherein errors of the same type found in the analysis results returned by each vendor interface for the sentences of the same page are added to the homophone data set and the near-homograph data set respectively, to obtain updated homophone and near-homograph data sets.
6. The website wrongly written word recognition method as claimed in claim 1, wherein long sentences are segmented into short sentences by using regular expressions, and each short sentence is checked for words from the confusion set; if such a word is present, it is added to a suspected-error dictionary with its type marked as confusion; after word segmentation, each word is checked against the common word data set, and any word not found there is also added to the suspected-error dictionary.
7. The website wrongly written word recognition system is characterized by comprising: a main domain name input module, a source code crawling module, a word segmentation module and a wrongly written word recognition module;
the main domain name input module is used for inputting a main domain name;
the source code crawling module is used for crawling a specific main domain name address by utilizing a distributed crawler according to a preset crawling depth and a crawling page address rule to obtain source codes of all sub pages meeting the crawling depth requirement;
the word segmentation module is used for extracting text characters from the crawled source code, segmenting long sentences in the text into short sentences by using regular expressions, and performing word segmentation on the short sentences;
the wrongly written word recognition module is used for scoring the short sentences with a pre-trained kenlm model and identifying the wrongly written words contained in all sub-pages of the website according to the scoring results;
the training method of the kenlm model comprises the following steps:
splitting the training text by paragraphs and punctuation, dividing each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributing the sentences to the vendor interfaces;
respectively acquiring the analysis results returned by each vendor interface for the sentences of the same page, and integrating these results to obtain the wrongly written words of the whole page;
and cleaning the wrongly written word formats in the pages determined by each vendor to obtain a training corpus, and training on this corpus to obtain the kenlm model.
8. The website wrongly written word recognition system of claim 7, further comprising a kenlm model training module for training the kenlm model and an interface module for connecting to each vendor's input interface; the kenlm model training module calls the word segmentation module to split the training text by paragraphs and punctuation, divides each page into a plurality of sentences that meet the length requirements of the selected vendor interfaces, and distributes the sentences to the vendor interfaces through the interface module;
the kenlm model training module then obtains, through the interface module, the analysis results returned by each vendor interface for the sentences of the same page, and integrates these results to obtain the wrongly written words of the whole page;
and the wrongly written word formats in the pages determined by each vendor are cleaned to obtain a training corpus, on which the kenlm model is trained.
9. A computer readable storage medium having stored thereon machine readable instructions which, when executed by a machine, cause the machine to perform the website wrongly written word recognition method of any one of claims 1 to 6.
CN202010826076.5A 2020-08-17 2020-08-17 Website wrongly written word recognition method and system Active CN111984845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010826076.5A CN111984845B (en) 2020-08-17 2020-08-17 Website wrongly written word recognition method and system

Publications (2)

Publication Number Publication Date
CN111984845A CN111984845A (en) 2020-11-24
CN111984845B true CN111984845B (en) 2023-10-31

Family

ID=73435293

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010826076.5A Active CN111984845B (en) 2020-08-17 2020-08-17 Website wrongly written word recognition method and system

Country Status (1)

Country Link
CN (1) CN111984845B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112883717A (en) * 2021-04-27 2021-06-01 北京嘉和海森健康科技有限公司 Wrongly written character detection method and device
CN114387602B (en) * 2022-03-24 2022-07-08 北京智源人工智能研究院 Medical OCR data optimization model training method, optimization method and equipment
CN115146636A (en) * 2022-09-05 2022-10-04 华东交通大学 Method, system and storage medium for correcting errors of Chinese wrongly written characters

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102332028A (en) * 2011-10-15 2012-01-25 西安交通大学 Webpage-oriented unhealthy Web content identifying method
CN107679036A (en) * 2017-10-12 2018-02-09 南京网数信息科技有限公司 A kind of wrong word monitoring method and system
CN109408331A (en) * 2018-10-15 2019-03-01 四川长虹电器股份有限公司 Log alarming system based on user individual feature
CN109992769A (en) * 2018-12-06 2019-07-09 平安科技(深圳)有限公司 Sentence reasonability judgment method, device, computer equipment based on semanteme parsing
CN110276077A (en) * 2019-06-25 2019-09-24 上海应用技术大学 The method, device and equipment of Chinese error correction
CN110717031A (en) * 2019-10-15 2020-01-21 南京摄星智能科技有限公司 Intelligent conference summary generation method and system
CN110717327A (en) * 2019-09-29 2020-01-21 北京百度网讯科技有限公司 Title generation method and device, electronic equipment and storage medium
CN110852087A (en) * 2019-09-23 2020-02-28 腾讯科技(深圳)有限公司 Chinese error correction method and device, storage medium and electronic device
CN110929098A (en) * 2019-11-14 2020-03-27 腾讯科技(深圳)有限公司 Video data processing method and device, electronic equipment and storage medium
CN111026942A (en) * 2019-11-01 2020-04-17 平安科技(深圳)有限公司 Hot word extraction method, device, terminal and medium based on web crawler
CN111414749A (en) * 2020-03-18 2020-07-14 哈尔滨理工大学 Social text dependency syntactic analysis system based on deep neural network
CN111414735A (en) * 2020-03-11 2020-07-14 北京明略软件系统有限公司 Text data generation method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9870355B2 (en) * 2015-07-17 2018-01-16 Ebay Inc. Correction of user input

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Stochastic optimization of program obfuscation; H. Liu et al.; 2017 IEEE/ACM 39th International Conference on Software Engineering; 221-231 *
A homophonic pun recognition model based on multi-dimensional semantic relations; Xu Linhong et al.; Science China: Information Sciences, Vol. 48, No. 11; 1510-1520 *
Research on key technologies of text proofreading for government websites; Yuan Zhi; China Masters' Theses Full-text Database, Information Science and Technology, No. 8 (2018); I138-974 *

Also Published As

Publication number Publication date
CN111984845A (en) 2020-11-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant