CN114282523A - Statement correction method and device based on bert model and ngram model - Google Patents

Statement correction method and device based on bert model and ngram model

Info

Publication number
CN114282523A
CN114282523A
Authority
CN
China
Prior art keywords
model
confusion
sentence
word
degree
Prior art date
Legal status
Pending
Application number
CN202111386417.2A
Other languages
Chinese (zh)
Inventor
汪玉珠
刘学谦
田贺锁
Current Assignee
Beijing Fangcun Wuyou Technology Development Co ltd
Original Assignee
Beijing Fangcun Wuyou Technology Development Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Fangcun Wuyou Technology Development Co ltd filed Critical Beijing Fangcun Wuyou Technology Development Co ltd
Priority to CN202111386417.2A priority Critical patent/CN114282523A/en
Publication of CN114282523A publication Critical patent/CN114282523A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The application provides a statement correction method and device based on a bert model and an ngram model, belonging to the technical field of data processing. The method comprises: determining a plurality of candidate target words for replacing an erroneous word in the sentence, thereby forming a plurality of candidate sentences; for each candidate sentence, calculating a first confusion degree based on a preset bert model and a second confusion degree based on a preset ngram model; assigning a first weighting coefficient to the first confusion degree and a second weighting coefficient to the second confusion degree; determining the total confusion degree of each candidate sentence by weighting the first confusion degree and the second confusion degree; and determining a final sentence based on the total confusion degree. The bert model and the ngram model complement each other, improving the accuracy of statement correction.

Description

Statement correction method and device based on bert model and ngram model
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a statement correction method and device based on a bert model and an ngram model.
Background
Perplexity (ppl) is an index for evaluating the quality of a language model. A language model measures how good a sentence is, and essentially calculates the probability of the sentence:
For a sentence s (a sequence of words w): s = w_1, w_2, ..., w_n
Its probability is: P(s) = P(w_1, w_2, ..., w_n) = p(w_1) p(w_2|w_1) ... p(w_n|w_1, w_2, ..., w_{n-1})
The calculation formula for ppl is:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n)
As the formula shows, the larger the sentence probability, the better the language model and the smaller the perplexity. There are two very large problems with this way of calculating: 1. the parameter space is too large: the conditional probability p(w_n|w_1 w_2 ... w_{n-1}) has too many possibilities to estimate; 2. the data are severely sparse: a great many word combinations never appear in the corpus, so the probability obtained by maximum-likelihood estimation is 0.
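A minimal sketch of this calculation, assuming the chain-rule conditional probabilities are already available (the probability values below are illustrative, not from the patent):

```python
import math

def sentence_ppl(cond_probs):
    """Perplexity from the chain-rule conditional probabilities
    p(w_i | w_1 ... w_{i-1}); equals P(s)^(-1/n)."""
    n = len(cond_probs)
    log_p = sum(math.log(p) for p in cond_probs)  # log P(s)
    return math.exp(-log_p / n)

# A fluent sentence (higher conditional probabilities) scores a lower ppl;
# a single unseen n-gram with probability 0 would make log() blow up,
# which is exactly the data-sparsity problem described above.
print(sentence_ppl([0.20, 0.30, 0.25, 0.40]))  # ~3.6
print(sentence_ppl([0.20, 0.01, 0.05, 0.02]))  # ~26.6, much worse
```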
In the prior art, different sentence-probability calculation schemes are generally designed with a bert model or an ngram model, which alleviates the defects of the ppl formula to a certain extent.
Bert is a masked language model (MLM) used to predict the word probability at a MASK position: the probability of each candidate token at the masked position is computed conditioned on the context of the whole sentence, which makes it well suited to calculating ppl, and bert's generalization capability is strong. A common approach is to directly average the scores the model outputs for the words and take the mean as the ppl confusion degree. Using bert alone has two problems in the MLM pre-training task, where the smallest mask unit is a single character: first, when only part of a whole word is masked, the original token at the masked position is easily predicted from the unmasked part, so there is a degree of information leakage; second, because the smallest semantic unit is the word, bert's masking scheme does not emphasize the ability to restore a continuous blank span of text, and the ngram model can compensate for this well. In addition, owing to its characteristics, bert is not suitable for incremental training.
The ngram model is a statistical model and can be trained incrementally. Differences in calculating ppl with an ngram model generally lie in the choice of smoothing method; the following interpolation method is popular:
P(w_i|w_{i-2} w_{i-1}) = λ_3 p(w_i|w_{i-2} w_{i-1}) + λ_2 p(w_i|w_{i-1}) + λ_1 p(w_i) + λ_0 p
wherein λ_3 + λ_2 + λ_1 + λ_0 = 1, λ_i is a parameter to be determined, p = 1/R (R is the number of entries appearing in the corpus), and
p(w_i|w_{i-2} w_{i-1}), p(w_i|w_{i-1}), p(w_i) are respectively the 3-gram, 2-gram and 1-gram probabilities of the i-th word.
The ppl calculation formula is:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n) = ( ∏_{i=1}^{n} 1/p(w_i|w_{i-2} w_{i-1}) )^(1/n)
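A minimal sketch of this interpolation over toy count tables (the corpus, the count-table names and the λ values are all illustrative assumptions, not taken from the patent; note the λ values sum to 1):

```python
from collections import Counter

# Toy count tables; in practice these come from a large corpus.
corpus = ["<s> the cat sat", "<s> the dog sat", "<s> the cat ran"]
tokens = [w for s in corpus for w in s.split()]
counts1 = Counter(tokens)
counts2 = Counter(zip(tokens, tokens[1:]))
counts3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
R = len(counts1)  # number of distinct entries in the corpus

def p_interp(w, w1, w2, lambdas=(0.5, 0.3, 0.15, 0.05)):
    """lambda3*p(w|w1 w2) + lambda2*p(w|w2) + lambda1*p(w) + lambda0*(1/R)."""
    l3, l2, l1, l0 = lambdas
    p3 = counts3[(w1, w2, w)] / counts2[(w1, w2)] if counts2[(w1, w2)] else 0.0
    p2 = counts2[(w2, w)] / counts1[w2] if counts1[w2] else 0.0
    p1 = counts1[w] / len(tokens)
    return l3 * p3 + l2 * p2 + l1 * p1 + l0 / R

print(p_interp("sat", "the", "cat"))  # nonzero even for sparse 3-grams
```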
Ngram smoothing techniques cannot effectively solve the severe data-sparsity problem, and a single smoothing method is based on either words or characters alone.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present application provides a statement correction method and apparatus based on a bert model and an ngram model, which fuse the bert model and the ngram model for text error-correction scenarios. Specifically, text error correction acquires a plurality of candidate words from a candidate word list for a target word in the sentence to be corrected, replaces the target word with each candidate word in turn, and then uses ppl to pick out the better candidates.
The first aspect of the application provides a statement correction method based on a bert model and an ngram model, which mainly comprises the following steps:
determining a plurality of candidate target words for replacing an erroneous word in the sentence, thereby forming a plurality of candidate sentences;
for each candidate sentence, calculating a first confusion degree of the sentence based on a preset bert model, and calculating a second confusion degree of the sentence based on a preset ngram model;
assigning a first weighting coefficient to the first confusion degree and a second weighting coefficient to the second confusion degree;
determining the total confusion degree of each candidate sentence by weighting the first confusion degree and the second confusion degree;
and determining a final sentence based on the total confusion degree.
Preferably, calculating the first confusion degree of the sentence based on the preset bert model comprises:
determining a bert model output value for each word in the candidate sentence;
and taking the average of the bert model output values of the words as the first confusion degree of the candidate sentence.
Preferably, calculating the second confusion degree of the sentence based on the preset ngram model comprises:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n) = ( ∏_{i=1}^{n} 1/p(w_i|w_1 ... w_{i-1}) )^(1/n)
where n is the sentence length, w_i is the i-th word or character, and p(w_n|w_1 ... w_{n-1}) takes the form p(w_i) for a 1-gram, p(w_i|w_{i-1}) for a 2-gram, and p(w_i|w_{i-2} w_{i-1}) for a 3-gram, wherein:
P(w_i|w_{i-2} w_{i-1}) = λ_3 p(w_i|w_{i-2} w_{i-1}) + λ_2 p(w_i|w_{i-1}) + λ_1 p(w_i) + λ_0 p;
wherein λ_3 + λ_2 + λ_1 + λ_0 = 1,
λ_i is a parameter to be determined and p = 1/R, where R is the number of entries appearing in the corpus. If the corresponding word does not exist in the model, a character-based ngram model is used and the calculation proceeds by a back-off method: the 3-gram is used first; if it does not exist, the 2-gram is used; and if the 2-gram does not exist, the 1-gram is used, the 1-gram being determined with an additive smoothing method.
Preferably, assigning the first weighting coefficient and the second weighting coefficient comprises:
randomly assigning an initial first weighting coefficient and an initial second weighting coefficient, wherein the sum of the first weighting coefficient and the second weighting coefficient is 1;
normalizing the bert model's calculation result for each word to the interval [0,1];
for the ngram model, normalizing the final processing result of the sentence to the interval [0,1];
and applying the processing results to the total confusion degree calculation formula and solving for the optimal first weighting coefficient and second weighting coefficient by a gradient descent method.
The second aspect of the present application provides a statement correction apparatus based on a bert model and an ngram model, which mainly comprises:
a candidate sentence determining module, configured to determine a plurality of candidate target words for replacing an erroneous word in the sentence and form a plurality of candidate sentences;
a sub-confusion degree determining module, configured to calculate, for each candidate sentence, a first confusion degree based on a preset bert model and a second confusion degree based on a preset ngram model;
a weighting coefficient determining module, configured to assign a first weighting coefficient to the first confusion degree and a second weighting coefficient to the second confusion degree;
a total confusion degree determining module, configured to determine the total confusion degree of each candidate sentence by weighting the first confusion degree and the second confusion degree;
and a final sentence determining module, configured to determine a final sentence based on the total confusion degree.
Preferably, the sub-confusion degree determining module comprises a first confusion degree calculating unit, the first confusion degree calculating unit comprising:
a bert model output value determining subunit, configured to determine a bert model output value for each word in the candidate sentence;
and a first confusion degree calculating subunit, configured to take the average of the bert model output values of the words as the first confusion degree of the candidate sentence.
Preferably, the sub-confusion degree determining module comprises a second confusion degree calculating unit, the second confusion degree calculating unit calculating:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n) = ( ∏_{i=1}^{n} 1/p(w_i|w_1 ... w_{i-1}) )^(1/n)
where n is the sentence length, w_i is the i-th word or character, and p(w_n|w_1 ... w_{n-1}) takes the form p(w_i) for a 1-gram, p(w_i|w_{i-1}) for a 2-gram, and p(w_i|w_{i-2} w_{i-1}) for a 3-gram, wherein:
P(w_i|w_{i-2} w_{i-1}) = λ_3 p(w_i|w_{i-2} w_{i-1}) + λ_2 p(w_i|w_{i-1}) + λ_1 p(w_i) + λ_0 p;
wherein λ_3 + λ_2 + λ_1 + λ_0 = 1,
λ_i is a parameter to be determined and p = 1/R, where R is the number of entries appearing in the corpus. If the corresponding word does not exist in the model, a character-based ngram model is used and the calculation proceeds by a back-off method: the 3-gram is used first; if it does not exist, the 2-gram is used; and if the 2-gram does not exist, the 1-gram is used, the 1-gram being determined with an additive smoothing method.
Preferably, the weighting coefficient determining module comprises:
an initialization unit, configured to randomly assign an initial first weighting coefficient and an initial second weighting coefficient, wherein the sum of the first weighting coefficient and the second weighting coefficient is 1;
a bert model normalization unit, configured to normalize the bert model's calculation result for each word to the interval [0,1];
an ngram model normalization unit, configured to normalize, for the ngram model, the final processing result of the sentence to the interval [0,1];
and a gradient descent calculating unit, configured to apply the processing results to the total confusion degree calculation formula and solve for the optimal first weighting coefficient and second weighting coefficient by a gradient descent method.
A third aspect of the present application provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the statement correction method based on the bert model and the ngram model as described above.
A fourth aspect of the present application provides a readable storage medium storing a computer program which, when executed by a processor, implements the statement correction method based on the bert model and the ngram model as described above.
In the ngram-based part of the method, character-based and word-based interpolation and back-off methods are fused, making the smoothing more comprehensive. Bert's information leakage and its weak ability to restore continuous blank text are exactly what the ngram model can complement. The bert pre-training model is not suitable for incremental training, whereas the ngram-based method is statistical and can easily achieve incremental training. The bert model and the ngram model thus complement each other, improving the accuracy of statement correction.
Drawings
FIG. 1 is a flow chart of a preferred embodiment of the statement correction method based on the bert model and the ngram model.
Fig. 2 is a schematic structural diagram of a computer device suitable for implementing a terminal or a server according to an embodiment of the present application.
Detailed Description
In order to make the implementation objects, technical solutions and advantages of the present application clearer, the technical solutions in the embodiments of the present application will be described in more detail below with reference to the accompanying drawings in the embodiments of the present application. In the drawings, the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The described embodiments are some, but not all embodiments of the present application. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application, and should not be construed as limiting the present application. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application are within the scope of protection of the present application. Embodiments of the present application will be described in detail below with reference to the drawings.
A first aspect of the present application provides a statement correction method based on a bert model and an ngram model, as shown in fig. 1, which mainly comprises:
determining a plurality of candidate target words for replacing an erroneous word in the sentence, thereby forming a plurality of candidate sentences;
for each candidate sentence, calculating a first confusion degree of the sentence based on a preset bert model, and calculating a second confusion degree of the sentence based on a preset ngram model;
assigning a first weighting coefficient to the first confusion degree and a second weighting coefficient to the second confusion degree;
determining the total confusion degree of each candidate sentence by weighting the first confusion degree and the second confusion degree;
and determining a final sentence based on the total confusion degree. A sketch of this overall flow is given below.
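A minimal sketch of the flow, assuming the two per-sentence scorers are available as functions (all names here are illustrative):

```python
def correct_sentence(candidate_sentences, ppl_bert, ppl_ngram, w1, w2):
    """Return the candidate with the lowest weighted total confusion.
    ppl_bert / ppl_ngram: callables scoring one sentence each;
    w1, w2: the two weighting coefficients, with w1 + w2 == 1."""
    def total_confusion(sentence):
        return w1 * ppl_bert(sentence) + w2 * ppl_ngram(sentence)
    return min(candidate_sentences, key=total_confusion)
```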
In some alternative embodiments, calculating the first confusion degree of the sentence based on the preset bert model comprises: determining a bert model output value for each word in the candidate sentence, and taking the average of these output values as the first confusion degree of the candidate sentence.
In this embodiment, the formula for calculating the sentence ppl based on bert is:
ppl_bert(s) = (1/n) ∑_{i=1}^{n} p_i
where n is the number of words in the sentence and p_i is the bert model output value for the i-th word.
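A minimal sketch of this calculation with the HuggingFace transformers library, assuming p_i is taken as the MLM probability of the original token when its position is masked (the checkpoint name is an illustrative choice; one forward pass per position keeps the sketch simple but is not the fastest approach):

```python
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def bert_ppl(sentence):
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    scores = []
    for i in range(1, len(ids) - 1):          # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id   # mask one position at a time
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        probs = torch.softmax(logits, dim=-1)
        scores.append(probs[ids[i]].item())   # p_i of the original token
    return sum(scores) / len(scores)          # ppl_bert = mean of the p_i
```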
In some alternative embodiments, calculating the second confusion degree of the sentence based on the preset ngram model comprises:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n) = ( ∏_{i=1}^{n} 1/p(w_i|w_1 ... w_{i-1}) )^(1/n)
where n is the sentence length, w_i is the i-th word or character, and p(w_n|w_1 ... w_{n-1}) takes the form p(w_i) for a 1-gram, p(w_i|w_{i-1}) for a 2-gram, and p(w_i|w_{i-2} w_{i-1}) for a 3-gram, wherein:
P(w_i|w_{i-2} w_{i-1}) = λ_3 p(w_i|w_{i-2} w_{i-1}) + λ_2 p(w_i|w_{i-1}) + λ_1 p(w_i) + λ_0 p;
wherein λ_3 + λ_2 + λ_1 + λ_0 = 1,
λ_i is a parameter to be determined and p = 1/R, where R is the number of entries appearing in the corpus. If the corresponding word does not exist in the model, a character-based ngram model is used and the calculation proceeds by a back-off method: the 3-gram is used first; if it does not exist, the 2-gram is used; and if the 2-gram does not exist, the 1-gram is used, the 1-gram being determined with an additive smoothing method.
In a particular embodiment, to distinguish the two, p_ci denotes the word-based probability and p_zi the character-based probability, with p as the generic notation for both. The word-based probability calculation formula is given below; the character-based one has the same form:
P_ci(w_i|w_{i-2} w_{i-1}) = λ_3 p_ci(w_i|w_{i-2} w_{i-1}) + λ_2 p_ci(w_i|w_{i-1}) + λ_1 p_ci(w_i) + λ_0 p_ci
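A minimal sketch of the back-off order described above (3-gram, then 2-gram, then additively smoothed 1-gram); the count-dictionary names and the smoothing constant are illustrative assumptions, and the same function would be applied to character-level counts when a word is missing from the word-level model:

```python
def p_backoff(w, c2, c1, grams3, grams2, grams1, vocab_size, alpha=1.0):
    """Back-off probability of w after the context (c2, c1).
    grams3/grams2/grams1: dicts mapping n-gram tuples (or single
    words) to counts from the corpus."""
    if grams3.get((c2, c1, w)):
        return grams3[(c2, c1, w)] / grams2[(c2, c1)]   # 3-gram seen
    if grams2.get((c1, w)):
        return grams2[(c1, w)] / grams1[c1]             # fall back to 2-gram
    total = sum(grams1.values())                        # additive 1-gram:
    return (grams1.get(w, 0) + alpha) / (total + alpha * vocab_size)
```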
in some alternative embodiments, giving the first and second weighting coefficients comprises: randomly giving an initial first weighting coefficient and an initial second weighting coefficient, wherein the sum of the first weighting coefficient and the second weighting coefficient is 1; normalizing the calculation result of the bert model to each word to an interval [0,1 ]; for the ngram model, normalizing the final processing result of the sentence to an interval [0,1 ]; and solving the optimal first weighting coefficient and the optimal second weighting coefficient by adopting a gradient descent method.
In this embodiment, using either method alone is inferior to fusing them. For some sentence pairs that differ only in a wrong word, the bert model distinguishes the correct sentence less well than the ngram model; for others, for example distinguishing "All this makes me feel curious." from "All of which give me a curiosity.", the bert model performs better. The results of the two models are therefore fused. The final ppl is calculated by the formula ppl = λ·ppl_bert + (1-λ)·ppl_ngram, where λ is a parameter to be determined and ppl_bert, ppl_ngram are the calculation results based on the bert model and the ngram model, respectively.
Since the bert-based and ngram-based confusion degrees are calculated under different evaluation systems, the two results are normalized separately.
For the bert model, the result score of each word is normalized to the interval [0,1]. The formula used is:
p'(w_i) = p(w_i) / max_w
where p(w_i) is the score of the i-th word and max_w is the largest score in the sentence.
for the ngram model, the final result of sentence pair is normalized to the interval [0, 1%]The formula used is as follows,
Figure BDA0003367205730000071
To determine λ, a value is randomly initialized and then solved by a gradient descent method; that is, the processing results are applied to the total confusion degree calculation formula, and the optimal first weighting coefficient and second weighting coefficient are obtained by gradient descent, as in the sketch below.
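A minimal sketch of this fitting step; the patent does not spell out the training objective, so the labels and the squared-error loss below are illustrative assumptions:

```python
import torch

def fit_lambda(ppl_bert, ppl_ngram, labels, steps=200, lr=0.05):
    """ppl_bert, ppl_ngram: normalized scores per candidate, shape (N,);
    labels: 1.0 for correct candidates, 0.0 for wrong ones. Learns the
    single weight lambda of ppl = lambda*ppl_bert + (1-lambda)*ppl_ngram."""
    b, g, y = (torch.as_tensor(v, dtype=torch.float32)
               for v in (ppl_bert, ppl_ngram, labels))
    lam = torch.rand(1, requires_grad=True)        # random initialization
    opt = torch.optim.SGD([lam], lr=lr)
    for _ in range(steps):
        total = lam * b + (1 - lam) * g            # coefficients sum to 1
        loss = ((total - (1 - y)) ** 2).mean()     # correct -> low total ppl
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            lam.clamp_(0.0, 1.0)                   # keep lambda in [0, 1]
    return float(lam)
```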
A second aspect of the present application provides a statement correction apparatus based on a bert model and an ngram model, corresponding to the foregoing method, which mainly comprises:
a candidate sentence determining module, configured to determine a plurality of candidate target words for replacing an erroneous word in the sentence and form a plurality of candidate sentences;
a sub-confusion degree determining module, configured to calculate, for each candidate sentence, a first confusion degree based on a preset bert model and a second confusion degree based on a preset ngram model;
a weighting coefficient determining module, configured to assign a first weighting coefficient to the first confusion degree and a second weighting coefficient to the second confusion degree;
a total confusion degree determining module, configured to determine the total confusion degree of each candidate sentence by weighting the first confusion degree and the second confusion degree;
and a final sentence determining module, configured to determine a final sentence based on the total confusion degree.
In some optional embodiments, the sub-confusion degree determining module comprises a first confusion degree calculating unit, the first confusion degree calculating unit comprising:
a bert model output value determining subunit, configured to determine a bert model output value for each word in the candidate sentence;
and a first confusion degree calculating subunit, configured to take the average of the bert model output values of the words as the first confusion degree of the candidate sentence.
In some optional embodiments, the sub-confusion degree determining module comprises a second confusion degree calculating unit, the second confusion degree calculating unit calculating:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n) = ( ∏_{i=1}^{n} 1/p(w_i|w_1 ... w_{i-1}) )^(1/n)
where n is the sentence length, w_i is the i-th word or character, and p(w_n|w_1 ... w_{n-1}) takes the form p(w_i) for a 1-gram, p(w_i|w_{i-1}) for a 2-gram, and p(w_i|w_{i-2} w_{i-1}) for a 3-gram, wherein:
P(w_i|w_{i-2} w_{i-1}) = λ_3 p(w_i|w_{i-2} w_{i-1}) + λ_2 p(w_i|w_{i-1}) + λ_1 p(w_i) + λ_0 p;
wherein λ_3 + λ_2 + λ_1 + λ_0 = 1,
λ_i is a parameter to be determined and p = 1/R, where R is the number of entries appearing in the corpus. If the corresponding word does not exist in the model, a character-based ngram model is used and the calculation proceeds by a back-off method: the 3-gram is used first; if it does not exist, the 2-gram is used; and if the 2-gram does not exist, the 1-gram is used, the 1-gram being determined with an additive smoothing method.
In some optional embodiments, the weighting coefficient determining module comprises:
an initialization unit, configured to randomly assign an initial first weighting coefficient and an initial second weighting coefficient, wherein the sum of the first weighting coefficient and the second weighting coefficient is 1;
a bert model normalization unit, configured to normalize the bert model's calculation result for each word to the interval [0,1];
an ngram model normalization unit, configured to normalize, for the ngram model, the final processing result of the sentence to the interval [0,1];
and a gradient descent calculating unit, configured to apply the processing results to the total confusion degree calculation formula and solve for the optimal first weighting coefficient and second weighting coefficient by a gradient descent method.
In a third aspect of the present application, a computer device comprises a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the statement correction method based on the bert model and the ngram model described above.
In a fourth aspect of the present application, a readable storage medium stores a computer program which, when executed by a processor, implements the statement correction method based on the bert model and the ngram model described above. The computer-readable storage medium may be included in the apparatus described in the above embodiments, or it may exist separately without being assembled into the device. The computer-readable storage medium carries one or more programs which, when executed by the apparatus, process data in the manner described above.
Referring now to FIG. 2, there is shown a schematic block diagram of a computer device 400 suitable for use in implementing embodiments of the present application. The computer device shown in fig. 2 is only an example, and should not bring any limitation to the function and the scope of use of the embodiments of the present application.
As shown in fig. 2, the computer device 400 includes a central processing unit (CPU) 401 that can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores various programs and data necessary for the operation of the device 400. The CPU 401, the ROM 402, and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as necessary, so that a computer program read therefrom is installed into the storage section 408 as needed.
In particular, according to embodiments of the present application, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. The computer program performs the above-described functions defined in the method of the present application when executed by a Central Processing Unit (CPU) 401. It should be noted that the computer storage media of the present application can be computer readable signal media or computer readable storage media or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules or units described in the embodiments of the present application may be implemented by software or hardware. The modules or units described may also be provided in a processor, the names of which in some cases do not constitute a limitation of the module or unit itself.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A statement correction method based on a bert model and an ngram model, characterized by comprising:
determining a plurality of candidate target words for replacing an erroneous word in the sentence, thereby forming a plurality of candidate sentences;
for each candidate sentence, calculating a first confusion degree of the sentence based on a preset bert model, and calculating a second confusion degree of the sentence based on a preset ngram model;
assigning a first weighting coefficient to the first confusion degree and a second weighting coefficient to the second confusion degree;
determining the total confusion degree of each candidate sentence by weighting the first confusion degree and the second confusion degree;
and determining a final sentence based on the total confusion degree.
2. The statement correction method based on the bert model and the ngram model of claim 1, wherein calculating the first confusion degree of the sentence based on the preset bert model comprises:
determining a bert model output value for each word in the candidate sentence;
and taking the average of the bert model output values of the words as the first confusion degree of the candidate sentence.
3. The statement correction method based on the bert model and the ngram model of claim 1, wherein calculating the second confusion degree of the sentence based on the preset ngram model comprises:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n) = ( ∏_{i=1}^{n} 1/p(w_i|w_1 ... w_{i-1}) )^(1/n)
where n is the sentence length, w_i is the i-th word or character, and p(w_n|w_1 ... w_{n-1}) takes the form p(w_i) for a 1-gram, p(w_i|w_{i-1}) for a 2-gram, and p(w_i|w_{i-2} w_{i-1}) for a 3-gram, wherein:
P(w_i|w_{i-2} w_{i-1}) = λ_3 p(w_i|w_{i-2} w_{i-1}) + λ_2 p(w_i|w_{i-1}) + λ_1 p(w_i) + λ_0 p;
wherein λ_3 + λ_2 + λ_1 + λ_0 = 1,
λ_i is a parameter to be determined and p = 1/R, where R is the number of entries appearing in the corpus. If the corresponding word does not exist in the model, a character-based ngram model is used and the calculation proceeds by a back-off method: the 3-gram is used first; if it does not exist, the 2-gram is used; and if the 2-gram does not exist, the 1-gram is used, the 1-gram being determined with an additive smoothing method.
4. The statement correction method based on the bert model and the ngram model of claim 1, wherein assigning the first weighting coefficient and the second weighting coefficient comprises:
randomly assigning an initial first weighting coefficient and an initial second weighting coefficient, wherein the sum of the first weighting coefficient and the second weighting coefficient is 1;
normalizing the bert model's calculation result for each word to the interval [0,1];
for the ngram model, normalizing the final processing result of the sentence to the interval [0,1];
and applying the processing results to the total confusion degree calculation formula and solving for the optimal first weighting coefficient and second weighting coefficient by a gradient descent method.
5. A statement correction apparatus based on a bert model and an ngram model, characterized by comprising:
a candidate sentence determining module, configured to determine a plurality of candidate target words for replacing an erroneous word in the sentence and form a plurality of candidate sentences;
a sub-confusion degree determining module, configured to calculate, for each candidate sentence, a first confusion degree based on a preset bert model and a second confusion degree based on a preset ngram model;
a weighting coefficient determining module, configured to assign a first weighting coefficient to the first confusion degree and a second weighting coefficient to the second confusion degree;
a total confusion degree determining module, configured to determine the total confusion degree of each candidate sentence by weighting the first confusion degree and the second confusion degree;
and a final sentence determining module, configured to determine a final sentence based on the total confusion degree.
6. The statement correction apparatus based on the bert model and the ngram model of claim 5, wherein the sub-confusion degree determining module comprises a first confusion degree calculating unit, the first confusion degree calculating unit comprising:
a bert model output value determining subunit, configured to determine a bert model output value for each word in the candidate sentence;
and a first confusion degree calculating subunit, configured to take the average of the bert model output values of the words as the first confusion degree of the candidate sentence.
7. The statement correction apparatus based on the bert model and the ngram model of claim 5, wherein the sub-confusion degree determining module comprises a second confusion degree calculating unit, the second confusion degree calculating unit calculating:
ppl(s) = P(w_1 w_2 ... w_n)^(-1/n) = ( ∏_{i=1}^{n} 1/p(w_i|w_1 ... w_{i-1}) )^(1/n)
where n is the sentence length, w_i is the i-th word or character, and p(w_n|w_1 ... w_{n-1}) takes the form p(w_i) for a 1-gram, p(w_i|w_{i-1}) for a 2-gram, and p(w_i|w_{i-2} w_{i-1}) for a 3-gram, wherein:
P(w_i|w_{i-2} w_{i-1}) = λ_3 p(w_i|w_{i-2} w_{i-1}) + λ_2 p(w_i|w_{i-1}) + λ_1 p(w_i) + λ_0 p;
wherein λ_3 + λ_2 + λ_1 + λ_0 = 1,
λ_i is a parameter to be determined and p = 1/R, where R is the number of entries appearing in the corpus. If the corresponding word does not exist in the model, a character-based ngram model is used and the calculation proceeds by a back-off method: the 3-gram is used first; if it does not exist, the 2-gram is used; and if the 2-gram does not exist, the 1-gram is used, the 1-gram being determined with an additive smoothing method.
8. The statement correction apparatus based on the bert model and the ngram model of claim 5, wherein the weighting coefficient determining module comprises:
an initialization unit, configured to randomly assign an initial first weighting coefficient and an initial second weighting coefficient, wherein the sum of the first weighting coefficient and the second weighting coefficient is 1;
a bert model normalization unit, configured to normalize the bert model's calculation result for each word to the interval [0,1];
an ngram model normalization unit, configured to normalize, for the ngram model, the final processing result of the sentence to the interval [0,1];
and a gradient descent calculating unit, configured to apply the processing results to the total confusion degree calculation formula and solve for the optimal first weighting coefficient and second weighting coefficient by a gradient descent method.
9. A computer device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the processor executes the computer program to implement the statement correction method based on the bert model and the ngram model according to any one of claims 1 to 4.
10. A readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the statement correction method based on the bert model and the ngram model according to any one of claims 1 to 4.
CN202111386417.2A 2021-11-22 2021-11-22 Statement correction method and device based on bert model and ngram model Pending CN114282523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111386417.2A CN114282523A (en) 2021-11-22 2021-11-22 Statement correction method and device based on bert model and ngram model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111386417.2A CN114282523A (en) 2021-11-22 2021-11-22 Statement correction method and device based on bert model and ngram model

Publications (1)

Publication Number Publication Date
CN114282523A true CN114282523A (en) 2022-04-05

Family

ID=80869565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111386417.2A Pending CN114282523A (en) 2021-11-22 2021-11-22 Statement correction method and device based on bert model and ngram model

Country Status (1)

Country Link
CN (1) CN114282523A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495910A (en) * 2022-04-07 2022-05-13 联通(广东)产业互联网有限公司 Text error correction method, system, device and storage medium
WO2023193542A1 (en) * 2022-04-07 2023-10-12 联通(广东)产业互联网有限公司 Text error correction method and system, and device and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination