WO2021208727A1 - Text error detection method and apparatus based on artificial intelligence, and computer device - Google Patents

Text error detection method and apparatus based on artificial intelligence, and computer device

Info

Publication number
WO2021208727A1
Authority
WO
WIPO (PCT)
Prior art keywords
detection
text
information
model
error
Prior art date
Application number
PCT/CN2021/083936
Other languages
French (fr)
Chinese (zh)
Inventor
回艳菲
王健宗
吴天博
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021208727A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to an artificial intelligence-based text error detection method, device, and computer equipment.
  • Chinese has always been regarded as one of the most difficult languages in the world to learn. In its long history of use, it has accumulated many differences from other languages. For example, unlike English, Chinese has neither singular or plural changes nor tense changes of verbs. The expression is more flexible, the grammatical structure is loose, there are more short sentences and fewer clauses. There are usually many ways to express the same meaning in Chinese.
  • When a company handles business, voice information may be converted to text by Automatic Speech Recognition (ASR), and the resulting text information may contain text errors. The company may also receive text that the user enters through a keyboard, touch screen, or other device.
  • Text entered by the user may contain near-phonetic character errors (the pronunciation is the same but the wrong character is entered). Such text errors hinder reading comprehension and can even have a significant impact while the company handles business for the customer, causing it large losses.
  • the embodiments of the present application provide an artificial intelligence-based text error detection method, device, computer equipment, and storage medium, aiming to solve the problem of low detection efficiency and detection accuracy of Chinese text errors in the prior art methods.
  • an embodiment of the present application provides an artificial intelligence-based text error detection method, which includes:
  • an artificial intelligence-based text error detection device which includes:
  • the detection model configuration unit is configured to receive model configuration information input by the user, and configure parameter values in the initialized detection model according to the model configuration information to obtain multiple error detection models;
  • the detection model training unit is configured to train multiple error detection models according to a preset conversion dictionary and a pre-stored training corpus database to obtain multiple error detection models after training;
  • the model detection information acquisition unit is configured to, if the to-be-detected text input by the user is received, input the to-be-detected text into the multiple error detection models to obtain multiple corresponding pieces of model detection information;
  • the model detection information screening unit is configured to screen the multiple pieces of model detection information to obtain screening detection data that meets preset screening conditions;
  • the integrated processing unit is configured to perform integrated processing on the screening detection data to obtain a text detection result that matches the text to be detected.
  • an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, it implements the artificial intelligence-based text error detection method described in the first aspect above.
  • an embodiment of the present application also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the processor executes the artificial intelligence-based text error detection method described in the first aspect above.
  • The embodiments of the application provide an artificial intelligence-based text error detection method, device, computer equipment, and storage medium.
  • Multiple error detection models are constructed according to the model configuration information and the initialized detection model and are trained separately. The text to be detected is input into the trained error detection models to obtain multiple pieces of model detection information; this information is screened to obtain screening detection data that meets the preset screening conditions, and the screening detection data is integrated to obtain the text detection result.
  • With this method, multiple error detection models produce multiple pieces of model detection information for the text to be detected, and the model detection information is screened and integrated to obtain the text detection result. Compared with fixed template matching, this can greatly improve the efficiency and accuracy of error detection for Chinese text.
  • FIG. 1 is a schematic flowchart of an artificial intelligence-based text error detection method provided by an embodiment of the application.
  • FIG. 2 is a schematic diagram of a sub-flow of an artificial intelligence-based text error detection method provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of the application.
  • FIG. 4 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of the application.
  • FIG. 6 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of the application.
  • FIG. 7 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of the application.
  • FIG. 8 is a schematic block diagram of an artificial intelligence-based text error detection device provided by an embodiment of the application.
  • FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the application.
  • FIG. 1 is a schematic flowchart of an artificial intelligence-based text error detection method provided by an embodiment of the present application.
  • The artificial intelligence-based text error detection method is applied to a user terminal and is executed by application software on that terminal.
  • The user terminal is a terminal device used to perform error detection on the to-be-detected text input by the user, such as a desktop computer, a notebook computer, a tablet computer, or a mobile phone.
  • The user terminal can also be an enterprise server built by an enterprise. As shown in FIG. 1, the method includes steps S110 to S150.
  • S110 Receive model configuration information input by the user, and configure parameter values in the initialized detection model according to the model configuration information to obtain multiple error detection models.
  • the user can input model configuration information to configure the parameter values in the initialized detection model to obtain multiple error detection models.
  • The initialized detection model includes a long short-term memory (LSTM) network, a weight layer, and a state transition matrix.
  • The model configuration information can include the number of models to configure, weight layer configuration information, and transition matrix configuration information.
  • The state transition matrix in the initialized detection model is configured through the transition matrix configuration information, the weight layer is configured through the weight layer configuration information, and multiple error detection models are created according to the configured number of models.
  • Each error detection model includes a long short-term memory network, a configured weight layer, and a configured state transition matrix.
  • The long short-term memory network processes the input text information to obtain the memory network output information; the weight layer weights that output to obtain the weighted memory network output information; and the state transition matrix applies state transition processing to the weighted output to obtain the model detection information. The text detection result corresponding to the text information can be obtained by analyzing the model detection information.
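  • As a minimal sketch of this configuration step, and not the applicant's implementation, the Python below creates several models from one piece of configuration information. The class name `ErrorDetectionModel` and the keys `model_count`, `weight_layer`, and `transition_matrix` are hypothetical, and the LSTM network itself is left out.

```python
from dataclasses import dataclass


@dataclass
class ErrorDetectionModel:
    # Hypothetical container: the LSTM network is omitted from this sketch.
    weights: list      # weight layer, one value per output dimension
    transition: list   # state transition matrix (list of rows)


def build_models(config):
    """Create `model_count` models, each with its own copy of the parameters."""
    models = []
    for _ in range(config["model_count"]):
        models.append(ErrorDetectionModel(
            weights=list(config["weight_layer"]),
            transition=[list(row) for row in config["transition_matrix"]],
        ))
    return models


config = {
    "model_count": 3,
    "weight_layer": [1.0, 1.0],
    "transition_matrix": [[0.0, 0.0], [0.0, 0.0]],
}
models = build_models(config)
```

  • Each model receives an independent copy of the configured parameters, so that later training on separate training corpus sets can update them independently.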
  • S120 Training a plurality of said error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database to obtain a plurality of error detection models after training.
  • Before multiple error detection models are used to detect errors in Chinese text, each of them needs to be trained. Specifically, the error detection models can be trained according to the conversion dictionary and the training corpus database. Pre-stored loss function calculation formulas and gradient calculation formulas are also used during training.
  • The training corpus database contains multiple training corpora. Each training corpus consists of corpus information and target detection information.
  • The corpus information is the text of the corpus, and the target detection information is the correct detection information for that corpus information; it can be obtained by manual judgment of the corpus information and added accordingly.
  • step S120 includes sub-steps S121, S122, and S123.
  • The training corpora in the training corpus database are randomly allocated according to the number of error detection models, giving one training corpus set per model. The training corpora contained in one training corpus set are used to complete the training of one error detection model.
  • For example, if the number of error detection models is 10 and the training corpus database contains 2000 training corpora, random allocation yields 10 training corpus sets, each containing 200 training corpora.
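  • The random allocation can be sketched as follows. This is an illustration under stated assumptions: the function name `allocate` and the `seed` parameter are hypothetical, and any remainder left when the corpus count is not divisible by the model count is simply dropped, which the text does not specify.

```python
import random


def allocate(corpora, model_count, seed=None):
    """Shuffle the corpora and cut them into `model_count` equal-sized sets."""
    rng = random.Random(seed)
    shuffled = list(corpora)
    rng.shuffle(shuffled)
    size = len(shuffled) // model_count  # assumption: the remainder is discarded
    return [shuffled[k * size:(k + 1) * size] for k in range(model_count)]


# 2000 training corpora, 10 models: 10 sets of 200 corpora each
corpus_sets = allocate(range(2000), 10, seed=0)
```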
  • S122 Convert the corpus information in the multiple training corpus sets according to the conversion dictionary to obtain a corpus code corresponding to each piece of corpus information.
  • The conversion dictionary matches each character to a corresponding feature code, which is a 1×M-dimensional vector.
  • The characters contained in each piece of corpus information in a training corpus set are converted according to the conversion dictionary, and the feature codes of the characters are combined to obtain the corresponding corpus code.
  • The corpus code represents the features of the corpus information as a vector. Its size is (N, M), meaning N rows (one per character) and M columns.
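  • A small sketch of the dictionary conversion is given below. The dictionary contents, the value M = 3, and the function name `encode` are made up for illustration; how characters missing from the dictionary are handled is not described in the text, so this sketch lets the lookup fail.

```python
def encode(text, conversion_dict):
    """Look up each character's 1 x M feature code and stack them into an N x M code."""
    return [list(conversion_dict[ch]) for ch in text]


# Toy conversion dictionary with M = 3 (values are illustrative only)
conversion_dict = {
    "你": [0.1, 0.0, 0.3],
    "好": [0.7, 0.2, 0.0],
}
corpus_code = encode("你好", conversion_dict)  # N = 2 rows, M = 3 columns
```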
  • Each corpus code in a training corpus set corresponds to a piece of preset target detection information.
  • The error detection model is iteratively trained on the corpus codes in one training corpus set and the target detection information corresponding to each corpus code, using the pre-stored loss function calculation formula and gradient calculation formula. Training stops once all corpus codes and target detection information in the training corpus set have been used, which yields a trained error detection model.
  • step S123 includes sub-steps S1231, S1232, S1233, and S1234.
  • A corpus code is an N×M-dimensional vector.
  • The cell memory information is updated as $C(t) = C(t-1) \odot f(t) + i(t) \odot a(t)$.
  • Here C is the cell memory information accumulated over the calculation, C(t) is the cell memory information output by the current cell, and C(t-1) is the cell memory information output by the previous cell.
  • $\odot$ is the vector operator: computing $C(t-1) \odot f(t)$ multiplies the value in each dimension of the vector C(t-1) by f(t), and the resulting vector has the same dimension as C(t-1).
  • Each cell produces output information h(t). The output information of the N cells is combined to obtain the memory network output information S of a corpus code, which is an N×M-dimensional vector.
  • The number of weight values in the weight layer is equal to M. The memory network output information of a corpus code is multiplied by the weight layer (each 1×M-dimensional vector in the memory network output information is multiplied by the weight layer) to obtain the weighted memory network output information P.
  • The weighted memory network output information P and the state transition matrix A together form the training detection information.
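  • The cell memory update and the weight layer can be illustrated in a few lines of Python. This sketch assumes that f(t), i(t), and a(t) are vectors of the same length as C(t-1) and that the products are element-wise; the function names are hypothetical, and the gates themselves are taken as given inputs rather than computed.

```python
def update_cell_memory(c_prev, f_t, i_t, a_t):
    """C(t) = C(t-1) * f(t) + i(t) * a(t), computed per dimension."""
    return [c * f + i * a for c, f, i, a in zip(c_prev, f_t, i_t, a_t)]


def apply_weight_layer(S, weights):
    """Multiply every 1 x M row of the memory network output S by the M weight values."""
    return [[h * w for h, w in zip(row, weights)] for row in S]


c_t = update_cell_memory([1.0, 2.0], [0.5, 0.5], [1.0, 0.0], [0.2, 0.9])
P = apply_weight_layer([[1.0, 2.0], [3.0, 4.0]], [10.0, 0.5])
```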
  • S1232 Calculate the loss value between the training detection information and the target detection information encoded by the corpus according to the pre-stored loss function calculation formula.
  • the loss function calculation formula can be expressed by formula (1):
  • where L is the calculated loss value, S(X, Y) is the score of the target detection information, S(X, Y') is the score of the training detection information, X is the input corpus code, Y is the error type label contained in the target detection information, Y' is the error type label contained in the training detection information, and Y_X is the set of all possible error type labels.
  • the score can be calculated using formula (2);
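  • Formulas (1) and (2) are not reproduced in this text. One common loss that is consistent with the symbol definitions above is the negative log-likelihood used with conditional random fields, in which a label sequence is scored by summing per-character scores from P and label-to-label scores from A. The sketch below rests on that assumption and is not a statement of the application's actual formulas; it enumerates all label sequences by brute force, which is only practical for tiny inputs.

```python
import math
from itertools import product


def path_score(P, A, labels):
    """Assumed score: per-character scores from P plus transition scores from A."""
    emit = sum(P[i][y] for i, y in enumerate(labels))
    trans = sum(A[a][b] for a, b in zip(labels, labels[1:]))
    return emit + trans


def nll_loss(P, A, target):
    """Assumed loss: log of the summed exp-scores of all label paths minus the target score."""
    n, m = len(P), len(P[0])
    scores = [path_score(P, A, y) for y in product(range(m), repeat=n)]
    top = max(scores)  # subtract the maximum for numerical stability
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_total - path_score(P, A, target)


P_demo = [[1.0, 0.0], [0.0, 1.0]]   # 2 characters, 2 labels
A_demo = [[0.0, 0.0], [0.0, 0.0]]   # no transition preference
loss = nll_loss(P_demo, A_demo, (0, 1))
```

  • Under this assumed form the loss is never negative, and it shrinks as the target label sequence scores higher relative to the alternatives.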
  • S1233 Calculate the updated value of the transition matrix in the error detection model according to the pre-stored gradient calculation formula, the loss value and the calculated value of the training detection information, and update the parameter value of the transition matrix.
  • The updated value of each parameter in the transition matrix is calculated from the gradient calculation formula, the calculated loss value, and the calculated value of the training detection information, and the parameter values in the transition matrix are updated with it.
  • This update is the training of the error detection model, that is, a gradient descent calculation.
  • the gradient calculation formula can be expressed by formula (3):
  • the error detection model can be trained for multiple iterations according to the above process.
  • The text to be detected can be input into each error detection model for calculation to obtain the model detection information of each error detection model.
  • Inputting the text to be detected into an error detection model yields the output information of that model.
  • The output information of the model includes the weighted memory network output information and the state transition matrix. Applying the state transition matrix to the weighted memory network output information gives the model detection information.
  • step S130 includes sub-steps S131 and S132.
  • The text code corresponding to the text to be detected can be obtained through the conversion dictionary described above.
  • The text to be detected is a sentence of Chinese text information.
  • The text code is an N×M-dimensional vector obtained by converting the text to be detected.
  • The text code is input into each of the multiple error detection models for calculation. The calculation is the same as for corpus codes and is not repeated here.
  • Adding the state transition matrix of an error detection model to that model's weighted memory network output information performs the state transition on that output and yields the corresponding model detection information.
  • The model detection information calculated by an error detection model is an N×M-dimensional vector, where N is the total number of characters and M is the number of error type labels. It gives the score value of each character in the text to be detected for each error type label.
  • Screening detection data that meets the preset screening conditions can be selected from the model detection information.
  • The screening conditions include a detection data screening ratio and a model screening ratio, and the screening detection data includes multiple groups of detection data information.
  • step S140 includes sub-steps S141, S142, and S143.
  • The comprehensive detection score of each piece of model detection information is calculated by adding up all the score values it contains, and the pieces of model detection information are sorted by comprehensive detection score.
  • The pieces of model detection information with the highest comprehensive detection scores are retained as multiple groups of candidate model detection information. For example, if the model screening ratio is 80% and there are 10 pieces of model detection information, the two with the lowest comprehensive detection scores are eliminated and the remaining 8 become the candidate model detection information.
  • Each group of candidate model detection information is screened according to the detection data screening ratio, and its top-ranked detection data is taken as the detection data information of that group. For example, if a group of candidate model detection information contains N×M score values and the detection data screening ratio is 40%, the highest 40% of the N×M score values are retained and used as the detection data information of that candidate model detection information.
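  • The two-stage screening can be sketched as below. The function name `screen`, the representation of a group as a dictionary from (character index, label index) to score, and the use of truncation when a ratio does not give a whole number are assumptions made for this illustration.

```python
def screen(model_outputs, model_ratio=0.8, data_ratio=0.4):
    """Keep the top-scoring models, then the top-scoring cells inside each kept model."""
    # Comprehensive detection score = sum of all score values of one N x M output
    ranked = sorted(model_outputs, key=lambda m: sum(map(sum, m)), reverse=True)
    keep_models = int(len(ranked) * model_ratio)
    groups = []
    for scores in ranked[:keep_models]:
        cells = [(s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)]
        cells.sort(reverse=True)
        keep_cells = int(len(cells) * data_ratio)
        groups.append({(i, j): s for s, i, j in cells[:keep_cells]})
    return groups


# 10 toy model outputs, each 2 characters x 5 labels
outputs = [[[k + 0.1 * j for j in range(5)] for _ in range(2)] for k in range(10)]
groups = screen(outputs)
```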
  • The screening detection data can be integrated to obtain a text detection result that matches the text to be detected.
  • The screening detection data contains multiple groups of detection data information. The text error position and text error type can be determined for each group, and the text error positions and text error types of the groups are then integrated to obtain a unified text error position and text error type as the text detection result of the text to be detected.
  • the obtained text detection results can be uploaded to the blockchain for storage.
  • the corresponding summary information may be obtained based on the text detection result.
  • the summary information is obtained by hashing the text detection result, for example with the SHA-256 algorithm.
  • Uploading summary information to the blockchain can ensure its security and fairness and transparency to users.
  • the user can download the summary information from the blockchain through the user terminal to verify whether the text detection result has been tampered with.
  • the blockchain referred to in this example is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
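  • The summary step described above can be illustrated with Python's standard `hashlib` module. Serializing the result as sorted-key JSON, the field names in the example, and the function names are assumptions of this sketch; uploading to and downloading from a blockchain is outside its scope.

```python
import hashlib
import json


def summarize(result):
    """Return the SHA-256 hex digest of a text detection result."""
    payload = json.dumps(result, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def is_untampered(result, stored_digest):
    """Compare a freshly computed digest with the one retrieved from storage."""
    return summarize(result) == stored_digest


detection_result = {"positions": [5, 6], "types": {"5": "near-phonetic"}}
digest = summarize(detection_result)
```

  • Serializing with sorted keys makes the digest independent of the order in which fields were added, so the same result always produces the same summary.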
  • step S150 includes sub-steps S151, S152, S153, S154, S155, and S156.
  • S151 Determine the text error location and text error type of each group of detection data information according to each group of detection data information in the screening detection data.
  • The multiple score values of a character in a group of detection data information are obtained, and the error type label with the highest of those score values is taken as the text error type of that character in the group. If a character has no score value in the detection data information, the character is considered to contain no error.
  • The text error position of the group is then determined from the text error types of its characters: the positions in the text to be detected of all characters that have a text error type are taken as the text error position of the group of detection data information.
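  • A short sketch of step S151 follows. It assumes that a group of detection data information is stored as a dictionary from (character index, label index) to a retained score; the label names and the function name `locate_errors` are invented for the example.

```python
def locate_errors(group, labels):
    """Pick the highest-scoring label per character; characters with no score have no error."""
    best = {}
    for (char_idx, label_idx), score in group.items():
        if char_idx not in best or score > best[char_idx][0]:
            best[char_idx] = (score, labels[label_idx])
    error_types = {idx: label for idx, (_, label) in best.items()}
    error_positions = sorted(error_types)
    return error_positions, error_types


labels = ["near-phonetic", "word-order"]          # hypothetical error type labels
group = {(4, 1): 0.9, (4, 0): 0.3, (5, 0): 0.7}   # retained scores of one group
positions, types = locate_errors(group, labels)
```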
  • For example, if the text error position of one group of detection data information is character 5 to character 9 and that of another group is character 6 to character 10, the text error positions of the two groups are not the same.
  • The text error position of a group of detection data information can be obtained, and the judgment rule is used to judge whether the text error position splits a phrase. If it does, the text error position is determined to contain a word segmentation error.
  • For example, suppose the text error position of a group of detection data information is character 5 to character 9. The 4th, 5th, and 6th characters of the text to be detected are obtained, and the judgment rule checks whether the probability that the 4th and 5th characters form a phrase is greater than the probability that the 5th and 6th characters form a phrase. If it is greater, the text error position splits a phrase. If it is not, the 8th, 9th, and 10th characters of the text to be detected are obtained and checked in the same way. Text error positions that contain a word segmentation error are deleted, and the remaining text error positions are judged again to see whether they are all the same.
  • S155 If all the text error positions are the same, determine whether all the text error types are the same.
  • S156 If all the text error types are the same, output the text error location and the text error type as the text detection result.
  • If the text error positions are all the same, it is determined whether the multiple text error types corresponding to each character in the text error position are all the same. If they are, the currently obtained text error position and text error type can be output as the text detection result. If the text error types are not all the same, a prompt that the text detection result cannot be obtained can be sent to the user.
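  • The integration logic of these steps can be outlined as follows. The function name `integrate`, the representation of each group as a (positions, types) pair, and the caller-supplied check `has_segmentation_error` are assumptions. Returning `None` stands for the case where no text detection result can be obtained; treating positions that still disagree after deletion as such a case is also an assumption, since the text does not cover it.

```python
def integrate(groups, has_segmentation_error):
    """Return the agreed (positions, types) pair, or None when the groups disagree."""
    def all_same(values):
        return all(v == values[0] for v in values[1:])

    remaining = list(groups)
    if not remaining:
        return None
    if not all_same([pos for pos, _ in remaining]):
        # Delete groups whose error position contains a word segmentation error, then recheck
        remaining = [g for g in remaining if not has_segmentation_error(g[0])]
        if not remaining or not all_same([pos for pos, _ in remaining]):
            return None
    if not all_same([types for _, types in remaining]):
        return None
    return remaining[0]


g1 = ([5, 6, 7, 8, 9], {5: "near-phonetic"})
g2 = ([6, 7, 8, 9, 10], {6: "near-phonetic"})
g3 = ([5, 6, 7, 8, 9], {5: "near-phonetic"})
result = integrate([g1, g2, g3], lambda positions: positions[0] == 6)
```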
  • steps S1521 and S1522 are further included before step S152.
  • S1521 Match the text error position and text error type of each group of detection data information against preset grammar templates; S1522 Remove the text error positions and text error types that match a grammar template.
  • The text error position and text error type of each group of detection data information can also be matched against the preset grammar templates. If the text error position and text error type of a group of detection data information match any grammar template, the text error position and text error type of that group are removed.
  • The preset grammar templates include grammatical rules that the error detection model cannot detect. Hundreds of grammar templates can be pre-configured, and each grammar template is checked in turn to judge whether it matches the text error position and text error type of a group of detection data information.
  • The technical methods in this application can be applied to smart government affairs, smart city management, smart communities, smart security, smart logistics, smart healthcare, smart education, smart environmental protection, smart transportation, and other application scenarios that include error detection of Chinese text, thereby promoting the construction of smart cities.
  • Multiple error detection models are constructed according to the model configuration information and the initialized detection model and are trained separately. The text to be detected is input into the trained error detection models to obtain multiple pieces of model detection information; this information is screened to obtain screening detection data that meets the preset screening conditions, and the screening detection data is integrated to obtain the text detection result.
  • With this method, multiple error detection models produce multiple pieces of model detection information for the text to be detected, and the model detection information is screened and integrated to obtain the text detection result. Compared with fixed template matching, this can greatly improve the efficiency and accuracy of error detection for Chinese text.
  • the embodiment of the present application also provides an artificial intelligence-based text error detection device, which is used to execute any embodiment of the aforementioned artificial intelligence-based text error detection method.
  • FIG. 8 is a schematic block diagram of a text error detection apparatus provided by an embodiment of the present application.
  • the artificial intelligence-based text error detection device can be configured in a user terminal.
  • the text error detection device 100 based on artificial intelligence includes a detection model configuration unit 110, a detection model training unit 120, a model detection information acquisition unit 130, a model detection information screening unit 140 and an integrated processing unit 150.
  • the detection model configuration unit 110 is configured to receive model configuration information input by a user, and configure parameter values in the initialized detection model according to the model configuration information to obtain multiple error detection models.
  • the detection model training unit 120 is configured to train multiple error detection models according to a preset conversion dictionary and a pre-stored training corpus database to obtain multiple error detection models after training.
  • the detection model training unit 120 includes sub-units: a training corpus acquisition unit, a corpus code acquisition unit, and an iterative training unit.
  • the training corpus acquisition unit is used to randomly allocate the training corpora of the training corpus database according to the number of error detection models to obtain the same number of training corpus sets;
  • the corpus code acquisition unit is used to convert the corpus information in the multiple training corpus sets according to the conversion dictionary to obtain a corpus code corresponding to each piece of corpus information;
  • the iterative training unit is used to iteratively train the error detection model corresponding to each training corpus set on that set to obtain a trained error detection model for each training corpus set, wherein each corpus code in a training corpus set corresponds to a piece of preset target detection information.
  • the iterative training unit includes subunits: a training detection information acquisition unit, a loss value calculation unit, a transition matrix parameter update unit, and a repeat unit.
  • the training detection information acquisition unit is used to input a corpus code in the training corpus set into the error detection model to obtain training detection information corresponding to the corpus code;
  • the loss value calculation unit is used to calculate, according to the pre-stored loss function calculation formula, the loss value between the training detection information and the target detection information of the corpus code;
  • the transition matrix parameter update unit is used to calculate the updated value of the transition matrix in the error detection model according to the pre-stored gradient calculation formula, the loss value, and the calculated value of the training detection information, and to update the parameter values of the transition matrix;
  • the repeat unit is used to obtain the next corpus code in the training corpus set, input it into the error detection model, and repeat the above steps until all the corpus codes included in the training corpus set have been used for training.
  • the model detection information acquisition unit 130 is configured to, if the to-be-detected text input by the user is received, input the to-be-detected text into the multiple error detection models to obtain multiple corresponding pieces of model detection information.
  • the model detection information acquisition unit 130 includes sub-units: a text code acquisition unit and a text code calculation unit.
  • the text code acquisition unit is configured to convert the text to be detected into a corresponding text code according to the conversion dictionary; the text code calculation unit is configured to input the text code into each of the multiple error detection models for calculation to obtain the model detection information output by each error detection model.
  • the model detection information screening unit 140 is configured to obtain, from the multiple pieces of model detection information, screening detection data that satisfies the preset screening conditions.
  • the model detection information screening unit 140 includes subunits: a model detection information ranking unit, a model detection information intercepting unit, and a detection data screening unit.
  • the model checking information sorting unit is used to obtain the comprehensive detection score of each model checking information, and to sort a plurality of model checking information according to the comprehensive detection score;
  • the model checking information intercepting unit is used to intercept the selection ratio according to the model
  • the multiple model detection information ranked at the top are used as multiple sets of candidate model detection information;
  • the detection data screening unit is used to screen each set of candidate model detection information according to the detection data screening ratio, and obtain each The plurality of detection data ranked higher in the group of candidate model detection information is used as a group of the detection data information.
  • the integrated processing unit 150 is configured to perform integrated processing on the screening detection data to obtain a text detection result that matches the text to be detected.
  • the integrated processing unit 150 includes sub-units: an error location type determination unit, a text error location judgment unit, a word segmentation error judgment unit, a deletion unit, a text error type judgment unit, and a text detection result acquisition unit.
  • the error location type determination unit is configured to determine the text error location and text error type of each group of detection data information in the screening detection data; the text error location judgment unit is configured to judge whether all the text error locations are the same; the word segmentation error judgment unit is configured to, if the text error locations are not all the same, determine according to preset judgment rules which text error locations contain word segmentation errors; the deletion unit is configured to delete the text error locations that contain word segmentation errors and return to the step of judging whether all the text error locations are the same; the text error type judgment unit is configured to, if all the text error locations are the same, judge whether all the text error types are the same; the text detection result acquisition unit is configured to, if all the text error types are the same, output the text error location and the text error type as the text detection result.
  • the integrated processing unit 150 further includes sub-units: a grammar template matching unit and a rejection unit.
  • the grammar template matching unit is configured to match the text error location and text error type of each group of detection data information against a preset grammar template; the rejection unit is configured to eliminate the text error locations and text error types that match the grammar template.
  • the artificial intelligence-based text error detection device applies the above-mentioned artificial intelligence-based text error detection method: it constructs multiple error detection models from the model configuration information and the initialized detection model; trains the error detection models separately; inputs the text to be detected into the trained error detection models to obtain multiple pieces of model detection information; screens the model detection information to obtain screening detection data that satisfies the preset screening conditions; and performs integrated processing on the screening detection data to obtain the text detection result.
  • by constructing multiple error detection models to obtain multiple pieces of model detection information for the text to be detected, and by screening and integrating that information to obtain the text detection result, the efficiency and accuracy of error detection for Chinese text can be greatly improved compared with fixed template matching.
  • the above text error detection device can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 9.
  • FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device may be a user terminal for executing an artificial intelligence-based text error detection method to perform error detection on Chinese text.
  • the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
  • the non-volatile storage medium 503 can store an operating system 5031 and a computer program 5032.
  • the processor 502 can execute a text error detection method based on artificial intelligence.
  • the processor 502 is used to provide calculation and control capabilities, and support the operation of the entire computer device 500.
  • the internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503; when the computer program 5032 is executed by the processor 502, it causes the processor 502 to execute the artificial intelligence-based text error detection method.
  • the network interface 505 is used for network communication, such as providing data information transmission.
  • the structure shown in FIG. 9 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 500 to which the solution of the present application is applied.
  • the specific computer device 500 may include more or fewer components than shown in the figure, or combine certain components, or have a different component arrangement.
  • the processor 502 is configured to run a computer program 5032 stored in a memory to implement the corresponding function in the above-mentioned artificial intelligence-based text error detection method.
  • the embodiment of the computer device shown in FIG. 9 does not constitute a limitation on the specific configuration of the computer device.
  • the computer device may include more or fewer components than those shown in the figure, combine certain components, or arrange the components differently.
  • the computer device may only include a memory and a processor. In such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 9 and will not be repeated here.
  • the processor 502 may be a central processing unit (Central Processing Unit, CPU), and the processor 502 may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor.
  • a computer-readable storage medium may be a non-volatile computer-readable storage medium.
  • the computer-readable storage medium stores a computer program, where the computer program implements the steps included in the above-mentioned artificial intelligence-based text error detection method when the computer program is executed by the processor.
  • the disclosed equipment, device, and method may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division into units is only a division by logical function. In an actual implementation there may be other ways of dividing them, and units with the same function may be combined into one unit; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments of the present application.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the computer-readable storage medium stores several instructions that enable a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned computer-readable storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
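The screening performed by unit 140 and the consistency check performed by unit 150, both described in the list above, can be illustrated with a minimal Python sketch. It is not the patent's implementation. The dictionary layout of a piece of model detection information, the function names `screen` and `integrate`, the use of `ceil` to turn a ratio into a count, the choice of each group's top-ranked entry as that group's error location and type, and the `is_segmentation_error` predicate standing in for the preset judgment rules are all assumptions made for illustration. The source does not say what happens when the error types disagree or when no location can be deleted, so the sketch returns `None` in those cases.

```python
import math


def screen(model_infos, model_ratio, data_ratio):
    """Keep the top-scoring models, then the top-scoring detection data in each.

    model_infos: list of {"score": float, "data": [{"pos", "type", "score"}, ...]}
    (a hypothetical layout, not taken from the source).
    """
    ranked = sorted(model_infos, key=lambda m: m["score"], reverse=True)
    kept = ranked[: max(1, math.ceil(len(ranked) * model_ratio))]
    groups = []
    for info in kept:
        data = sorted(info["data"], key=lambda d: d["score"], reverse=True)
        groups.append(data[: max(1, math.ceil(len(data) * data_ratio))])
    return groups


def integrate(candidates, is_segmentation_error):
    """candidates: one (location, error_type) pair per group of detection data."""
    remaining = list(candidates)
    while remaining:
        locations = {loc for loc, _ in remaining}
        if len(locations) == 1:
            types = {typ for _, typ in remaining}
            # Same location everywhere: accept only if the types also agree.
            return remaining[0] if len(types) == 1 else None
        # Locations differ: delete those flagged as word segmentation errors.
        filtered = [c for c in remaining if not is_segmentation_error(c[0])]
        if len(filtered) == len(remaining):
            return None  # nothing was deleted, so re-checking cannot converge
        remaining = filtered
    return None


models = [
    {"score": 0.9, "data": [{"pos": 3, "type": "substitution", "score": 0.8},
                            {"pos": 7, "type": "insertion", "score": 0.2}]},
    {"score": 0.8, "data": [{"pos": 3, "type": "substitution", "score": 0.7}]},
    {"score": 0.4, "data": [{"pos": 5, "type": "insertion", "score": 0.6}]},
    {"score": 0.1, "data": []},
]
groups = screen(models, model_ratio=0.5, data_ratio=0.5)
candidates = [(g[0]["pos"], g[0]["type"]) for g in groups if g]
result = integrate(candidates, is_segmentation_error=lambda loc: False)
print(result)  # (3, 'substitution')
```

In this toy run the two highest-scoring models agree on both location and type, so that pair is returned as the text detection result.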

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text error detection method and apparatus based on artificial intelligence, and a computer device, which can be applied to an application scenario of text error detection in a smart city. The method comprises: according to model configuration information and an initialized detection model, performing construction to obtain multiple error detection models (S110); respectively training the multiple error detection models (S120); inputting text to be subjected to detection into the multiple trained error detection models, so as to acquire multiple pieces of model detection information (S130); screening the model detection information to obtain screened detection data that meets a condition (S140); and performing integration processing on the screened detection data to obtain a text checking result (S150). The method is based on intelligent decision-making technology, belongs to the field of artificial intelligence, and relates to blockchain technology. A text checking result can be uploaded into a blockchain. Multiple error detection models are constructed to respectively acquire multiple pieces of model detection information, and screening and integration processing are performed on the model detection information to obtain a text checking result, so that the efficiency and accuracy of performing error detection on Chinese text can be greatly improved.

Description

Artificial intelligence-based text error detection method, apparatus, and computer device
This application claims priority to Chinese patent application No. 202011329034.7, filed with the China Patent Office on November 24, 2020 and entitled "Artificial intelligence-based text error detection method, apparatus, and computer device", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence technology, and in particular to an artificial intelligence-based text error detection method, apparatus, and computer device.
Background
Chinese has long been regarded as one of the most difficult languages in the world to learn. Over its long history of use it has accumulated many features that distinguish it from other languages. For example, unlike English, Chinese has neither singular and plural inflection nor verb tense inflection. Its expression is more flexible and its grammatical structure looser, with more short sentences and fewer subordinate clauses, and there are usually several ways to express the same meaning. In daily operation, an enterprise uses automatic speech recognition (ASR) to convert speech into text, and the resulting text may contain errors. The enterprise may also receive text containing near-homophone errors (the pronunciation is the same but the wrong character was entered) that users typed with a keyboard, touch screen, or similar device. Text errors hinder reading comprehension and can even seriously affect the enterprise's handling of customer business, causing the enterprise heavy losses.
However, the inventors found that conventional methods all detect errors in Chinese text by matching against fixed templates. Such an approach requires building a huge library of matching templates and checking the text against a massive number of them, which makes detection inefficient. Moreover, because Chinese expression is flexible, the same text can be judged in completely opposite ways in different contexts, so this approach struggles to produce accurate error detection results. Existing Chinese text error detection methods therefore suffer from low detection efficiency and low detection accuracy.
Summary
The embodiments of this application provide an artificial intelligence-based text error detection method, apparatus, computer device, and storage medium, which aim to solve the low efficiency and low accuracy of Chinese text error detection in existing methods.
In a first aspect, an embodiment of this application provides an artificial intelligence-based text error detection method, comprising:
receiving model configuration information input by a user, and configuring parameter values in an initialized detection model according to the model configuration information to obtain a plurality of error detection models;
training the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database to obtain a plurality of trained error detection models;
when text to be detected input by the user is received, inputting the text to be detected into each of the error detection models to obtain a corresponding plurality of pieces of model detection information;
screening the plurality of pieces of model detection information to obtain screening detection data that satisfies preset screening conditions;
performing integrated processing on the screening detection data to obtain a text detection result that matches the text to be detected.
In a second aspect, an embodiment of this application provides an artificial intelligence-based text error detection apparatus, comprising:
a detection model configuration unit, configured to receive model configuration information input by a user and configure parameter values in an initialized detection model according to the model configuration information to obtain a plurality of error detection models;
a detection model training unit, configured to train the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database to obtain a plurality of trained error detection models;
a model detection information acquisition unit, configured to, when text to be detected input by the user is received, input the text to be detected into each of the error detection models to obtain a corresponding plurality of pieces of model detection information;
a model detection information screening unit, configured to screen the plurality of pieces of model detection information to obtain screening detection data that satisfies preset screening conditions;
an integrated processing unit, configured to perform integrated processing on the screening detection data to obtain a text detection result that matches the text to be detected.
In a third aspect, an embodiment of this application further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the artificial intelligence-based text error detection method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of this application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the artificial intelligence-based text error detection method of the first aspect.
The embodiments of this application provide an artificial intelligence-based text error detection method, apparatus, computer device, and storage medium. Multiple error detection models are constructed from the model configuration information and the initialized detection model; the error detection models are trained separately; the text to be detected is input into the trained error detection models to obtain multiple pieces of model detection information; the model detection information is screened to obtain screening detection data that satisfies the preset screening conditions; and the screening detection data undergoes integrated processing to obtain the text detection result. By constructing multiple error detection models to obtain multiple pieces of model detection information for the text to be detected, and by screening and integrating that information to obtain the text detection result, the efficiency and accuracy of error detection for Chinese text can be greatly improved compared with fixed template matching.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the artificial intelligence-based text error detection method provided by an embodiment of this application;
FIG. 2 is a schematic diagram of a sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of this application;
FIG. 3 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of this application;
FIG. 4 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of this application;
FIG. 5 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of this application;
FIG. 6 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of this application;
FIG. 7 is a schematic diagram of another sub-flow of the artificial intelligence-based text error detection method provided by an embodiment of this application;
FIG. 8 is a schematic block diagram of the artificial intelligence-based text error detection apparatus provided by an embodiment of this application;
FIG. 9 is a schematic block diagram of the computer device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application without creative effort fall within the protection scope of this application.
It should be understood that, when used in this specification and the appended claims, the terms "comprise" and "include" indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are only for the purpose of describing specific embodiments and are not intended to limit this application. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
Refer to FIG. 1, which is a schematic flowchart of the artificial intelligence-based text error detection method provided by an embodiment of this application. The method is applied in a user terminal and is executed by application software installed on the user terminal. The user terminal is a terminal device used to perform error detection on text to be detected that the user inputs, such as a desktop computer, notebook computer, tablet computer, or mobile phone; the user terminal may also be an enterprise server built by an enterprise. As shown in FIG. 1, the method includes steps S110 to S150.
S110. Receive model configuration information input by the user, and configure parameter values in the initialized detection model according to the model configuration information to obtain multiple error detection models.
Specifically, the user can input model configuration information to configure the parameter values in the initialized detection model and thereby obtain multiple error detection models. The initialized detection model includes a long short-term memory (LSTM) network, a weight layer, and a state transition matrix. The model configuration information may include the number of models to configure, weight layer configuration information, and transition matrix configuration information. The state transition matrix in the initialized detection model is configured with the transition matrix configuration information, the weight layer is configured with the weight layer configuration information, and multiple error detection models are created according to the number of models to configure. Each error detection model therefore includes one LSTM network, one configured weight layer, and one configured state transition matrix.
The LSTM network computes memory network output information from the input text information; the weight layer applies a weighted calculation to the memory network output information to obtain weighted memory network output information; and the state transition matrix performs state transition processing on the weighted memory network output information to obtain model detection information. Analyzing the model detection information yields the text detection result corresponding to the text information.
S120. Train the multiple error detection models respectively according to the preset conversion dictionary and the pre-stored training corpus database to obtain multiple trained error detection models.
Before the error detection models are used to detect errors in Chinese text, each of them must be trained. Specifically, the error detection models can be trained according to the conversion dictionary and the training corpus database, and the training process also uses a pre-stored loss function calculation formula and gradient calculation formula. The training corpus database contains multiple training corpus entries, and each entry consists of corpus information and target detection information. The corpus information is the text of the entry, and the target detection information is the correct detection information for that text, which can be determined manually from the corpus information and added to the entry.
In an embodiment, as shown in FIG. 2, step S120 includes sub-steps S121, S122, and S123.
S121. Randomly allocate the training corpus entries in the training corpus database according to the number of error detection models to obtain the same number of training corpus sets.
The number of error detection models is determined, and the training corpus entries in the training corpus database are randomly allocated according to that number to obtain multiple training corpus sets. The entries in one training corpus set are used to train one error detection model.
For example, if there are 10 error detection models and the training corpus database contains 2000 training corpus entries, randomly allocating the entries yields 10 training corpus sets, each containing 200 entries.
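The random allocation in S121 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_corpus` and the `seed` parameter are hypothetical, and the sketch assumes the sets are disjoint and of near-equal size, which matches the 2000-entry, 10-model example but is not stated explicitly in the source.

```python
import random


def split_corpus(corpus, num_models, seed=None):
    """Randomly partition `corpus` into `num_models` sets of near-equal size."""
    rng = random.Random(seed)
    shuffled = list(corpus)
    rng.shuffle(shuffled)
    # Striding deals the shuffled entries out in turn, so set sizes differ
    # by at most one and every entry lands in exactly one set.
    return [shuffled[k::num_models] for k in range(num_models)]


corpus = [f"entry-{n}" for n in range(2000)]
corpus_sets = split_corpus(corpus, num_models=10, seed=0)
print(len(corpus_sets), len(corpus_sets[0]))  # 10 200
```

Each resulting set would then be paired with one error detection model for training.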
S122. Convert the corpus information in the training corpus sets according to the conversion dictionary to obtain one corpus code for each piece of corpus information.
Each character can be matched to a corresponding feature code in the conversion dictionary, and a feature code is a 1×M vector. The characters in each piece of corpus information in a training corpus set are therefore converted according to the conversion dictionary, and the feature codes of the characters are combined to obtain the corresponding corpus code, which represents the features of the corpus information in vector form. The size of a corpus code is (N, M), that is, N rows by M columns, so the length of a corpus code is N (for example, N = 39), and its values are filled in from the feature codes of the corpus information. If the corpus information contains more than N characters, the first N characters are kept and converted into N 1×M vectors; if it contains fewer than N characters, the missing characters are padded with "0" as the feature code, again giving N 1×M vectors.
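The truncate-or-pad conversion described in S122 can be sketched as follows. The function name `encode`, the toy dictionary, and the small values of N and M are assumptions chosen for illustration; mapping a character that is missing from the dictionary to an all-zero feature code is also an assumption, since the source states that every character has an entry.

```python
def encode(text, dictionary, n, m):
    """Return an n-row, m-column corpus code for `text`."""
    zero = [0.0] * m
    # Keep at most the first n characters and look up each feature code.
    rows = [list(dictionary.get(ch, zero)) for ch in text[:n]]
    # Pad short inputs with all-zero feature codes up to n rows.
    rows += [list(zero) for _ in range(n - len(rows))]
    return rows


# Toy conversion dictionary with M = 3 (hypothetical values).
toy_dictionary = {"我": [1.0, 0.0, 0.0], "们": [0.0, 1.0, 0.0]}

short_code = encode("我们", toy_dictionary, n=4, m=3)
long_code = encode("我们我们我", toy_dictionary, n=4, m=3)
print(short_code)
# [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(len(long_code))  # 4
```

Fixing the number of rows at N means every corpus code fed to a model has the same shape regardless of sentence length.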
S123: Iteratively train, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set, to obtain one trained error detection model corresponding to each training corpus set, wherein each corpus code in a training corpus set corresponds to one piece of preset target detection information.
The error detection model is iteratively trained according to the corpus codes in one training corpus set and the target detection information corresponding to each corpus code, in combination with the pre-stored loss function calculation formula and gradient calculation formula. Training stops once all corpus codes and target detection information in the training corpus set have been used for model training, which yields one trained error detection model.
In one embodiment, as shown in FIG. 3, step S123 includes sub-steps S1231, S1232, S1233, and S1234.
S1231: Input one corpus code from one of the training corpus sets into the error detection model to obtain training detection information corresponding to the corpus code.
Specifically, one corpus code is an N×M-dimensional vector, and calculating the memory network output information of a corpus code may include the following four steps.
① Calculate the forget gate output information: $f(t)=\sigma(W_f \times h(t-1)+U_f \times X(t)+b_f)$, where $f(t)$ is the forget gate parameter value, $0 \le f(t) \le 1$; $\sigma$ denotes the activation function, which may be expressed as $f(x)=(e^{x}-e^{-x})/(e^{x}+e^{-x})$, so that taking the result of $W_f \times h(t-1)+U_f \times X(t)+b_f$ as $x$ and inputting it into the activation function $\sigma$ gives $f(t)$; $W_f$, $U_f$ and $b_f$ are parameter values of the formula in the current cell; $h(t-1)$ is the output gate information of the previous cell; $X(t)$ is the 1×M-dimensional vector of the corpus code that is input into the current cell. If the current cell is the first cell in the long short-term memory network, $h(t-1)$ is zero.
② Calculate the input gate information: $i(t)=\sigma(W_i \times h(t-1)+U_i \times X(t)+b_i)$; $a(t)=\tanh(W_a \times h(t-1)+U_a \times X(t)+b_a)$, where $i(t)$ is the input gate parameter value, $0 \le i(t) \le 1$; $W_i$, $U_i$, $b_i$, $W_a$, $U_a$ and $b_a$ are parameter values of the formula in the current cell; $a(t)$ is the calculated input gate vector value and is a 1×M-dimensional vector.
③ Update the cell memory information: $C(t)=C(t-1) \odot f(t)+i(t) \odot a(t)$, where $C$ is the cell memory information accumulated over the calculation steps, $C(t)$ is the cell memory information output by the current cell, and $C(t-1)$ is the cell memory information output by the previous cell. $\odot$ is a vector operator: $C(t-1) \odot f(t)$ is computed by multiplying each dimension value of the vector $C(t-1)$ by $f(t)$, and the resulting vector has the same dimensions as $C(t-1)$.
④ Calculate the output gate information: $o(t)=\sigma(W_o \times h(t-1)+U_o \times X(t)+b_o)$; $h(t)=o(t) \odot \tanh(C(t))$, where $o(t)$ is the output gate parameter value, $0 \le o(t) \le 1$; $W_o$, $U_o$ and $b_o$ are parameter values of the formula in the current cell; $h(t)$ is the output gate information of the current cell and is a 1×M-dimensional vector.
Each cell yields one piece of output information $h(t)$, and combining the output information of the N cells gives the memory network output information $S$ of one corpus code, which is an N×M-dimensional vector. The number of weight values contained in the weight layer equals M. Multiplying the memory network output information of a corpus code by the weight layer (each 1×M-dimensional vector in the memory network output information is multiplied by the weight layer) gives the weighted memory network output information $P$, where $P_{ij}$ is the emission score of the i-th character in the corpus information corresponding to the corpus code for the j-th error type label, $i \in [1,N]$ and $i$ is a positive integer, $j \in [1,M]$ and $j$ is a positive integer, and M is the number of error type labels for the characters in the corpus information corresponding to the corpus code. For example, with $i=1$ and $j=1$, $P_{11}$ is the emission score of the first character for the substitution-error label; with $j=2$, $P_{12}$ is the emission score of the first character for the insertion-error label. The error detection model also includes a state transition matrix $A$ of size M×M, where $A_{rt}$ is the transition score from the r-th error type label to the t-th error type label, $r \in [1,M]$, $t \in [1,M]$; if $r=t$, then $A_{rt}=0$. The weighted memory network output information $P$ and the state transition matrix $A$ together constitute the training detection information.
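The four cell steps can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the patented implementation. The expression written for σ in the text is the hyperbolic tangent form, whose range is (−1, 1); so that gate values fall in [0, 1] as the text requires, the sketch assumes the logistic sigmoid instead. It also assumes that all gates are M-dimensional vectors applied element-wise, and the identity and zero parameter values are placeholders. The helper names `matvec`, `gate`, `lstm_cell` and `run_lstm` are hypothetical.

```python
import math

def sigmoid(x):
    # Assumption: logistic sigmoid, so that gate values lie in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def matvec(w, v):
    """Multiply matrix w (list of rows) by vector v."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def gate(w, u, b, h_prev, x, act):
    """Compute act(W x h(t-1) + U x X(t) + b) element-wise."""
    wh, ux = matvec(w, h_prev), matvec(u, x)
    return [act(p + q + r) for p, q, r in zip(wh, ux, b)]

def lstm_cell(params, h_prev, c_prev, x):
    f = gate(*params["f"], h_prev, x, sigmoid)      # step 1: forget gate
    i = gate(*params["i"], h_prev, x, sigmoid)      # step 2: input gate
    a = gate(*params["a"], h_prev, x, math.tanh)    # step 2: candidate values
    c = [cp * ft + it * at for cp, ft, it, at in zip(c_prev, f, i, a)]  # step 3
    o = gate(*params["o"], h_prev, x, sigmoid)      # step 4: output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c

def run_lstm(params, rows):
    """Run the cells over the N rows of a code; h and C start at zero."""
    m = len(rows[0])
    h, c = [0.0] * m, [0.0] * m
    outputs = []
    for x in rows:
        h, c = lstm_cell(params, h, c, x)
        outputs.append(h)
    return outputs

# Placeholder parameters (hypothetical): identity W and U, zero bias, M = 2.
M = 2
eye = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0] * M
params = {k: (eye, eye, zeros) for k in ("f", "i", "a", "o")}
S = run_lstm(params, [[0.5, -0.5], [0.0, 1.0], [0.0, 0.0]])
```

The list `S` plays the role of the N×M memory network output information; the weight layer and the transition matrix are outside the scope of this sketch.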
S1232: Calculate the loss value between the training detection information and the target detection information of the corpus code according to the pre-stored loss function calculation formula.
Specifically, the loss function calculation formula can be expressed by formula (1):
Figure PCTCN2021083936-appb-000001
where L is the calculated loss value, S(X, Y) is the score of the target detection information, S(X, Y') is the score of the training detection information, X is the input corpus code, Y is the error type labels contained in the target detection information, Y' is the error type labels contained in the training detection information, and Y_X is all possible error type labels. Specifically, the score can be calculated using formula (2):
Figure PCTCN2021083936-appb-000002
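Formulas (1) and (2) are published only as images, so their exact form is not available here. The sketch below assumes the form commonly used for conditional-random-field sequence labelling, which is consistent with the symbols defined above: the score of a label sequence is the sum of its emission scores from P plus the transition scores from A between adjacent labels, and the loss is the log of the summed exponentiated scores over all possible label sequences minus the score of the target sequence. The function names and the toy values of P and A are hypothetical, and the brute-force enumeration is only practical for very small N and M.

```python
import itertools
import math

def path_score(P, A, labels):
    """Assumed form of S(X, Y): emission scores plus adjacent-label transition scores."""
    emission = sum(P[i][y] for i, y in enumerate(labels))
    transition = sum(A[a][b] for a, b in zip(labels, labels[1:]))
    return emission + transition

def crf_loss(P, A, target):
    """Assumed form of L: log-sum-exp over all label sequences minus the target score."""
    m = len(A)
    scores = [path_score(P, A, y) for y in itertools.product(range(m), repeat=len(P))]
    top = max(scores)  # subtract the maximum for numerical stability
    log_total = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_total - path_score(P, A, target)

# Toy example: N = 3 characters, M = 2 error type labels.
P = [[2.0, 0.1], [0.3, 1.5], [1.0, 0.2]]
A = [[0.0, 0.5], [0.4, 0.0]]   # diagonal is zero, as stated for A_rt when r = t
loss = crf_loss(P, A, target=[0, 1, 0])
```

Under this assumed form the loss is always positive and shrinks as the target sequence comes to dominate the other candidate sequences.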
S1233: Calculate the update value of the transition matrix in the error detection model according to the pre-stored gradient calculation formula, the loss value and the calculated value of the training detection information, and update the parameter values of the transition matrix.
The update value of the transition matrix is calculated from the gradient calculation formula, the calculated loss value and the calculated value of the training detection information, and the parameter values in the transition matrix are updated with it. This process of training the error detection model is a gradient descent calculation.
Specifically, the gradient calculation formula can be expressed by formula (3):
Figure PCTCN2021083936-appb-000003
where the quantity shown in Figure PCTCN2021083936-appb-000004 is the calculated update value of a given transition score, ω_t is the original parameter value of that transition score, γ is the learning rate preset in the gradient calculation formula, and the quantity shown in Figure PCTCN2021083936-appb-000005 is the partial derivative with respect to that transition score, based on the loss value and the calculated value corresponding to the transition score (the difference between the emission scores of two adjacent error type labels in the calculated value of the training detection information).
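Because formula (3) is available only as an image, the following is an assumed reconstruction from the symbol definitions above, written as the standard gradient descent update. The notation ω_{t+1} for the updated value is a guess at what the image shows; the worked numbers are purely illustrative.

```latex
% Assumed form of formula (3): standard gradient descent on one transition score.
\omega_{t+1} = \omega_t - \gamma \, \frac{\partial L}{\partial \omega_t}

% Illustrative values: \omega_t = 0.50,\ \gamma = 0.1,\ \partial L / \partial \omega_t = 0.2
\omega_{t+1} = 0.50 - 0.1 \times 0.2 = 0.48
```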
S1234: Obtain the next corpus code in the training corpus set, input it into the error detection model, and repeat the above steps until all corpus codes contained in the training corpus set have been used for training.
Updating the transition matrix in the error detection model once constitutes one round of training of the error detection model; the error detection model can be iteratively trained multiple times according to the above process.
S130: If text to be detected input by a user is received, input the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information.
If text to be detected input by the user is received, it can be input into each error detection model for calculation, and the model detection information of each error detection model is obtained accordingly. Specifically, inputting the text to be detected into one error detection model yields the output information of that model, which includes the weighted memory network output information and the state transition matrix; the weighted memory network output information is state-transferred according to the state transition matrix to obtain the model detection information.
In one embodiment, as shown in FIG. 4, step S130 includes sub-steps S131 and S132.
S131: Convert the text to be detected into a corresponding text code according to the conversion dictionary. S132: Input the text code into each of the plurality of error detection models for calculation to obtain the model detection information output by each error detection model.
Specifically, the text code corresponding to the text to be detected can be obtained through the above conversion dictionary. The text to be detected is one sentence of Chinese text, and the text code is the N×M-dimensional vector obtained by converting it. The text code is input into each of the plurality of error detection models for calculation; the calculation process is the same as that for a corpus code and is not repeated here. Accumulating the state transition matrix of an error detection model onto the weighted memory network output information of that model performs the state transfer of the weighted memory network output information and yields the corresponding model detection information. The model detection information calculated by one error detection model is an N×M-dimensional vector, where N is the total number of characters and M is the number of error type labels; it gives the score of each character in the text to be detected for each error type label.
S140: Screen the plurality of pieces of model detection information to obtain screened detection data that satisfies a preset screening condition.
When analyzing the plurality of pieces of model detection information, the data in the model detection information first needs to be screened, and screened detection data satisfying the preset screening condition is obtained from the model detection information. Specifically, the screening condition includes a detection data screening ratio and a model screening ratio, and the screened detection data contains multiple groups of detection data information.
In one embodiment, as shown in FIG. 5, step S140 includes sub-steps S141, S142, and S143.
S141: Obtain a comprehensive detection score for each piece of model detection information, and sort the plurality of pieces of model detection information according to the comprehensive detection scores.
The comprehensive detection score of each piece of model detection information is calculated by summing all the score values in that piece of model detection information, and the pieces of model detection information are sorted according to their comprehensive detection scores.
S142: Take the top-ranked pieces of model detection information according to the model screening ratio as multiple groups of candidate model detection information.
The pieces of model detection information with the highest comprehensive detection scores are taken according to the model screening ratio to obtain multiple groups of candidate model detection information. For example, if the model screening ratio is 80% and there are 10 pieces of model detection information, the two with the lowest comprehensive detection scores are removed, leaving 8 pieces of candidate model detection information.
S143: Screen each group of candidate model detection information according to the detection data screening ratio, and take the top-ranked detection data in each group of candidate model detection information as one group of detection data information.
Each group of candidate model detection information is screened according to the detection data screening ratio, and the top-ranked detection data are taken as the detection data information of that group. For example, if a group of candidate model detection information contains N×M score values and the detection data screening ratio is 40%, the top 40% of the N×M score values are retained and serve as the detection data information of that group of candidate model detection information.
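Sub-steps S141 to S143 can be sketched as follows. The function name `screen`, the rounding of ratio-derived counts with `round`, and the representation of a group of detection data information as a dictionary keyed by (character index, label index) are assumptions; the text does not say how fractional counts or ties are handled.

```python
def screen(model_outputs, model_ratio, data_ratio):
    """Keep the top-scoring models, then the top-scoring entries within each kept model."""
    # S141: comprehensive detection score = sum of all score values; sort descending.
    ranked = sorted(model_outputs, key=lambda s: sum(map(sum, s)), reverse=True)
    # S142: keep the top share of models (rounding rule is an assumption).
    kept_models = ranked[:round(len(ranked) * model_ratio)]
    screened = []
    for scores in kept_models:
        # S143: flatten to (value, char index, label index) and keep the top share.
        entries = [(v, i, j) for i, row in enumerate(scores) for j, v in enumerate(row)]
        entries.sort(key=lambda e: e[0], reverse=True)
        keep = round(len(entries) * data_ratio)
        screened.append({(i, j): v for v, i, j in entries[:keep]})
    return screened

# Ten toy model outputs (hypothetical), each with N = 2 characters and M = 5 labels.
outputs = [
    [[k + 0.1 * j for j in range(5)], [k - 0.1 * j for j in range(5)]]
    for k in range(10)
]
screened = screen(outputs, model_ratio=0.8, data_ratio=0.4)
```

With these ratios, 8 of the 10 model outputs survive, and each surviving output keeps its 4 highest score values out of 10.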
S150: Perform integration processing on the screened detection data to obtain a text detection result that matches the text to be detected.
Integration processing can be performed on the screened detection data to obtain a text detection result that matches the text to be detected. Specifically, the screened detection data contains multiple groups of detection data information; the corresponding text error position and text error type can be determined from each group of detection data information, and the text error positions and text error types of the multiple groups are integrated to obtain a unified text error position and text error type as the text detection result of the text to be detected.
In addition, the obtained text detection result can be uploaded to a blockchain for storage. Specifically, corresponding digest information can be obtained from the text detection result by hashing it, for example with the sha256s algorithm. Uploading the digest information to the blockchain ensures its security and its fairness and transparency to users. A user can download the digest information from the blockchain through a user terminal to verify whether the text detection result has been tampered with. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
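The digest step can be sketched with the Python standard library. The sketch assumes that the "sha256s" algorithm named above refers to SHA-256, and that the text detection result is serialized as JSON with sorted keys before hashing; the field names in `result` are hypothetical, and the upload to a blockchain is not shown.

```python
import hashlib
import json

def digest(result):
    """Hash a text detection result into a hex digest (SHA-256 assumed)."""
    # Sorted keys give a stable serialization, so equal results hash identically.
    payload = json.dumps(result, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Hypothetical text detection result.
result = {"error_positions": [5, 6, 7, 8, 9], "error_type": "substitution"}
stored = digest(result)          # digest information that would be uploaded

# Verification: recompute the digest and compare it with the stored one.
tampered = dict(result, error_type="insertion")
unchanged = digest(result) == stored
altered = digest(tampered) != stored
```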
In one embodiment, as shown in FIG. 6, step S150 includes sub-steps S151, S152, S153, S154, S155, and S156.
S151: Determine the text error position and text error type of each group of detection data information according to each group of detection data information in the screened detection data.
Specifically, the multiple score values of one character in a group of detection data information are obtained, and the error type label corresponding to the highest of those score values is taken as the text error type for that group of detection data information. If a character has no score value in the detection data information, the character is considered to contain no error. The text error position corresponding to the group of detection data information is determined from the text error types of the characters: the positions in the text to be detected of all characters matching the text error type are taken as the text error position of that group of detection data information.
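The per-character selection in S151 can be sketched as follows. The function name `locate_errors`, the label names, and the dictionary representation of a group of detection data information (keys are (character index, label index) pairs, values are retained scores) are assumptions made for illustration.

```python
def locate_errors(detection_data, labels):
    """Map each character that has retained scores to its highest-scoring error type."""
    best = {}
    for (i, j), score in detection_data.items():
        if i not in best or score > best[i][0]:
            best[i] = (score, j)
    # Characters with no retained score are absent and treated as error-free.
    return {i: labels[j] for i, (_, j) in sorted(best.items())}

labels = ["substitution", "insertion"]   # hypothetical label names
group = {(4, 0): 0.9, (4, 1): 0.2, (5, 0): 0.7, (7, 1): 0.6}
errors = locate_errors(group, labels)
positions = list(errors)                 # text error positions for this group
```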
S152: Determine whether all the text error positions are the same.
It is determined whether the text error positions of all groups of detection data information are the same. For example, if the text error position of one group of detection data information is character 5 to character 9 and that of another group is character 6 to character 10, the text error positions of the two groups are not the same.
S153: If the text error positions are not all the same, determine according to a preset judgment rule whether any of the text error positions contains a word segmentation error. S154: Delete the text error positions that contain a word segmentation error and return to the step of determining whether all the text error positions are the same.
Specifically, the text error position of a group of detection data information is obtained, and the judgment rule is used to determine whether that text error position contains a split phrase; if it does, the text error position is determined to contain a word segmentation error.
For example, suppose the text error position of a group of detection data information is character 5 to character 9. The 4th, 5th and 6th characters of the text to be detected are obtained, and the judgment rule is used to determine whether the probability that the 4th and 5th characters form a phrase is greater than the probability that the 5th and 6th characters form a phrase. If it is greater, the text error position contains a split phrase; if not, the 8th, 9th and 10th characters of the text to be detected are obtained in the same way and the check for a split phrase continues. The text error positions that contain a word segmentation error are deleted, and it is determined again whether the remaining text error positions are all the same.
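The boundary check in this example can be sketched as follows. The judgment rule itself is not specified, so the sketch assumes a lookup table `word_prob` giving the probability that two adjacent characters form a phrase. The left-boundary comparison follows the example directly; the right-boundary comparison (characters 9 and 10 against characters 8 and 9) is an assumed mirror image of it. Function names and the toy text are hypothetical.

```python
def pair_prob(text, a, b, word_prob):
    """Probability that the characters at 1-based positions a and b form a phrase."""
    if a < 1 or b > len(text):
        return 0.0
    return word_prob.get(text[a - 1] + text[b - 1], 0.0)

def has_split_word(text, start, end, word_prob):
    """Check whether the flagged span start..end (1-based, inclusive) cuts through a phrase."""
    # Left boundary: does (start-1, start) bind more strongly than (start, start+1)?
    left = pair_prob(text, start - 1, start, word_prob) > pair_prob(text, start, start + 1, word_prob)
    # Right boundary (assumed mirror): does (end, end+1) bind more strongly than (end-1, end)?
    right = pair_prob(text, end, end + 1, word_prob) > pair_prob(text, end - 1, end, word_prob)
    return left or right

text = "abcdefghijkl"                        # toy stand-in for a sentence
split = has_split_word(text, 5, 9, {"de": 0.9, "ef": 0.1})
intact = has_split_word(text, 5, 9, {"ef": 0.8, "hi": 0.7})
```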
S155: If all the text error positions are the same, determine whether all the text error types are the same. S156: If all the text error types are the same, output the text error position and the text error type as the text detection result.
If the text error positions are all the same, it is determined whether the multiple text error types corresponding to each character in the text error position are all the same. If they are, the current text error position and text error type can be output as the text detection result. If the text error types are not all the same, prompt information indicating that the text detection result cannot be obtained can be sent to notify the user.
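The loop formed by S152 to S156 can be sketched as follows. Each group is assumed to be a dictionary from character position to error type, and `contains_split_word` is a caller-supplied predicate standing in for the preset judgment rule. Returning `None` when no result can be reached, including the case where positions disagree but no group is removed, is an assumption; the text describes a user prompt only for the case of differing error types.

```python
def integrate(groups, contains_split_word):
    """Return the agreed {position: error type} mapping, or None if no agreement is reached."""
    groups = list(groups)
    while groups:
        positions = [sorted(g) for g in groups]
        if all(p == positions[0] for p in positions):        # S152
            if all(g == groups[0] for g in groups):          # S155
                return groups[0]                             # S156
            return None   # error types differ: prompt the user instead
        # S153/S154: drop groups whose positions contain a word segmentation error.
        remaining = [g for g in groups if not contains_split_word(sorted(g))]
        if len(remaining) == len(groups):
            return None   # assumption: stop if nothing can be removed
        groups = remaining
    return None

g1 = {5: "substitution", 6: "substitution"}
g2 = {5: "substitution", 6: "substitution"}
g3 = {6: "substitution", 7: "substitution"}
# Hypothetical rule: spans starting at character 6 cut through a phrase.
result = integrate([g1, g2, g3], lambda pos: pos[0] == 6)
conflict = integrate([g1, {5: "insertion", 6: "insertion"}], lambda pos: False)
```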
In one embodiment, as shown in FIG. 7, steps S1521 and S1522 are further included before step S152.
S1521: Match preset grammar templates against the text error position and text error type of each group of detection data information. S1522: Remove the text error positions and text error types that match a grammar template.
Specifically, the text error position and text error type of each group of detection data information can also be matched against preset grammar templates; if the text error position and text error type of a group of detection data information match any grammar template, the text error position and text error type of that group are removed.
Specifically, the preset grammar templates contain grammar rules that the error detection model cannot detect. Hundreds of grammar templates can be configured in advance, and each grammar template is checked in turn against the text error position and text error type of a group of detection data information.
For example, in "高兴地玩耍" (playing happily), "高兴" is an adjective and "玩耍" is a verb, and the two need to be joined by "地". If the text error position and text error type of a group of detection data information indicate that "高兴地玩耍" contains a grammatical error, they match the grammar template "X地D" (where X denotes an adjective and D denotes a verb), and the text error position and text error type of that group of detection data information are removed.
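Template filtering of this kind can be sketched as follows. The sketch assumes that each flagged span has already been segmented into tokens and that a small part-of-speech lexicon is available; neither step is described in the text, and the lexicon `POS`, the function names and the second flagged span are hypothetical.

```python
# Hypothetical part-of-speech lexicon: X = adjective, D = verb.
POS = {"高兴": "X", "玩耍": "D"}

def matches_template(tokens, template, pos=POS):
    """A slot matches a token literally (e.g. "地") or by part-of-speech tag."""
    if len(tokens) != len(template):
        return False
    return all(t == slot or pos.get(t) == slot for t, slot in zip(tokens, template))

def drop_template_matches(flagged, templates):
    """Remove flagged spans that match any grammar template."""
    return [tokens for tokens in flagged
            if not any(matches_template(tokens, t) for t in templates)]

templates = [["X", "地", "D"]]
flagged = [["高兴", "地", "玩耍"], ["高兴", "的", "玩耍"]]
kept = drop_template_matches(flagged, templates)
```

Only the first span matches "X地D" and is removed, so it no longer counts as a detected error; the second span stays in the list.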
The technical method in this application can be applied to application scenarios that involve error detection on Chinese text, such as smart government affairs, smart city management, smart communities, smart security, smart logistics, smart healthcare, smart education, smart environmental protection and smart transportation, thereby promoting the construction of smart cities.
In the artificial intelligence-based text error detection method provided by the embodiments of this application, multiple error detection models are constructed according to model configuration information and an initialized detection model; the multiple error detection models are trained separately; the text to be detected is input into the trained error detection models to obtain multiple pieces of model detection information; screened detection data satisfying a preset screening condition is obtained from the model detection information; and integration processing is performed on the screened detection data to obtain the text detection result. By constructing multiple error detection models to obtain multiple pieces of model detection information for the text to be detected, and then screening and integrating that information to obtain the text detection result, the method can greatly improve the efficiency and accuracy of error detection on Chinese text compared with fixed template matching.
An embodiment of this application also provides an artificial intelligence-based text error detection apparatus, which is used to execute any embodiment of the aforementioned artificial intelligence-based text error detection method. Specifically, refer to FIG. 8, which is a schematic block diagram of the text error detection apparatus provided by an embodiment of this application. The artificial intelligence-based text error detection apparatus can be configured in a user terminal.
As shown in FIG. 8, the artificial intelligence-based text error detection apparatus 100 includes a detection model configuration unit 110, a detection model training unit 120, a model detection information acquisition unit 130, a model detection information screening unit 140 and an integration processing unit 150.
The detection model configuration unit 110 is configured to receive model configuration information input by a user, and configure parameter values in the initialized detection model according to the model configuration information to obtain multiple error detection models.
The detection model training unit 120 is configured to train the multiple error detection models separately according to a preset conversion dictionary and a pre-stored training corpus database to obtain multiple trained error detection models.
In one embodiment, the detection model training unit 120 includes the following sub-units: a training corpus set acquisition unit, a corpus code acquisition unit and an iterative training unit.
The training corpus set acquisition unit is configured to randomly allocate the training corpus of the training corpus database according to the number of error detection models to obtain the same number of training corpus sets. The corpus code acquisition unit is configured to convert the corpus information in the plurality of training corpus sets according to the conversion dictionary to obtain one corpus code corresponding to each piece of corpus information. The iterative training unit is configured to iteratively train, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set, to obtain one trained error detection model corresponding to each training corpus set, wherein each corpus code in a training corpus set corresponds to one piece of preset target detection information.
In one embodiment, the iterative training unit includes the following sub-units: a training detection information acquisition unit, a loss value calculation unit, a transition matrix parameter update unit and a repeat unit.
The training detection information acquisition unit is configured to input one corpus code from one of the training corpus sets into the error detection model to obtain training detection information corresponding to the corpus code. The loss value calculation unit is configured to calculate the loss value between the training detection information and the target detection information of the corpus code according to the pre-stored loss function calculation formula. The transition matrix parameter update unit is configured to calculate the update value of the transition matrix in the error detection model according to the pre-stored gradient calculation formula, the loss value and the calculated value of the training detection information, and to update the parameter values of the transition matrix. The repeat unit is configured to obtain the next corpus code in the training corpus set, input it into the error detection model, and repeat the above steps until all corpus codes contained in the training corpus set have been used for training.
The model detection information acquisition unit 130 is configured to, when text to be detected is received from the user, input the text to be detected into each of the error detection models to obtain a corresponding plurality of pieces of model detection information.
In an embodiment, the model detection information acquisition unit 130 includes the following subunits: a text code acquisition unit and a text code calculation unit.
The text code acquisition unit is configured to convert the text to be detected into a corresponding text code according to the conversion dictionary. The text code calculation unit is configured to input the text code into each of the error detection models for calculation, obtaining the model detection information output by each error detection model.
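As a rough illustration of these two subunits, the text is encoded once and the same code sequence is passed to every model. Representing each model as a plain callable, and the names `detect_with_all` and `unknown`, are assumptions for this sketch.

```python
def detect_with_all(text, conversion_dict, models, unknown=0):
    """Encode the text once, then collect one output per error detection model.

    models: list of callables that each accept a list of integer codes
            (a hypothetical interface chosen for this sketch).
    """
    codes = [conversion_dict.get(ch, unknown) for ch in text]
    # Pass a copy to each model so no model can alter another model's input.
    return [model(list(codes)) for model in models]
```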
The model detection information screening unit 140 is configured to screen the plurality of pieces of model detection information to obtain screened detection data that satisfies a preset screening condition.
In an embodiment, the model detection information screening unit 140 includes the following subunits: a model detection information ranking unit, a model detection information interception unit, and a detection data screening unit.
The model detection information ranking unit is configured to obtain a comprehensive detection score for each piece of model detection information and to rank the pieces of model detection information by that score. The model detection information interception unit is configured to take, according to the model screening ratio, the top-ranked pieces of model detection information as multiple groups of candidate model detection information. The detection data screening unit is configured to screen each group of candidate model detection information according to the detection data screening ratio, taking the top-ranked detection data in each group as one group of detection data information.
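The two-stage screening described above can be sketched as below. The data layout (a dict with a `score` and a list of `(confidence, position, error_type)` tuples), rounding the kept count up, and always keeping at least one entry are assumptions; the application only states that top-ranked entries are kept according to the two ratios.

```python
import math


def screen(model_outputs, model_ratio, data_ratio):
    """Keep the best models, then the best detections within each kept model.

    model_outputs: list of dicts such as
        {"score": 0.9, "detections": [(confidence, position, error_type), ...]}
    Returns one list of detections per kept model.
    """
    ranked = sorted(model_outputs, key=lambda m: m["score"], reverse=True)
    keep_models = max(1, math.ceil(len(ranked) * model_ratio))
    screened = []
    for output in ranked[:keep_models]:
        # Rank detections by confidence, highest first.
        detections = sorted(output["detections"], key=lambda d: d[0], reverse=True)
        keep = max(1, math.ceil(len(detections) * data_ratio))
        screened.append(detections[:keep])
    return screened
```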
The integration processing unit 150 is configured to perform integration processing on the screened detection data to obtain a text detection result that matches the text to be detected.
In an embodiment, the integration processing unit 150 includes the following subunits: an error position and type determination unit, a text error position judgment unit, a word segmentation error judgment unit, a deletion unit, a text error type judgment unit, and a text detection result acquisition unit.
The error position and type determination unit is configured to determine, from each group of detection data information in the screened detection data, the text error position and text error type of that group. The text error position judgment unit is configured to judge whether all the text error positions are the same. The word segmentation error judgment unit is configured to, if the text error positions are not all the same, judge according to a preset judgment rule whether any of the text error positions contains a word segmentation error. The deletion unit is configured to delete the text error positions that contain a word segmentation error and to return to the step of judging whether all the text error positions are the same. The text error type judgment unit is configured to, if all the text error positions are the same, judge whether all the text error types are the same. The text detection result acquisition unit is configured to, if all the text error types are the same, output the text error position and the text error type as the text detection result.
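The control flow of this integration step can be sketched as follows. Each group is reduced to a `(position, error_type)` pair, and the preset judgment rule is passed in as a hypothetical predicate `is_segmentation_error`. The application does not say what happens when the error types disagree, or when positions disagree but no segmentation error is found; returning `None` in those cases is an assumption of this sketch.

```python
def integrate(groups, is_segmentation_error):
    """Combine per-model detections into a single result, or None.

    groups: list of (position, error_type) pairs, one per group of detection
            data; positions must be hashable (for example (start, end) tuples).
    is_segmentation_error: callable(position) -> bool, standing in for the
            preset judgment rule.
    """
    remaining = list(groups)
    while remaining:
        positions = {pos for pos, _ in remaining}
        if len(positions) == 1:
            types = {err for _, err in remaining}
            # Output only when every group agrees on the error type as well.
            return remaining[0] if len(types) == 1 else None
        # Positions disagree: delete those caused by word segmentation errors.
        filtered = [g for g in remaining if not is_segmentation_error(g[0])]
        if len(filtered) == len(remaining):
            return None  # nothing was deleted, so stop instead of looping forever
        remaining = filtered
    return None
```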
In an embodiment, the integration processing unit 150 further includes the following subunits: a grammar template matching unit and an elimination unit.
The grammar template matching unit is configured to match preset grammar templates against the text error position and text error type of each group of detection data information. The elimination unit is configured to eliminate the text error positions and text error types that match a grammar template.
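One way to picture this elimination step is given below. The form of a grammar template is not specified at this point, so the sketch assumes each template is a pair of a compiled regular expression and an error type, and that a detection matches when its type equals the template's type and the text at its position fully matches the expression. The name `cull_by_templates` is likewise hypothetical.

```python
import re


def cull_by_templates(text, groups, templates):
    """Drop detections whose position and type match any grammar template.

    groups:    list of ((start, end), error_type) pairs.
    templates: list of (compiled_regex, error_type) pairs (assumed format).
    """
    kept = []
    for (start, end), err in groups:
        span = text[start:end]
        matched = any(
            t_err == err and pattern.fullmatch(span) is not None
            for pattern, t_err in templates
        )
        if not matched:
            kept.append(((start, end), err))
    return kept
```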
The artificial intelligence-based text error detection apparatus provided in the embodiments of the present application applies the artificial intelligence-based text error detection method described above. Multiple error detection models are constructed from the model configuration information and an initialized detection model; the error detection models are trained separately; the text to be detected is input into the trained error detection models to obtain multiple pieces of model detection information; the model detection information is screened to obtain screened detection data that satisfies the preset screening condition; and the screened detection data undergoes integration processing to obtain the text detection result. By obtaining multiple pieces of model detection information for the text to be detected from multiple error detection models, and then screening and integrating that information into a text detection result, the method can substantially improve the efficiency and accuracy of error detection on Chinese text compared with fixed template matching.
The text error detection apparatus described above may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in FIG. 9.
Refer to FIG. 9, which is a schematic block diagram of a computer device provided in an embodiment of the present application. The computer device may be a user terminal that executes the artificial intelligence-based text error detection method to perform error detection on Chinese text.
Referring to FIG. 9, the computer device 500 includes a processor 502, a memory, and a network interface 505 connected through a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. When executed, the computer program 5032 causes the processor 502 to perform the artificial intelligence-based text error detection method.
The processor 502 provides computing and control capabilities and supports the operation of the entire computer device 500.
The internal memory 504 provides an environment for running the computer program 5032 stored in the non-volatile storage medium 503. When the computer program 5032 is executed by the processor 502, it causes the processor 502 to perform the artificial intelligence-based text error detection method.
The network interface 505 is used for network communication, such as transmitting data. Those skilled in the art will understand that the structure shown in FIG. 9 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the computer device 500 to which the solution is applied. A specific computer device 500 may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
The processor 502 is configured to run the computer program 5032 stored in the memory to implement the corresponding functions of the artificial intelligence-based text error detection method described above.
Those skilled in the art will understand that the embodiment of the computer device shown in FIG. 9 does not limit the specific configuration of the computer device. In other embodiments, the computer device may include more or fewer components than shown, combine certain components, or arrange the components differently. For example, in some embodiments the computer device may include only a memory and a processor; in such embodiments, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 9 and are not repeated here.
It should be understood that, in the embodiments of the present application, the processor 502 may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Another embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium may be a non-volatile computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the artificial intelligence-based text error detection method described above.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the devices, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here. Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed devices, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function; in an actual implementation there may be other divisions, units with the same function may be combined into one unit, multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a computer-readable storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The computer-readable storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disc.
The foregoing describes only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person familiar with the technical field can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and such modifications or replacements shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. An artificial intelligence-based text error detection method, applied to a user terminal, wherein the method comprises:
    receiving model configuration information input by a user, and configuring parameter values of an initialized detection model according to the model configuration information to obtain a plurality of error detection models;
    training the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database to obtain a plurality of trained error detection models;
    if text to be detected input by the user is received, inputting the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information;
    screening the plurality of pieces of model detection information to obtain screened detection data that satisfies a preset screening condition;
    performing integration processing on the screened detection data to obtain a text detection result that matches the text to be detected.
  2. The artificial intelligence-based text error detection method according to claim 1, wherein the training the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database comprises:
    randomly allocating the training corpus in the training corpus database according to the number of the error detection models to obtain a number of training corpus sets equal to the number of the error detection models;
    converting the corpus information in the plurality of training corpus sets according to the conversion dictionary to obtain one corpus code corresponding to each piece of corpus information;
    iteratively training, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set to obtain one trained error detection model corresponding to each training corpus set, wherein each corpus code in a training corpus set corresponds to one piece of preset target detection information.
  3. The artificial intelligence-based text error detection method according to claim 2, wherein the iteratively training, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set comprises:
    inputting one corpus code from one of the training corpus sets into the error detection model to obtain training detection information corresponding to the corpus code;
    calculating, according to a pre-stored loss function formula, a loss value between the training detection information and the target detection information of the corpus code;
    calculating an updated value of a transition matrix in the error detection model according to a pre-stored gradient calculation formula, the loss value, and a calculated value of the training detection information, and updating parameter values of the transition matrix;
    obtaining the next corpus code in the training corpus set, inputting it into the error detection model, and repeating the above steps until all corpus codes in the training corpus set have been used for training.
  4. The artificial intelligence-based text error detection method according to claim 1, wherein the inputting the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information comprises:
    converting the text to be detected into a corresponding text code according to the conversion dictionary;
    inputting the text code into each of the plurality of error detection models for calculation to obtain the model detection information output by each error detection model.
  5. The artificial intelligence-based text error detection method according to claim 1, wherein the screening condition comprises a detection data screening ratio and a model screening ratio, the screened detection data comprises multiple groups of detection data information, and the screening the plurality of pieces of model detection information to obtain screened detection data that satisfies a preset screening condition comprises:
    obtaining a comprehensive detection score for each piece of model detection information, and ranking the plurality of pieces of model detection information according to the comprehensive detection score;
    taking, according to the model screening ratio, a plurality of top-ranked pieces of model detection information as multiple groups of candidate model detection information;
    screening each group of candidate model detection information according to the detection data screening ratio, and taking a plurality of top-ranked detection data in each group of candidate model detection information as one group of the detection data information.
  6. The artificial intelligence-based text error detection method according to claim 1, wherein the performing integration processing on the screened detection data to obtain a text detection result that matches the text to be detected comprises:
    determining a text error position and a text error type of each group of detection data information according to each group of detection data information in the screened detection data;
    judging whether all the text error positions are the same;
    if the text error positions are not all the same, judging according to a preset judgment rule whether any of the text error positions contains a word segmentation error;
    deleting the text error positions that contain a word segmentation error, and returning to the step of judging whether all the text error positions are the same;
    if all the text error positions are the same, judging whether all the text error types are the same;
    if all the text error types are the same, outputting the text error position and the text error type as the text detection result.
  7. The artificial intelligence-based text error detection method according to claim 1, wherein before the judging whether all the text error positions are the same, the method further comprises:
    matching preset grammar templates against the text error position and the text error type of each group of the detection data information;
    eliminating the text error positions and text error types that match the grammar templates.
  8. An artificial intelligence-based text error detection apparatus, comprising:
    a detection model configuration unit, configured to receive model configuration information input by a user, and to configure parameter values of an initialized detection model according to the model configuration information to obtain a plurality of error detection models;
    a detection model training unit, configured to train the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database to obtain a plurality of trained error detection models;
    a model detection information acquisition unit, configured to, if text to be detected input by the user is received, input the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information;
    a model detection information screening unit, configured to screen the plurality of pieces of model detection information to obtain screened detection data that satisfies a preset screening condition;
    an integration processing unit, configured to perform integration processing on the screened detection data to obtain a text detection result that matches the text to be detected.
  9. A computer device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the following steps when executing the computer program:
    receiving model configuration information input by a user, and configuring parameter values of an initialized detection model according to the model configuration information to obtain a plurality of error detection models;
    training the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database to obtain a plurality of trained error detection models;
    if text to be detected input by the user is received, inputting the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information;
    screening the plurality of pieces of model detection information to obtain screened detection data that satisfies a preset screening condition;
    performing integration processing on the screened detection data to obtain a text detection result that matches the text to be detected.
  10. The computer device according to claim 9, wherein the training the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database comprises:
    randomly allocating the training corpus in the training corpus database according to the number of the error detection models to obtain a number of training corpus sets equal to the number of the error detection models;
    converting the corpus information in the plurality of training corpus sets according to the conversion dictionary to obtain one corpus code corresponding to each piece of corpus information;
    iteratively training, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set to obtain one trained error detection model corresponding to each training corpus set, wherein each corpus code in a training corpus set corresponds to one piece of preset target detection information.
  11. The computer device according to claim 10, wherein the iteratively training, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set comprises:
    inputting one corpus code from one of the training corpus sets into the error detection model to obtain training detection information corresponding to the corpus code;
    calculating, according to a pre-stored loss function formula, a loss value between the training detection information and the target detection information of the corpus code;
    calculating an updated value of a transition matrix in the error detection model according to a pre-stored gradient calculation formula, the loss value, and a calculated value of the training detection information, and updating parameter values of the transition matrix;
    obtaining the next corpus code in the training corpus set, inputting it into the error detection model, and repeating the above steps until all corpus codes in the training corpus set have been used for training.
  12. The computer device according to claim 9, wherein the inputting the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information comprises:
    converting the text to be detected into a corresponding text code according to the conversion dictionary;
    inputting the text code into each of the plurality of error detection models for calculation to obtain the model detection information output by each error detection model.
  13. The computer device according to claim 9, wherein the screening condition comprises a detection data screening ratio and a model screening ratio, the screened detection data comprises multiple groups of detection data information, and said screening the plurality of pieces of model detection information to obtain screened detection data that satisfies the preset screening condition comprises:
    obtaining a comprehensive detection score for each piece of model detection information, and sorting the plurality of pieces of model detection information according to the comprehensive detection scores;
    selecting, according to the model screening ratio, the top-ranked pieces of model detection information as multiple groups of candidate model detection information;
    screening each group of candidate model detection information according to the detection data screening ratio, and obtaining the top-ranked detection data in each group of candidate model detection information as one group of the detection data information.
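The two-level screening described in this claim can be sketched as follows. The `ModelDetection` structure, the per-item scores, and the rounding rule (ceiling, with at least one item kept) are assumptions; the claim only states that the top-ranked items are kept according to each ratio.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


@dataclass
class ModelDetection:
    model_name: str
    overall_score: float                 # comprehensive detection score
    detections: List[Tuple[str, float]]  # hypothetical (label, score) pairs


def top_fraction(items: Sequence[T], ratio: float, key: Callable[[T], float]) -> List[T]:
    """Sort descending by key and keep the top `ratio` share (assumed: ceil, min 1)."""
    ranked = sorted(items, key=key, reverse=True)
    keep = max(1, math.ceil(len(ranked) * ratio)) if ranked else 0
    return ranked[:keep]


def screen(
    infos: Sequence[ModelDetection], model_ratio: float, data_ratio: float
) -> List[List[Tuple[str, float]]]:
    """Keep the best models first, then the best detection data within each."""
    candidates = top_fraction(infos, model_ratio, key=lambda m: m.overall_score)
    return [
        top_fraction(m.detections, data_ratio, key=lambda d: d[1])
        for m in candidates
    ]
```

Each inner list in the result corresponds to one group of detection data information.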
  14. The computer device according to claim 9, wherein said performing integration processing on the screened detection data to obtain a text detection result matching the text to be detected comprises:
    determining, from each group of detection data information in the screened detection data, the text error position and text error type of that group;
    determining whether all of the text error positions are the same;
    if the text error positions are not all the same, determining according to a preset judgment rule whether any of the text error positions contains a word segmentation error;
    deleting any text error position that contains a word segmentation error, and returning to the step of determining whether all of the text error positions are the same;
    if all of the text error positions are the same, determining whether all of the text error types are the same;
    if all of the text error types are the same, outputting the text error position and the text error type as the text detection result.
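The integration loop in this claim can be sketched as below. The span representation of a position and the `is_segmentation_error` predicate are hypothetical stand-ins for the preset judgment rule. The claim does not say what happens when the error types disagree or when no position can be removed; returning `None` in those cases is an assumption added so the loop always terminates.

```python
from typing import Callable, List, Optional, Tuple

Span = Tuple[int, int]     # assumed (start, end) character offsets
Finding = Tuple[Span, str]  # (text error position, text error type)


def integrate(
    findings: List[Finding],
    is_segmentation_error: Callable[[Span], bool],
) -> Optional[Finding]:
    """Return the agreed (position, type) pair, or None if no agreement is reached."""
    remaining = list(findings)
    while remaining:
        positions = {pos for pos, _ in remaining}
        if len(positions) == 1:
            # All positions agree, so check whether the error types agree too.
            types = {err_type for _, err_type in remaining}
            return remaining[0] if len(types) == 1 else None
        # Positions disagree: drop those flagged as word segmentation errors.
        kept = [f for f in remaining if not is_segmentation_error(f[0])]
        if len(kept) == len(remaining):
            return None  # nothing was removed, so stop (assumption)
        remaining = kept
    return None
```

For example, if two groups report span `(2, 4)` and one reports `(3, 4)`, and the rule flags `(3, 4)` as a segmentation error, the loop removes it and the two remaining groups agree.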
  15. The computer device according to claim 9, wherein, before said determining whether all of the text error positions are the same, the following steps are further included:
    matching the text error position and text error type of each group of detection data information against a preset grammar template;
    removing the text error positions and text error types that match the grammar template.
  16. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to perform the following operations:
    receiving model configuration information input by a user, and configuring parameter values in an initialized detection model according to the model configuration information to obtain a plurality of error detection models;
    training the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database to obtain a plurality of trained error detection models;
    if text to be detected input by the user is received, inputting the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information;
    screening the plurality of pieces of model detection information to obtain screened detection data that satisfies a preset screening condition;
    performing integration processing on the screened detection data to obtain a text detection result matching the text to be detected.
  17. The computer-readable storage medium according to claim 16, wherein said training the plurality of error detection models respectively according to a preset conversion dictionary and a pre-stored training corpus database comprises:
    randomly allocating the training corpora in the training corpus database according to the number of error detection models to obtain a plurality of training corpus sets equal in number to the error detection models;
    converting the corpus information in the plurality of training corpus sets according to the conversion dictionary to obtain one corpus code corresponding to each piece of corpus information;
    iteratively training, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set to obtain one trained error detection model corresponding to each training corpus set, wherein each corpus code in a training corpus set corresponds to one piece of preset target detection information.
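The random allocation step can be sketched as a shuffle followed by a round-robin split. Treating the allocation as a disjoint partition of near-equal size is an assumption; the claim only requires that the number of training corpus sets equal the number of error detection models. The `seed` parameter is added here purely for reproducibility.

```python
import random
from typing import List, Optional, Sequence


def split_corpus(
    corpus: Sequence[str], num_models: int, seed: Optional[int] = None
) -> List[List[str]]:
    """Randomly allocate corpus entries into `num_models` disjoint sets."""
    if num_models < 1:
        raise ValueError("num_models must be at least 1")
    rng = random.Random(seed)
    shuffled = list(corpus)
    rng.shuffle(shuffled)
    # Round-robin slicing gives set sizes that differ by at most one entry.
    return [shuffled[i::num_models] for i in range(num_models)]
```

Each resulting set would then be encoded with the conversion dictionary and used to train its own model, so the models see different data and are less likely to share the same blind spots.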
  18. The computer-readable storage medium according to claim 17, wherein said iteratively training, according to the plurality of training corpus sets, the one error detection model corresponding to each training corpus set comprises:
    inputting one corpus code from one of the training corpus sets into the error detection model to obtain training detection information corresponding to the corpus code;
    calculating a loss value between the training detection information and the target detection information of the corpus code according to a pre-stored loss function calculation formula;
    calculating an updated value of the transition matrix in the error detection model according to a pre-stored gradient calculation formula, the loss value, and the calculated value of the training detection information, and updating the parameter values of the transition matrix;
    obtaining the next piece of corpus coding information in the training corpus set, inputting it into the error detection model, and repeating the above steps until all corpus coding information in the training corpus set has been used for training.
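The shape of this training loop (forward pass, loss, gradient update, next sample) can be illustrated with a deliberately simplified sketch. The claim does not disclose the loss or gradient formulas, so softmax cross-entropy with plain gradient descent is swapped in here, and a code-by-label score matrix stands in for the model's transition matrix. None of these choices should be read as the patented model.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[int], Sequence[int]]  # (corpus code, target labels)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_one_pass(
    matrix: List[List[float]], corpus: Sequence[Sample], lr: float = 0.5
) -> List[float]:
    """Use every sample once, updating `matrix` in place; return per-sample losses."""
    losses = []
    for codes, targets in corpus:
        # Step 1: training detection information (label probabilities per token).
        outputs = [softmax(matrix[c]) for c in codes]
        # Step 2: loss between the outputs and the target detection information.
        loss = -sum(math.log(p[t]) for p, t in zip(outputs, targets))
        # Step 3: gradient of cross-entropy w.r.t. scores is (prob - one_hot).
        for c, p, t in zip(codes, outputs, targets):
            for label, prob in enumerate(p):
                grad = prob - (1.0 if label == t else 0.0)
                matrix[c][label] -= lr * grad
        losses.append(loss)
    return losses
```

Running a second pass over the same sample should yield a lower loss than the first, which is a quick sanity check that the update direction is correct.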
  19. The computer-readable storage medium according to claim 16, wherein said inputting the text to be detected into each of the plurality of error detection models to obtain a corresponding plurality of pieces of model detection information comprises:
    converting the text to be detected into a corresponding text code according to the conversion dictionary;
    inputting the text code into each of the plurality of error detection models for calculation, so as to obtain the model detection information output by each error detection model.
  20. The computer-readable storage medium according to claim 16, wherein the screening condition comprises a detection data screening ratio and a model screening ratio, the screened detection data comprises multiple groups of detection data information, and said screening the plurality of pieces of model detection information to obtain screened detection data that satisfies the preset screening condition comprises:
    obtaining a comprehensive detection score for each piece of model detection information, and sorting the plurality of pieces of model detection information according to the comprehensive detection scores;
    selecting, according to the model screening ratio, the top-ranked pieces of model detection information as multiple groups of candidate model detection information;
    screening each group of candidate model detection information according to the detection data screening ratio, and obtaining the top-ranked detection data in each group of candidate model detection information as one group of the detection data information.
PCT/CN2021/083936 2020-11-24 2021-03-30 Text error detection method and apparatus based on artificial intelligence, and computer device WO2021208727A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011329034.7A CN112434131B (en) 2020-11-24 2020-11-24 Text error detection method and device based on artificial intelligence and computer equipment
CN202011329034.7 2020-11-24

Publications (1)

Publication Number Publication Date
WO2021208727A1 true WO2021208727A1 (en) 2021-10-21

Family

ID=74693967

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083936 WO2021208727A1 (en) 2020-11-24 2021-03-30 Text error detection method and apparatus based on artificial intelligence, and computer device

Country Status (2)

Country Link
CN (1) CN112434131B (en)
WO (1) WO2021208727A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114826790A (en) * 2022-06-30 2022-07-29 浪潮电子信息产业股份有限公司 Block chain monitoring method, device, equipment and storage medium
CN116754894A (en) * 2023-06-16 2023-09-15 深圳市泰士特线缆有限公司 Combination detection method, device, equipment and medium of communication cable
CN117707987A (en) * 2024-02-06 2024-03-15 暗物智能科技(广州)有限公司 Test case detection method and device, electronic equipment and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434131B (en) * 2020-11-24 2023-09-29 平安科技(深圳)有限公司 Text error detection method and device based on artificial intelligence and computer equipment
CN113192497B (en) * 2021-04-28 2024-03-01 平安科技(深圳)有限公司 Speech recognition method, device, equipment and medium based on natural language processing
CN116662039B (en) * 2023-07-25 2024-01-23 菲特(天津)检测技术有限公司 Industrial information parallel detection method, device and medium based on shared memory

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170235721A1 (en) * 2016-02-17 2017-08-17 The King Abdulaziz City For Science And Technology Method and system for detecting semantic errors in a text using artificial neural networks
CN108766437A (en) * 2018-05-31 2018-11-06 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN111209740A (en) * 2019-12-31 2020-05-29 中移(杭州)信息技术有限公司 Text model training method, text error correction method, electronic device and storage medium
CN111737982A (en) * 2020-06-29 2020-10-02 武汉虹信技术服务有限责任公司 Chinese text wrongly-written character detection method based on deep learning
CN111859919A (en) * 2019-12-02 2020-10-30 北京嘀嘀无限科技发展有限公司 Text error correction model training method and device, electronic equipment and storage medium
CN112434131A (en) * 2020-11-24 2021-03-02 平安科技(深圳)有限公司 Text error detection method and device based on artificial intelligence, and computer equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107357775A (en) * 2017-06-05 2017-11-17 百度在线网络技术(北京)有限公司 The text error correction method and device of Recognition with Recurrent Neural Network based on artificial intelligence
US11263541B2 (en) * 2017-09-27 2022-03-01 Oracle International Corporation Ensembled decision systems using feature hashing models
CN110287283B (en) * 2019-05-22 2023-08-01 中国平安财产保险股份有限公司 Intention model training method, intention recognition method, device, equipment and medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170235721A1 (en) * 2016-02-17 2017-08-17 The King Abdulaziz City For Science And Technology Method and system for detecting semantic errors in a text using artificial neural networks
CN108766437A (en) * 2018-05-31 2018-11-06 平安科技(深圳)有限公司 Audio recognition method, device, computer equipment and storage medium
CN111859919A (en) * 2019-12-02 2020-10-30 北京嘀嘀无限科技发展有限公司 Text error correction model training method and device, electronic equipment and storage medium
CN111209740A (en) * 2019-12-31 2020-05-29 中移(杭州)信息技术有限公司 Text model training method, text error correction method, electronic device and storage medium
CN111737982A (en) * 2020-06-29 2020-10-02 武汉虹信技术服务有限责任公司 Chinese text wrongly-written character detection method based on deep learning
CN112434131A (en) * 2020-11-24 2021-03-02 平安科技(深圳)有限公司 Text error detection method and device based on artificial intelligence, and computer equipment

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114826790A (en) * 2022-06-30 2022-07-29 浪潮电子信息产业股份有限公司 Block chain monitoring method, device, equipment and storage medium
CN116754894A (en) * 2023-06-16 2023-09-15 深圳市泰士特线缆有限公司 Combination detection method, device, equipment and medium of communication cable
CN116754894B (en) * 2023-06-16 2024-02-23 深圳市泰士特线缆有限公司 Combination detection method, device, equipment and medium of communication cable
CN117707987A (en) * 2024-02-06 2024-03-15 暗物智能科技(广州)有限公司 Test case detection method and device, electronic equipment and storage medium
CN117707987B (en) * 2024-02-06 2024-06-11 暗物智能科技(广州)有限公司 Test case detection method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112434131A (en) 2021-03-02
CN112434131B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
WO2021208727A1 (en) Text error detection method and apparatus based on artificial intelligence, and computer device
US11636264B2 (en) Stylistic text rewriting for a target author
CN109446517B (en) Reference resolution method, electronic device and computer readable storage medium
CN110276023B (en) POI transition event discovery method, device, computing equipment and medium
JP2020520492A (en) Document abstract automatic extraction method, device, computer device and storage medium
US10585989B1 (en) Machine-learning based detection and classification of personally identifiable information
US20180068221A1 (en) System and Method of Advising Human Verification of Machine-Annotated Ground Truth - High Entropy Focus
US11392853B2 (en) Methods and arrangements to adjust communications
CN111985229A (en) Sequence labeling method and device and computer equipment
WO2021072863A1 (en) Method and apparatus for calculating text similarity, electronic device, and computer-readable storage medium
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN114676704A (en) Sentence emotion analysis method, device and equipment and storage medium
CN114818891A (en) Small sample multi-label text classification model training method and text classification method
CN113626576A (en) Method and device for extracting relational characteristics in remote supervision, terminal and storage medium
CN112906361A (en) Text data labeling method and device, electronic equipment and storage medium
CN112214595A (en) Category determination method, device, equipment and medium
US11481547B2 (en) Framework for chinese text error identification and correction
CN114266252A (en) Named entity recognition method, device, equipment and storage medium
CN111581377B (en) Text classification method and device, storage medium and computer equipment
CN113688232A (en) Method and device for classifying bidding texts, storage medium and terminal
CN113591881A (en) Intention recognition method and device based on model fusion, electronic equipment and medium
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
WO2023116572A1 (en) Word or sentence generation method and related device
KR20190101551A (en) Classifying method using a probability labele annotation algorithm using fuzzy category representation
CN111680513B (en) Feature information identification method and device and computer readable storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 21788678; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 21788678; Country of ref document: EP; Kind code of ref document: A1