WO2023005293A1 - Text error correction method, apparatus, device, and storage medium - Google Patents
- Publication number: WO2023005293A1 (PCT application PCT/CN2022/088892)
- Authority: WIPO (PCT)
- Prior art keywords
- text
- word
- entity
- named entity
- corrected
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/232—Orthographic correction, e.g. spell checking or vowelisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Definitions
- the present application relates to the field of data analysis, and in particular to a text error correction method, device, equipment and storage medium.
- in the prior art, text error correction mainly relies on the perplexity of a language model and the similarity of glyphs and pronunciations to select the replacement word with the highest probability.
- the inventor realized that the existing technology can only handle typos and cannot handle other situations such as extra characters and missing characters; it requires the cooperation of various other technologies and cannot solve the problem systematically as a whole, resulting in low error correction efficiency and low accuracy.
- the main purpose of this application is to solve the technical problems of low text error correction efficiency and low accuracy in the prior art.
- the first aspect of the present application provides a text error correction method.
- the text error correction method includes: obtaining the text to be corrected and performing word segmentation processing on it to obtain a named entity set; inputting the named entity set into a preset convolutional neural network for domain identification to determine the vertical domain and type of each named entity in the named entity set; selecting the domain knowledge graph corresponding to the vertical domain from a preset domain knowledge graph set, and selecting candidate entities corresponding to the type from that domain knowledge graph; calculating the matching degree between each named entity and its corresponding candidate entities, and generating a correction set according to the matching degree; and selecting candidate entities from the correction set to correct the corresponding named entities in the text to be corrected, obtaining the corrected text.
- the second aspect of the present application proposes a text error correction device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor; when the processor executes the computer-readable instructions, the following steps are implemented: obtain the text to be corrected, and perform word segmentation processing on it to obtain a named entity set; input the named entity set into a preset convolutional neural network for domain identification, and determine the vertical domain and type of each named entity in the named entity set; select the domain knowledge graph corresponding to the vertical domain from a preset domain knowledge graph set, and select candidate entities corresponding to the type from the domain knowledge graph; calculate the matching degree between each named entity and its corresponding candidate entities, and generate a correction set according to the matching degree; select candidate entities from the correction set, and correct the corresponding named entities in the text to be corrected to obtain the corrected text.
- the third aspect of the present application proposes a computer-readable storage medium in which computer instructions are stored; when the computer instructions are run on a computer, the computer performs the following steps: obtain the text to be corrected, and perform word segmentation processing on it to obtain a named entity set; input the named entity set into a preset convolutional neural network for domain identification, and determine the vertical domain and type of each named entity in the named entity set; select the domain knowledge graph corresponding to the vertical domain from a preset domain knowledge graph set, and select candidate entities corresponding to the type from the domain knowledge graph; calculate the matching degree between each named entity and its corresponding candidate entities, and generate a correction set according to the matching degree; select candidate entities from the correction set, and correct the corresponding named entities in the text to be corrected to obtain the corrected text.
- a text error correction device includes: a word segmentation module, used to obtain the text to be corrected and perform word segmentation processing on it to obtain a named entity set; an identification module, used to input the named entity set into a preset convolutional neural network for domain identification and determine the vertical domain and type of each named entity in the named entity set; a selection module, used to select the domain knowledge graph corresponding to the vertical domain from a preset domain knowledge graph set and to select candidate entities corresponding to the type from the domain knowledge graph; a calculation module, used to calculate the matching degree between each named entity and its corresponding candidate entities and to generate a correction set according to the matching degree; and a correction module, configured to select candidate entities from the correction set and correct the corresponding named entities in the text to be corrected, obtaining the corrected text.
- the named entity set is obtained by performing word segmentation on the text to be corrected; the named entity set is input into the preset convolutional neural network for domain identification, and the vertical domain and type of each named entity in the named entity set are determined; the domain knowledge graph corresponding to the vertical domain is selected from the preset domain knowledge graph set, and candidate entities corresponding to the types of the named entities are selected from the domain knowledge graph; the matching degree between each named entity and its candidate entities is calculated, and a correction set is generated according to the matching degree; candidate entities are selected from the correction set, and the corresponding named entities in the text to be corrected are corrected to obtain the corrected text.
- the technical solution provided by this application improves the efficiency and accuracy of error correction by invoking domain knowledge graphs, selecting candidate entities, and performing targeted corrections to errors in the text to be corrected.
- FIG. 1 is a schematic diagram of a first embodiment of the text error correction method in the embodiment of the present application.
- FIG. 2 is a schematic diagram of a second embodiment of the text error correction method in the embodiment of the present application.
- FIG. 3 is a schematic diagram of a third embodiment of the text error correction method in the embodiment of the present application.
- FIG. 4 is a schematic diagram of a fourth embodiment of the text error correction method in the embodiment of the present application.
- FIG. 5 is a schematic diagram of an embodiment of a text error correction device in the embodiment of the present application.
- FIG. 6 is a schematic diagram of another embodiment of the text error correction device in the embodiment of the present application.
- FIG. 7 is a schematic diagram of an embodiment of a text error correction device in the embodiment of the present application.
- the embodiment of the present application provides a text error correction method, apparatus, device, and storage medium.
- word segmentation processing is performed on the text to be corrected to obtain a named entity set; the named entity set is input into a preset convolutional neural network for domain identification, and the vertical domain and type of each named entity in the named entity set are determined; the domain knowledge graph corresponding to the vertical domain is selected from the preset domain knowledge graph set, and candidate entities corresponding to the types of the named entities are selected from the domain knowledge graph; the matching degree between each named entity and its candidate entities is calculated, and a correction set is generated according to the matching degree; candidate entities are selected from the correction set, and the corresponding named entities in the text to be corrected are corrected to obtain the corrected text.
- by invoking the domain knowledge graph, selecting candidate entities, and performing targeted corrections to the errors in the text to be corrected, the efficiency and accuracy of error correction are improved.
- the first embodiment of the text error correction method in the embodiment of the present application includes:
- the server obtains the text to be corrected, and performs word segmentation processing on the text to be corrected, wherein the word segmentation process needs to be combined with a preset word segmentation dictionary.
- the word segmentation dictionary refers to a database containing commonly used or fixed words, which is the benchmark for word segmentation.
- the sentences in the input text to be corrected are converted into independent words of maximum character length; these independent words of maximum character length are named entities, and each named entity is collected to form a named entity set.
- a named entity is a person name, an organization name, a place name, or any other entity identified by a name; broader named entities also include numbers, dates, currencies, addresses, and more.
- word segmentation refers to the process of dividing character strings in the text to be corrected into word strings.
- the word segmentation method may be the forward maximum matching method, the reverse maximum matching method, a conditional random field model, or a hidden Markov model.
- the forward maximum matching method is characterized by high word segmentation efficiency, linear time complexity, easy implementation, and does not need to specify the maximum length of words;
- the reverse maximum matching method is characterized by linear time complexity, and needs to specify the maximum length of words (maxLen);
- the characteristic of the hidden Markov model is that the recognition effect of unregistered words is better than that of the maximum matching method, but the overall effect depends on the training corpus;
- the characteristic of the conditional random field model is that it not only considers the frequency of words, but also considers the context. It has good learning ability, so it has a good effect on the recognition of ambiguous words and unregistered words.
- this embodiment corrects the word segmentation result of the forward maximum matching method by adding a backtracking mechanism.
- backtracking refers to a tentative method that uses a backward strategy to correct the current word segmentation result during the word segmentation process.
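The forward maximum matching strategy described above can be sketched in a few lines of Python; the dictionary and maximum word length are illustrative, and the backtracking correction this embodiment adds on top is omitted for brevity:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position take the longest
    dictionary word starting there, falling back to a single character
    when nothing longer matches."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:  # single chars always accepted
                words.append(piece)
                i += length
                break
    return words

# Toy dictionary (illustrative, not from the patent)
dictionary = {"北京", "北京大学", "大学", "生物"}
print(forward_max_match("北京大学生物系", dictionary))  # ['北京大学', '生物', '系']
```

The backtracking mechanism would re-examine a greedy choice when the remainder of the sentence fails to segment well and retry with a shorter prefix word.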
- the network structure of the CNN includes an input layer, a network layer, and an output layer. The input layer feeds each named entity in the named entity set of the text to be corrected into the network layer; the output layer takes the network layer's output, calculates through a logistic regression (softmax) function the probability that the named entity belongs to each professional field, and determines the vertical field (professional field) of the named entity according to these probabilities. The network layer includes four parts: a convolutional layer, a pooling layer, a feature connection layer, and a fully connected layer. The convolutional layer is designed with two channels: the convolution window size of the first channel is 1 and that of the second channel is 2, so that the CNN can extract features of single characters and of adjacent characters in the text to be corrected. The pooling layer uses max pooling to obtain the most salient feature of each channel output by the convolutional layer; the feature connection layer splices the features of the two channels output by the pooling layer together to obtain a feature matrix; and the fully connected layer finally classifies the feature matrix output by the feature connection layer to obtain the type of the named entity. According to type, each named entity is stored into a {k, v} collection, where k represents the named entity and v represents the type of the named entity.
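The two-channel convolution (window sizes 1 and 2) followed by per-channel max pooling and feature concatenation can be sketched with NumPy. The sizes here are reduced stand-ins; the description later uses 271-dimensional inputs and 512 kernels per channel:

```python
import numpy as np

def two_channel_features(x, w1, w2):
    """x: (n, d) character embeddings of one named entity;
    w1: (d, k) window-1 kernels; w2: (2*d, k) window-2 kernels.
    Returns the 2*k-dimensional vector produced by per-channel max
    pooling over time followed by feature concatenation."""
    c1 = x @ w1                                      # (n, k): single-character features
    pairs = np.concatenate([x[:-1], x[1:]], axis=1)  # (n-1, 2*d): adjacent characters
    c2 = pairs @ w2                                  # (n-1, k): adjacent-pair features
    return np.concatenate([c1.max(axis=0), c2.max(axis=0)])  # pooling + feature join

# Reduced illustrative sizes (the description uses d=271, k=512)
rng = np.random.default_rng(0)
d, k = 8, 16
x = rng.normal(size=(5, d))                          # a 5-character named entity
feats = two_channel_features(x, rng.normal(size=(d, k)), rng.normal(size=(2 * d, k)))
print(feats.shape)  # (32,)
```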
- the domain knowledge graph set contains multiple domain knowledge graphs, and the candidate entities are named entities in the domain knowledge graph.
- in the general domain, a knowledge graph is a series of graphs showing the development process and structural relationships of knowledge, which uses visualization technology to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections between items of knowledge. A knowledge graph combines theories and methods from applied mathematics, graphics, information visualization, information science, and other disciplines with citation analysis, co-occurrence analysis, and other metrological methods, and uses visual graphs to vividly display the core structure, development history, frontier fields, and overall knowledge architecture of a subject, achieving multidisciplinary integration.
- the knowledge graph of a specific field has higher requirements on the accuracy of knowledge, including defining the concept, category, association, and attribute constraints of data.
- the server calculates the matching degree between the named entity and the corresponding candidate entity, and generates a correction set according to the matching degree.
- the set {k, v} is compared in turn with the candidate entities (g) of type v in G. If the named entity k completely matches a candidate entity g, the text to be corrected does not need to be corrected; that is, the higher the matching degree between the named entity and a candidate entity, the lower the correction rate of the named entity. If no candidate entity g exactly matches the named entity k, the candidate entities g with the highest matching degree to k are extracted and gathered to form a correction set C_k; the correction set includes only these candidate entities g.
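A minimal sketch of this matching step, using Python's generic `difflib` string ratio as a stand-in for the patent's glyph/pronunciation/word-order matching degrees (the entity names are illustrative):

```python
from difflib import SequenceMatcher

def build_correction_set(named_entity, candidates):
    """If the entity exactly matches a candidate, nothing needs correcting;
    otherwise keep the candidates with the highest matching degree (here a
    generic string-similarity ratio stands in for the real scores)."""
    if named_entity in candidates:
        return []                       # exact match: no correction needed
    scored = [(SequenceMatcher(None, named_entity, c).ratio(), c)
              for c in candidates]
    best = max(score for score, _ in scored)
    return [c for score, c in scored if score == best]

print(build_correction_set("糖尿柄", ["糖尿病", "高血压", "冠心病"]))  # ['糖尿病']
print(build_correction_set("糖尿病", ["糖尿病"]))                     # []
```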
- the server calculates the occurrence probability of each candidate entity in the correction set in the text to be corrected according to the preset domain language model.
- the candidate entities are sorted according to the comparison results of the corresponding occurrence probabilities to generate an occurrence sequence. According to the sorting results, the candidate entity with the highest probability of occurrence is selected from the occurrence sequence to modify the named entity of the text to be corrected, so as to obtain the corrected text.
- the named entities that need to be corrected are used as confusing words and collected into a confusion dictionary, and the confusion dictionary is called to traverse each word in the text to be corrected after word segmentation processing, so as to improve the efficiency and accuracy of error correction.
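The ranking step can be sketched with a toy add-one-smoothed bigram model standing in for the preset domain language model; the corpus and candidates below are illustrative:

```python
import math
from collections import Counter

# Tiny in-domain corpus standing in for the domain language model's
# training data (illustrative).
corpus = "患者 被 诊断 为 糖尿病 。 糖尿病 需要 控制 血糖 。".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def logprob(prev_word, word):
    """Add-one smoothed log P(word | prev_word) under the toy bigram model."""
    return math.log((bigrams[(prev_word, word)] + 1)
                    / (unigrams[prev_word] + len(unigrams)))

def pick_best(prev_word, candidates):
    """Rank correction-set candidates by occurrence probability in context
    and return the most probable replacement."""
    return max(candidates, key=lambda c: logprob(prev_word, c))

print(pick_best("为", ["糖尿病", "冠心病"]))  # 糖尿病
```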
- word segmentation processing is performed on the text to be corrected to obtain named entities; candidate entities consistent with the types of the named entities are selected from the domain knowledge graph corresponding to the vertical field of the text to be corrected; the matching degree between the named entities and the candidate entities is calculated to generate a correction set; and the text to be corrected is corrected according to the correction set. The efficiency and accuracy of error correction are thereby improved.
- the second embodiment of the text error correction method in the embodiment of the present application includes:
- the server obtains the text to be corrected and calls the dictionary of the Chinese word segmentation tool (jieba) as the dictionary used for word segmentation of the text to be corrected, deleting some uncommon words while retaining correct and commonly used words as much as possible, which reduces the error rate and the size of the tokenizer's dictionary. The server then calls the dictionary to generate a prefix tree (trie) for the text to be corrected and scans the word graph of the trie structure; that is, the words in the dictionary are put into a trie. If the first few characters of several words are the same, the words share a common prefix, and the trie can store them so as to improve lookup speed.
- the sentences in the text to be corrected are subjected to word graph scanning processing according to a preset dictionary to generate a directed acyclic graph.
- the text to be corrected is segmented according to the obtained maximum segmentation combination; that is, word segmentation is performed according to the character combination to obtain a word sequence.
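The trie/DAG segmentation described here follows the jieba approach: build a directed acyclic graph of dictionary words starting at each position, then search for the maximum-probability path by dynamic programming. A self-contained sketch with an illustrative toy frequency dictionary:

```python
import math

# Toy frequency dictionary (illustrative; real systems use jieba's full dict).
FREQ = {"研究": 20, "研究生": 10, "生命": 15, "命": 2, "的": 50,
        "起源": 12, "研": 1, "究": 1, "生": 5, "起": 1, "源": 1}
TOTAL = sum(FREQ.values())

def segment(text):
    """Build the DAG of dictionary words starting at each position, then
    search the maximum log-probability path right to left — the 'maximum
    segmentation combination based on word frequency'."""
    n = len(text)
    dag = {i: [j for j in range(i + 1, n + 1)
               if j == i + 1 or text[i:j] in FREQ] for i in range(n)}
    route = [(0.0, 0)] * (n + 1)      # (best score, end index of chosen word)
    logtotal = math.log(TOTAL)
    for i in range(n - 1, -1, -1):
        route[i] = max((math.log(FREQ.get(text[i:j], 1)) - logtotal
                        + route[j][0], j) for j in dag[i])
    words, i = [], 0
    while i < n:                      # follow the back-pointers left to right
        j = route[i][1]
        words.append(text[i:j])
        i = j
    return words

print(segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```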
- the word sequence is input into the named entity recognition model based on the word sequence preset by the server, and the recognition result is output, that is, the named entities in the text to be corrected are identified and collected into a named entity set.
- a named entity recognition model based on word sequences is adopted; the input of the model is a word sequence rather than a character sequence, which improves recognition efficiency and also reduces memory usage.
- named entity recognition (NER) is an important basic tool in application fields such as information extraction, question answering systems, syntactic analysis, and machine translation, where it occupies an important position.
- the task of the named entity recognition model is to identify the three major categories (entity, time, and number) and seven subcategories (person name, organization name, place name, time, date, currency, and percentage) of named entities in the text to be processed.
- the fully connected layer in the convolutional neural network has two hidden layers, and the number of nodes in the output layer of the fully connected layer is consistent with the number of preset named entity types; the output layer of the CNN model uses the softmax function to calculate the probability of each domain, that is, to calculate the domain attribute value.
- the type feature information of each named entity in the named entity set is extracted, and the matching degree between the type feature information and the preset named entity types is calculated; that is, the similarity between the type feature information and the preset types is computed, by which the type of the named entity is determined.
- the input layer of the CNN model takes a named entity of dimension 8*271; the convolutional layer has two channels with convolution window dimensions of 1*271 and 2*271 respectively, and each channel has 512 convolution kernels.
- the output of the convolutional layer is a matrix of 8*512 and 7*512 respectively.
- the pooling layer performs the max pooling operation on the outputs of the convolutional layer and outputs two pieces of type feature information of dimension 1*512 each, output in the form of feature vectors. The feature connection layer splices the two pooling outputs together into a single 1*1024 piece of type feature information (512 + 512 values), which is input into the fully connected layer of the CNN model to output the type of the named entity.
- steps 210-212 are consistent with steps 103-105 in the above-mentioned first embodiment of the text error correction method, and will not be repeated here.
- in this embodiment, the text to be corrected is converted into a directed acyclic graph, and a preset maximum-probability-path search algorithm is called to find the maximum segmentation combination based on word frequency from the directed acyclic graph; word segmentation is performed on the maximum segmentation combination to obtain a word sequence, which is input into the word-sequence-based named entity recognition model to determine the named entities. Because the system recognizes named entities on the basis of word sequences, errors can be located accurately and quickly according to the named entities, improving the efficiency of error correction.
- the third embodiment of the text error correction method in the embodiment of the present application includes:
- the corresponding candidate entities are extracted and collected to generate a correction set.
- each named entity is composed of one or more characters; each such character is called a target word, and the word image of the target word and a glyph vector containing its glyph features are determined, where the glyph vector of the word image can be determined by a convolutional neural network.
- the word image of the target word is generated according to the writing method of multiple fonts corresponding to the target word. Specifically, the font images corresponding to the target word in different fonts are determined, and all font images of the target word are spliced to generate a word image with a depth of D, where D is the number of font images of the target word.
- the font images of multiple fonts are used to generate a glyph vector including glyph features, so that the glyph vector of the target word includes the glyph features of multiple fonts.
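Under the assumption that each font rendering is a fixed-size grayscale array, the depth-D word image can be built by stacking the renderings along a new depth axis; the function name and sizes below are illustrative:

```python
import numpy as np

def build_word_image(font_images):
    """Stack one character's renderings in D different fonts along the
    depth axis, producing the depth-D word image from which the glyph
    vector is later extracted."""
    return np.stack(font_images, axis=-1)   # (H, W, D)

# Three hypothetical 24x24 renderings (e.g. regular, clerical, bronze script)
renderings = [np.zeros((24, 24)) for _ in range(3)]
print(build_word_image(renderings).shape)   # (24, 24, 3)
```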
- the glyph features of the candidate entities are determined according to the above steps, and the glyph similarity is calculated from the glyph features of the named entity and the candidate entities, so that the glyph similarity can be compared with the glyph similarity threshold.
- the "fonts" in this embodiment may also include scripts of different historical periods, such as bronze inscriptions, cursive script, and Wei stele script, as long as the font images capture glyph features.
- the server performs phonetic conversion on the Chinese characters in the named entity to generate pinyin, and splices other pinyin in the named entity to generate a pinyin string.
- the server performs the same phonetic conversion on the corresponding candidate entity, analyzes the pronunciation of the named entity and the candidate entity, and calculates the pronunciation similarity between them; the pronunciation similarity is compared with a preset pronunciation similarity threshold, and when the pronunciation similarity is greater than the pronunciation similarity threshold, the corresponding candidate entities are extracted and collected to generate a correction set.
- a phonetic-to-sound-code mapping rule is formulated that divides the pronunciation of a Chinese character into parts and maps each phonetic part to a code character according to a simple substitution rule. The pronunciation mainly covers the final, the initial, the complement, and the tone, occupying four code characters: the first position is the final, where the 24 finals from "a" to "ong" are replaced by the digits "1-9" and the letters "A-K"; the second position is the initial, likewise replaced by the digits "1-9" and the letters "A-J", with "Z" and "ZH" converted identically; and the fourth position is the tone, with "1-4" replacing the four tones of Chinese characters.
- the named entity and the candidate entity are each encoded according to the phonetic-to-sound-code mapping rules, and the similarity of the encoded results is compared; a distance algorithm is used to compare the similarity between the two codes, yielding the pronunciation similarity between the named entity and the candidate entity.
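A sketch of this comparison step: Levenshtein edit distance as the "distance algorithm" over sound-code strings. The 4-character codes in the example are hypothetical placeholders, since the full final/initial substitution table is not reproduced here:

```python
def levenshtein(a, b):
    """Classic edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[-1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def sound_similarity(code_a, code_b):
    """Normalized similarity of two sound-code strings in [0, 1]."""
    if not code_a and not code_b:
        return 1.0
    return 1 - levenshtein(code_a, code_b) / max(len(code_a), len(code_b))

# Hypothetical 4-character codes (final, initial, complement, tone) for two
# characters that differ only in tone, e.g. 病 bìng vs 柄 bǐng.
print(sound_similarity("H1_4", "H1_3"))  # 0.75
```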
- the server encodes the words or characters in the named entity and in the candidate entity according to preset encoding rules; that is, preset encoding identifiers identify each word or character, converting them into corresponding encoding identifier characters and generating two encoded identification strings. The similarity of the encoding results is then compared: the structures of the encoded identifiers in the two strings are compared, and when the encoded identifiers appear in the same order, it is determined whether the numbers are consistent. A distance algorithm is used to compare the similarity between the two encoded identification strings, yielding the similarity between the word combinations of the named entity and the candidate entity.
- the word order of the named entity and the candidate entity is analyzed separately to determine whether the candidate entity is composed of the named entity's characters in an adjusted order; the word-order similarity of the named entity and the candidate entity is calculated and compared with a preset word-order similarity threshold, and when the word-order similarity is greater than the threshold, the corresponding candidate entities are extracted and collected to generate a correction set.
- the server encodes the words or characters in the named entity and in the candidate entity according to the preset encoding rules, generating two encoded identification strings, and compares the similarity of the encoded results; that is, the structures of the encoded identifiers in the two strings are compared, and when the number of encoded identifiers is the same and the identifiers themselves are consistent, it is judged whether their arrangement order is consistent.
- the distance algorithm is used to compare the similarity between the two coded identifier strings to obtain the similarity of the word order of the named entity and the candidate entity.
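A simple illustrative check for the word-order case: a candidate made of exactly the entity's characters in a different order scores 1.0, while partial character overlap scores proportionally (the real comparison runs a distance algorithm over encoded identifier strings):

```python
from collections import Counter

def word_order_similarity(entity, candidate):
    """1.0 when the candidate uses exactly the entity's characters
    (possibly reordered); otherwise the shared-character fraction — a
    simple stand-in for comparing the encoded identifier strings."""
    if sorted(entity) == sorted(candidate):
        return 1.0                     # pure transposition (or identical)
    shared = sum((Counter(entity) & Counter(candidate)).values())
    return shared / max(len(entity), len(candidate))

print(word_order_similarity("蜂蜜", "蜜蜂"))  # 1.0 (transposed characters)
print(word_order_similarity("蜂蜜", "蜜糖"))  # 0.5 (one shared character)
```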
- the server calculates the occurrence probability of each candidate entity in the correction set in the text to be corrected according to the preset domain language model.
- the candidate entities are sorted according to the comparison results of the corresponding occurrence probabilities to generate an occurrence sequence. According to the sorting results, the candidate entity with the highest probability of occurrence is selected from the occurrence sequence to modify the named entity of the text to be corrected, so as to obtain the corrected text.
- the named entities that need to be corrected are used as confusing words and collected into a confusion dictionary, and the confusion dictionary is called to traverse each word in the text to be corrected after word segmentation processing, so as to improve the efficiency and accuracy of error correction.
- steps 301-303 are consistent with steps 101-103 in the first embodiment of the text error correction method described above, and will not be repeated here.
- glyph analysis, phonetic analysis, and word-structure analysis are performed on the named entities and candidate entities, so that various kinds of errors in the text can be identified and the text to be corrected can be corrected in a targeted manner, which improves the accuracy of text error correction.
- the fourth embodiment of the text error correction method in the embodiment of the present application includes:
- if the correction set contains multiple candidate entities, the occurrence probability of each candidate entity in the text to be corrected is calculated according to the preset domain language model;
- the server calculates the occurrence probability of each candidate entity in the correction set in the text to be corrected according to the preset domain language model.
- the candidate entities are sorted by their occurrence probabilities to generate an occurrence sequence; according to the sorting result, the candidate entity with the highest occurrence probability is selected from the occurrence sequence to correct the named entity in the text to be corrected, yielding the corrected text.
- steps 401-404 are consistent with steps 101-104 in the first embodiment of the text error correction method described above, and will not be repeated here.
- the occurrence probability of each candidate entity in the text to be corrected is calculated, and the candidate entity with the highest occurrence probability is selected according to the occurrence probabilities to correct the text, which improves the accuracy of correcting named entities in the text to be corrected.
- An embodiment of the text error correction device in the embodiment of the present application includes:
- a word segmentation module 501 configured to obtain text to be corrected, and perform word segmentation processing on the text to be corrected to obtain a named entity set;
- An identification module 502 configured to input the named entity set into a preset convolutional neural network for domain identification, and determine the vertical domain and type of each named entity in the named entity set;
- a selection module 503 configured to select a domain knowledge graph corresponding to the vertical domain from a preset domain knowledge graph set, and select a candidate entity corresponding to the type from the domain knowledge graph;
- Calculation module 504 configured to calculate the matching degree between the named entity and the corresponding candidate entity, and generate a correction set according to the matching degree;
- the correction module 505 is configured to select candidate entities from the correction set, correct the corresponding named entities in the text to be corrected, and obtain the corrected text.
- the text error correction apparatus performs word segmentation on the text to be corrected to obtain named entities, selects from the domain knowledge graph corresponding to the text's vertical domain the candidate entities consistent with each named entity's type, calculates the matching degree between the named entities
- and the candidate entities to generate a correction set, and corrects the text to be corrected according to the correction set.
- referring to FIG. 6, another embodiment of the text error correction apparatus in the embodiments of the present application includes:
- a word segmentation module 501 configured to obtain text to be corrected, and perform word segmentation processing on the text to be corrected to obtain a named entity set;
- An identification module 502 configured to input the named entity set into a preset convolutional neural network for domain identification, and determine the vertical domain and type of each named entity in the named entity set;
- a selection module 503 configured to select a domain knowledge graph corresponding to the vertical domain from a preset domain knowledge graph set, and select a candidate entity corresponding to a type from the domain knowledge graph;
- Calculation module 504 configured to calculate the matching degree between the named entity and the corresponding candidate entity, and generate a correction set according to the matching degree;
- the correction module 505 is configured to select candidate entities from the correction set, correct the corresponding named entities in the text to be corrected, and obtain the corrected text.
- calculation module 504 includes:
- the first calculation unit 5041 is used to calculate the glyph similarity between the named entity and the corresponding candidate entity, and if the glyph similarity is greater than a preset glyph similarity threshold, gather the candidate entities to generate a correction set;
- the second calculation unit 5042 is used to calculate the phonetic similarity between the named entity and the corresponding candidate entity, and if the phonetic similarity is greater than a preset phonetic similarity threshold, gather the candidate entities to generate a correction set;
- the third calculation unit 5043 is configured to analyze the word structure of the named entity and the corresponding candidate entity, and determine the similarity between the named entity and the candidate entity based on the word structure, if the similarity If the degree is greater than the preset word structure similarity threshold, the candidate entities are collected to generate a revised set.
- the calculation module 504 also includes a conversion unit 5044, which is specifically used for:
- it is judged whether the named entity is a mixed spelling of Pinyin and Chinese characters;
- if so, the Chinese characters in the named entity are converted into Pinyin based on a preset Pinyin conversion algorithm.
- the third calculation unit 5043 is specifically used for:
- Analyzing the word order between the named entity and the corresponding candidate entity, and calculating the word-order similarity; if the word-order similarity is greater than a preset word-order similarity threshold, gathering the candidate entities to generate a correction set.
- correction module 505 includes:
- a judging unit 5051 configured to judge whether the correction set contains multiple candidate entities;
- a calculation unit 5052 configured to calculate the occurrence probability of the candidate entity in the text to be corrected according to a preset domain language model if the correction set contains a plurality of the candidate entities;
- a sorting unit 5053 configured to sort the candidate entities by occurrence probability to obtain an occurrence sequence;
- the correction unit 5054 is configured to select candidate entities from the correction set according to the occurrence sequence, and correct the corresponding named entities in the text to be corrected to obtain the corrected text.
- the word segmentation module 501 is specifically used for:
- Segmenting the text to be corrected according to the maximum segmentation combination to obtain a word sequence; inputting the word sequence into a preset word-sequence-based named entity recognition model, and outputting a named entity set.
- the identification module 502 is specifically used for:
- a matching degree between the type characteristic information and a preset type is calculated, and the type of the named entity is determined according to the matching degree.
- the text error correction apparatus performs a series of processing on the text to be corrected to generate a word sequence, which is then input into the named entity recognition model to identify named entities from the word sequence, so that errors can be located accurately and quickly from the named entities; glyph analysis, phonetic analysis, and word-structure analysis are performed on the named entities and candidate entities, so that various kinds of errors in the text can be identified and the text can be corrected in a targeted manner; and the occurrence probability of each candidate entity in the text to be corrected is calculated, and the candidate entity with the highest occurrence probability is selected for correction, which improves the accuracy of correcting named entities in the text to be corrected.
- FIG. 7 is a schematic structural diagram of a text error correction device provided by an embodiment of the present application.
- the text error correction device 700 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732.
- the memory 720 and the storage medium 730 may be temporary storage or persistent storage.
- the program stored in the storage medium 730 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the text error correction device 700 .
- the processor 710 may be configured to communicate with the storage medium 730 , and execute a series of instruction operations in the storage medium 730 on the text error correction device 700 .
- the text error correction device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
- Blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-forgery) and to generate the next block.
- the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
- the present application also provides a computer-readable storage medium.
- the computer-readable storage medium may be a non-volatile computer-readable storage medium.
- the computer-readable storage medium may also be a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on the computer, the computer is made to execute the steps of the text error correction method.
- if the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
- the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of the present application.
- the aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Evolutionary Computation (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Machine Translation (AREA)
Abstract
This application relates to the field of data analysis, and discloses a text error correction method, apparatus, device, and storage medium. The method includes: performing word segmentation on the text to be corrected to obtain a named entity set; inputting the named entity set into a preset convolutional neural network for domain identification to determine the vertical domain and type of each named entity in the set; selecting a domain knowledge graph and candidate entities from a collection of domain knowledge graphs; calculating the matching degree between each named entity and its candidate entities, and generating a correction set according to the matching degree; and selecting candidate entities from the correction set to correct the text to be corrected, yielding the corrected text. By invoking a domain knowledge graph to select candidate entities, this application corrects the errors appearing in the text in a targeted manner, thereby improving error correction efficiency and accuracy. This application also relates to blockchain technology: the text to be corrected and the corrected text can be stored in a blockchain.
Description
This application claims priority to Chinese patent application No. 202110873540.0, entitled "Text Error Correction Method, Apparatus, Device, and Storage Medium", filed with the Chinese Patent Office on July 30, 2021, the entire contents of which are incorporated herein by reference.
This application relates to the field of data analysis, and in particular to a text error correction method, apparatus, device, and storage medium.
Artificial intelligence is influencing every industry with unprecedented force, and intelligent customer service, as the vanguard of this technological revolution, has taken root in vertical domains and is about to flourish. However, users frequently make input errors in text interactions, including wrong characters, missing characters, extra characters, character-order errors, mixed Pinyin and Chinese characters, and so on. Some of these errors are inconsequential and do not affect subsequent processing by the system, while others severely affect the system's subsequent automatic processing; a tiny initial error can lead to a result that is wildly off.
At present, text error correction mainly relies on the perplexity of a language model and the similarity of glyphs and pronunciations to select the most probable replacement character. However, the inventors realized that the prior art can only handle wrong characters; it cannot handle extra characters, missing characters, and other cases, requires multiple additional techniques to cooperate, and cannot solve the problem systematically as a whole, resulting in low error correction efficiency and low accuracy.
SUMMARY
The main purpose of this application is to solve the technical problems of low efficiency and low accuracy of text error correction in the prior art.
A first aspect of this application provides a text error correction method, including: obtaining text to be corrected, and performing word segmentation on the text to be corrected to obtain a named entity set; inputting the named entity set into a preset convolutional neural network for domain identification, and determining the vertical domain and type of each named entity in the named entity set; selecting a domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs, and selecting candidate entities corresponding to the type from the domain knowledge graph; calculating the matching degree between the named entity and the corresponding candidate entities, and generating a correction set according to the matching degree; and selecting candidate entities from the correction set, and correcting the corresponding named entities in the text to be corrected to obtain corrected text.
A second aspect of this application provides a text error correction device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor, when executing the computer-readable instructions, implements the following steps: obtaining text to be corrected, and performing word segmentation on the text to be corrected to obtain a named entity set; inputting the named entity set into a preset convolutional neural network for domain identification, and determining the vertical domain and type of each named entity in the named entity set; selecting a domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs, and selecting candidate entities corresponding to the type from the domain knowledge graph; calculating the matching degree between the named entity and the corresponding candidate entities, and generating a correction set according to the matching degree; and selecting candidate entities from the correction set, and correcting the corresponding named entities in the text to be corrected to obtain corrected text.
A third aspect of this application provides a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to execute the following steps: obtaining text to be corrected, and performing word segmentation on the text to be corrected to obtain a named entity set; inputting the named entity set into a preset convolutional neural network for domain identification, and determining the vertical domain and type of each named entity in the named entity set; selecting a domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs, and selecting candidate entities corresponding to the type from the domain knowledge graph; calculating the matching degree between the named entity and the corresponding candidate entities, and generating a correction set according to the matching degree; and selecting candidate entities from the correction set, and correcting the corresponding named entities in the text to be corrected to obtain corrected text.
A fourth aspect of this application provides a text error correction apparatus, including: a word segmentation module configured to obtain text to be corrected and perform word segmentation on it to obtain named entities; an identification module configured to input the named entity set into a preset convolutional neural network for domain identification and determine the vertical domain and type of each named entity in the set; a selection module configured to select the domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs and select candidate entities corresponding to the type from it; a calculation module configured to calculate the matching degree between the named entity and the corresponding candidate entities and generate a correction set according to the matching degree; and a correction module configured to select candidate entities from the correction set and correct the corresponding named entities in the text to be corrected to obtain corrected text.
In the technical solution provided by this application, word segmentation is performed on the text to be corrected to obtain a named entity set; the named entity set is input into a preset convolutional neural network for domain identification to determine the vertical domain and type of each named entity in the set; the domain knowledge graph corresponding to the vertical domain is selected from a preset collection of domain knowledge graphs, and candidate entities corresponding to each named entity's type are selected from it; the matching degree between the named entities and the candidate entities is calculated, and a correction set is generated according to the matching degree; candidate entities are selected from the correction set to correct the corresponding named entities in the text, yielding the corrected text. By invoking a domain knowledge graph to select candidate entities, the technical solution provided by this application corrects the errors appearing in the text in a targeted manner, thereby improving error correction efficiency and accuracy.
FIG. 1 is a schematic diagram of a first embodiment of the text error correction method in the embodiments of this application;
FIG. 2 is a schematic diagram of a second embodiment of the text error correction method in the embodiments of this application;
FIG. 3 is a schematic diagram of a third embodiment of the text error correction method in the embodiments of this application;
FIG. 4 is a schematic diagram of a fourth embodiment of the text error correction method in the embodiments of this application;
FIG. 5 is a schematic diagram of an embodiment of the text error correction apparatus in the embodiments of this application;
FIG. 6 is a schematic diagram of another embodiment of the text error correction apparatus in the embodiments of this application;
FIG. 7 is a schematic diagram of an embodiment of the text error correction device in the embodiments of this application.
The embodiments of this application provide a text error correction method, apparatus, device, and storage medium: word segmentation is performed on the text to be corrected to obtain a named entity set; the named entity set is input into a preset convolutional neural network for domain identification to determine the vertical domain and type of each named entity in the set; the domain knowledge graph corresponding to the vertical domain is selected from a preset collection of domain knowledge graphs, and candidate entities corresponding to each named entity's type are selected from it; the matching degree between the named entities and the candidate entities is calculated, and a correction set is generated according to the matching degree; candidate entities are selected from the correction set to correct the corresponding named entities in the text, yielding the corrected text. By invoking domain knowledge graphs to select candidate entities, the embodiments of this application correct the errors appearing in the text in a targeted manner, thereby improving error correction efficiency and accuracy.
The terms "first", "second", "third", "fourth", etc. (if any) in the specification, claims, and drawings of this application are used to distinguish similar objects, and need not describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to the process, method, product, or device.
For ease of understanding, the specific content of the embodiments of this application is described below. Referring to FIG. 1, the first embodiment of the text error correction method in the embodiments of this application includes:
101. Obtain the text to be corrected, and perform word segmentation on the text to be corrected to obtain a named entity set;
The server obtains the text to be corrected and performs word segmentation on it, where segmentation relies on a preset segmentation dictionary. The segmentation dictionary is a database of common or fixed words that serves as the benchmark for segmentation; by consulting it, the sentences in the input text are converted into independent words of maximal character length, i.e., each maximal-length independent word is a named entity, and the named entities are gathered into a named entity set. In this embodiment, named entities are person names, organization names, place names, and all other entities identified by a name; broader entities also include numbers, dates, currencies, addresses, and the like.
In this embodiment, word segmentation refers to the process of dividing the character strings of the text to be corrected into word sequences. The segmentation method may be forward maximum matching, reverse maximum matching, a conditional random field model, or a hidden Markov model. Forward maximum matching is efficient, has linear time complexity, is easy to implement, and does not require specifying a maximum word length; reverse maximum matching has linear time complexity but requires specifying a maximum word length (maxLen); the hidden Markov model recognizes out-of-vocabulary words better than maximum matching, but its overall effect depends on the training corpus; the conditional random field model considers not only word frequency but also context, has good learning ability, and therefore recognizes both ambiguous words and out-of-vocabulary words well.
Further, in this embodiment, when the forward maximum matching method scans the sentences of the text to be corrected in the forward direction, segmentation errors are likely to occur when overlapping ambiguity exists. Therefore, this embodiment adds a backtracking mechanism to correct the segmentation results of forward maximum matching, where backtracking is a trial method that steps backward during segmentation to revise the current segmentation result. Adding backtracking improves segmentation accuracy and effectively alleviates overlapping ambiguity.
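As a rough illustration (not the exact implementation, and with a toy dictionary as an assumption), forward maximum matching greedily takes the longest dictionary word at each position:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary word starting there (single char as fallback)."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                words.append(piece)
                i += length
                break
    return words

# toy example with a hypothetical dictionary
print(forward_max_match("南京市长江大桥", {"南京市", "长江大桥", "市长", "江大桥"}))
# → ['南京市', '长江大桥']
```

The backtracking mechanism described above would step back and retry a shorter match when the greedy choice leaves an unsegmentable remainder; that refinement is omitted here for brevity.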
102. Input the named entity set into a preset convolutional neural network for domain identification, and determine the vertical domain and type of each named entity in the named entity set;
The named entity set is input into a preset convolutional neural network (CNN) for domain identification. In this embodiment, the CNN's structure comprises an input layer, network layers, and an output layer. The input layer feeds each named entity of the text to be corrected into the network layers; the output layer, which is the output of the network layers, computes via a logistic regression (softmax) function the probability of each professional domain the named entity involves, and the vertical domain (professional domain) of the named entity is determined from these probabilities. The network layers comprise four parts: a convolutional layer, a pooling layer, a feature concatenation layer, and a fully connected layer. The convolutional layer has two channels, with convolution window sizes of 1 and 2 respectively, so that the CNN extracts features of single words and of adjacent word pairs in the text; the pooling layer uses max pooling to obtain the most salient feature of each channel's convolutional output; the feature concatenation layer concatenates the features of the two channels output by the pooling layer into a feature matrix; finally, the fully connected layer classifies the feature matrix output by the concatenation layer to obtain the named entity's type, and the named entity is stored into a {k, v} collection according to its type, where k denotes the named entity and v denotes its type.
103. Select the domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs, and select candidate entities corresponding to the type from the domain knowledge graph;
The domain knowledge graph corresponding to the named entity's vertical domain is selected from the preset collection of domain knowledge graphs, and candidate entities consistent with the named entity's type are selected from the domain knowledge graph. The collection contains multiple domain knowledge graphs, and a candidate entity is a named entity within a domain knowledge graph.
In this embodiment, a general-domain knowledge graph is a series of diagrams showing the development process and structural relationships of knowledge, using visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw, and display knowledge and the interconnections among its elements. A knowledge graph is a modern theory that combines theories and methods from applied mathematics, graphics, information visualization, and information science with methods such as scientometric citation analysis and co-occurrence analysis, and uses visual graphs to vividly display the core structure, development history, frontier fields, and overall knowledge architecture of a discipline, achieving multidisciplinary integration. A domain-specific knowledge graph, by contrast, demands high precision of knowledge, including definitions of data concepts, categories, relations, and attribute constraints.
104. Calculate the matching degree between the named entity and the corresponding candidate entities, and generate a correction set according to the matching degree;
The server calculates the matching degree between each named entity and its corresponding candidate entities, and generates the correction set according to the matching degree. In this embodiment, according to the domain knowledge graph (G) corresponding to the vertical domain, each entry of the collection {k, v} is compared in turn with the candidate entities (g) of type v in G. If the named entity k exactly matches a candidate entity g, the text to be corrected needs no correction: the higher the matching degree between the named entity and a candidate entity, the lower the correction rate of that named entity. If no candidate entity g exactly matches the named entity k, the candidate entities g with the highest matching degree to k are extracted and gathered to form the correction set C_k, which contains only candidate entities g.
105. Select candidate entities from the correction set, and correct the corresponding named entities in the text to be corrected to obtain corrected text.
The correction set is extracted, and it is judged whether it contains multiple candidate entities. When the correction set contains only one candidate entity, that candidate entity is the corrected named entity, i.e., the named entity is corrected according to it. When the correction set contains multiple candidate entities, the server calculates, according to a preset domain language model, the occurrence probability of each candidate entity in the text to be corrected.
After the occurrence probabilities of the candidate entities are obtained, their values are compared, the candidate entities are sorted by the comparison results to generate an occurrence sequence, and, according to the sorting result, the candidate entity with the highest occurrence probability is selected from the occurrence sequence to correct the named entity of the text to be corrected, yielding the corrected text.
In this embodiment, the named entities that need correction are treated as confusable words and gathered into a confusion dictionary; the confusion dictionary is invoked to traverse each word in the segmented text to be corrected, improving error correction efficiency and accuracy.
In this embodiment of the application, word segmentation is performed on the text to be corrected to obtain named entities; candidate entities consistent with each named entity's type are selected from the domain knowledge graph corresponding to the text's vertical domain; the matching degree between the named entities and the candidate entities is calculated to generate a correction set; and the text is corrected according to the correction set. By invoking the domain knowledge graph to select candidate entities, this embodiment corrects the errors appearing in the text in a targeted manner, improving error correction efficiency and accuracy.
Referring to FIG. 2, the second embodiment of the text error correction method in the embodiments of this application includes:
201. Obtain the text to be corrected, and generate a prefix tree from the text to be corrected according to a preset dictionary;
202. Perform word-graph scanning on the prefix tree to generate a directed acyclic graph;
The server obtains the text to be corrected and uses the dictionary of the Chinese word segmentation tool jieba as the dictionary for segmenting the text, deleting uncommon words and retaining correct and common words as far as possible to reduce the segmenter's size. Using this dictionary, a prefix tree (trie) is generated from the text to be corrected. Word-graph scanning is performed on the trie structure: the dictionary words are placed into a trie, and words whose leading characters are identical share a common prefix, so a trie can store them and speed up lookup. In this embodiment, the sentences of the text to be corrected (composed of one or more sentences) undergo word-graph scanning according to the preset dictionary to generate a directed acyclic graph. Trie construction and word-graph scanning use existing techniques and are therefore not detailed here.
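As an illustrative sketch (not jieba's actual implementation), a dictionary can be stored in a nested-dict trie so that words sharing a prefix share storage and lookups stay fast:

```python
def build_trie(words):
    """Store dictionary words in a nested-dict trie; '$' marks a word end."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def trie_contains(root, word):
    """Walk the trie character by character; the word exists only if the
    walk succeeds AND ends at an end-of-word marker."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return "$" in node

trie = build_trie(["长江", "长江大桥", "大桥"])
print(trie_contains(trie, "长江大桥"))  # → True
print(trie_contains(trie, "长江大"))    # → False (prefix only, not a word)
```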
203. Invoke the preset dynamic-programming maximum-probability-path search algorithm, and find the word-frequency-based maximum segmentation combination from the directed acyclic graph;
The words already segmented in the text to be corrected are looked up and their occurrence frequencies are computed; if a word is absent, the frequency of the least frequent word in the dictionary is used as that word's frequency. Then, following the dynamic-programming algorithm for finding the maximum-probability path, the maximum probability of each sentence in the text is computed in reverse, from right to left, because sentences usually carry many adjectives at the front while the trunk comes later; computing from right to left therefore yields higher accuracy than computing from left to right, similar to reverse maximum matching: P(NodeN) = 1.0, P(NodeN-1) = P(NodeN) * Max(P(last word)), and so on. Finally, combining the directed acyclic graph yields the maximum-probability path, i.e., the maximum-probability segmentation combination.
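A minimal sketch of the right-to-left maximum-probability-path computation over the DAG (the sentence, dictionary, and frequencies are toy assumptions, and log probabilities are used for numerical stability):

```python
import math

def build_dag(sentence, dictionary):
    """dag[i] = list of end indices j such that sentence[i:j+1] is a word."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i, len(sentence)) if sentence[i:j + 1] in dictionary]
        dag[i] = ends or [i]  # single character as fallback
    return dag

def max_prob_segment(sentence, dictionary, freq):
    total = sum(freq.values())
    dag = build_dag(sentence, dictionary)
    n = len(sentence)
    route = {n: (0.0, 0)}
    # dynamic programming from right to left, as in the description above
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(freq.get(sentence[i:j + 1], 1)) - math.log(total) + route[j + 1][0], j)
            for j in dag[i]
        )
    # walk the best route forward to collect the segmentation
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words

print(max_prob_segment("abcd", {"a", "ab", "abc", "cd"}, {"a": 5, "ab": 10, "abc": 2, "cd": 20}))
# → ['ab', 'cd']
```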
204. Perform word segmentation on the text to be corrected according to the maximum segmentation combination to obtain a word sequence;
205. Input the word sequence into a preset word-sequence-based named entity recognition model, and output a named entity set;
The text to be corrected is segmented according to the obtained maximum segmentation combination, i.e., segmented by character combination, to obtain the word sequence. The word sequence is input into the server's preset word-sequence-based named entity recognition model, which outputs the recognition result, i.e., identifies the named entities in the text and gathers them into a named entity set. This embodiment uses a recognition model based on word sequences: the model's input is a word sequence rather than a character sequence, which improves recognition efficiency and reduces memory usage.
In this embodiment, named entity recognition (NER) is an important basic tool for application fields such as information extraction, question answering systems, syntactic analysis, and machine translation, and occupies an important position as natural language processing technology moves toward practical use. In general, the task of a named entity recognition model is to identify, in the text to be processed, three major classes (entities, times, and numbers) and seven minor classes (person names, organization names, place names, times, dates, currencies, and percentages) of named entities.
206. Input the named entity set into the preset convolutional neural network, and invoke the network's logistic regression function to calculate the domain attribute values of the domains involved by each named entity in the set;
207. Compare the domain attribute values, and take the domain with the largest domain attribute value as the vertical domain of each named entity in the set;
The named entity set is input into the preset convolutional neural network (CNN model), and the network's logistic regression (softmax) function is invoked to calculate the domain attribute values of the domains involved by each named entity in the set. The domain attribute values are compared, and the domain with the largest value is selected as the named entity's vertical domain.
In this embodiment, the fully connected layer of the convolutional neural network has two hidden layers, and the number of nodes in its output layer equals the number of preset named entity types; the output layer of the CNN model uses the softmax function to calculate the probability of each domain, i.e., to compute the domain attribute values.
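The softmax step that turns the network's raw scores into domain probabilities can be sketched as follows; the domain names and logits are made up for the example:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution over domains."""
    m = max(logits)                        # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical raw scores for three vertical domains
domains = ["finance", "medical", "insurance"]
probs = softmax([2.0, 0.5, 1.0])
best = domains[probs.index(max(probs))]
print(best)  # → finance
```

The domain with the largest probability (attribute value) is then taken as the entity's vertical domain, as step 207 describes.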
208. Based on the convolutional layer of the convolutional neural network, extract the type feature information of each named entity in the set;
209. Calculate the matching degree between the type feature information and preset types, and determine the type of the named entity according to the matching degree;
Based on the convolutional layer of the convolutional neural network, the type feature information of each named entity in the set is extracted, and the matching degree between the type feature information and the preset named entity types is calculated, i.e., the similarity between the type feature information and the preset types, from which the named entity's type is determined.
In this embodiment, the input layer of the CNN model receives a named entity of dimension 8*271; the convolutional layer has two channels with window dimensions of 1*271 and 2*271 respectively, each channel having 512 convolution kernels, producing outputs of 8*512 and 7*512 matrices. The pooling layer applies max pooling to the convolutional outputs, producing two pieces of type feature information of 1*512 each, output in the form of feature vectors; the text to be corrected can thus yield 1028 kinds of type feature information. The feature concatenation layer splices the two pooling outputs into one 1*1028 piece of type feature information, which is input into the fully connected layer of the CNN model to output the named entity's type.
210. Select the domain knowledge graph corresponding to the vertical domain from the preset collection of domain knowledge graphs, and select candidate entities corresponding to the type from the domain knowledge graph;
211. Calculate the matching degree between the named entity and the corresponding candidate entities, and generate a correction set according to the matching degree;
212. Select candidate entities from the correction set, and correct the corresponding named entities in the text to be corrected to obtain corrected text.
In this embodiment, steps 210-212 are consistent with steps 103-105 of the first embodiment of the text error correction method described above, and are not repeated here.
In this embodiment of the application, a directed acyclic graph is generated from the text to be corrected according to a preset dictionary; the preset maximum-probability-path search algorithm is invoked to find the word-frequency-based maximum segmentation combination from the graph; the text is segmented by this combination to obtain a word sequence, which is input into the word-sequence-based named entity recognition model to determine the named entities. The text to be corrected thus undergoes a series of processing to generate a word sequence, which is then input into the named entity recognition model for word-sequence-based recognition, so that errors can be located accurately and quickly from the named entities, improving error correction efficiency.
Referring to FIG. 3, the third embodiment of the text error correction method in the embodiments of this application includes:
301. Obtain the text to be corrected, and perform word segmentation on it to obtain a named entity set;
302. Input the named entity set into a preset convolutional neural network for domain identification, and determine the vertical domain and type of each named entity in the set;
303. Select the domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs, and select candidate entities corresponding to the type from the domain knowledge graph;
304. Calculate the glyph similarity between the named entity and the corresponding candidate entities; if the glyph similarity is greater than a preset glyph similarity threshold, gather the candidate entities to generate a correction set;
The glyphs of the named entity and of the corresponding candidate entities are analyzed separately, the glyph similarity between the named entity and the candidate entities is calculated and compared with the preset glyph similarity threshold, and when the glyph similarity is greater than the threshold, the corresponding candidate entities are extracted and gathered to generate the correction set.
In this embodiment, each named entity is composed of one or more characters; each such character is called a target character, for which a character image and a glyph vector containing glyph features are determined, where the glyph vector of the character image can be determined based on a convolutional neural network. A character image of the target character is generated from its written forms in multiple fonts: specifically, the font images corresponding to the target character in different fonts are determined and spliced together to generate a character image of depth D, where D is the number of font images of the target character. Using font images of multiple fonts to generate the glyph vector makes the target character's glyph vector contain the glyph features of multiple fonts. Meanwhile, the glyph features of the candidate entity are determined by the same steps, the glyph similarity is computed from the glyph features of the named entity and the candidate entity, and the glyph similarity is compared with the glyph similarity threshold. In addition, the "fonts" in this embodiment may also include scripts of different historical periods, such as bronze inscriptions, cursive script, or Wei stele script, as long as the font contains glyph features.
305. Or, calculate the phonetic similarity between the named entity and the corresponding candidate entities; if the phonetic similarity is greater than a preset phonetic similarity threshold, gather the candidate entities to generate a correction set;
Phonetic analysis is performed on the named entity to judge whether it is a mixed spelling of Pinyin and Chinese characters. If it is, the named entity undergoes phonetic conversion and is normalized into a Pinyin string: specifically, the server converts the Chinese characters in the named entity into Pinyin according to a preset Pinyin conversion algorithm, and splices the result with the other Pinyin in the named entity to generate the Pinyin string.
Further, the pronunciations of the named entity and the corresponding candidate entities are analyzed, the phonetic similarity between them is calculated and compared with the preset phonetic similarity threshold, and when the phonetic similarity is greater than the threshold, the corresponding candidate entities are extracted and gathered to generate the correction set.
In this embodiment, a phonetic-shape code mapping rule is formulated: the pronunciation of a Chinese character is partially mapped, by simple substitution rules, onto character positions, divided into 10 parts. The pronunciation mainly covers the final, the initial, the complement code, and the tone, occupying 4 character positions: the first position is the final, with the 24 finals from "a" to "ong" represented by the digits "1-9" and the letters "A-K"; the second position is the initial, likewise represented by the digits "1-9" and the letters "A-J", where "z" and "zh" convert identically; the fourth position is the tone, with "1-4" replacing the four tones of Chinese characters. The named entity and the candidate entity are each encoded according to the phonetic-shape code mapping rule, and the encoded results are compared for similarity using a distance algorithm to obtain the phonetic similarity of the named entity and the candidate entity.
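A toy sketch of such a phonetic code (the table below is a tiny made-up subset, not the full 10-part mapping; note how "z" and "zh" deliberately share one code so they compare as identical):

```python
from difflib import SequenceMatcher

# toy subset of a phonetic-code table; 'z' and 'zh' intentionally share a code
INITIALS = {"zh": "A", "z": "A", "ch": "B", "c": "B", "sh": "C", "s": "C"}

def phonetic_code(pinyin, tone):
    """Encode a pinyin syllable as initial-code + final + tone digit."""
    for two in ("zh", "ch", "sh"):            # split off two-letter initials first
        if pinyin.startswith(two):
            head, rest = two, pinyin[len(two):]
            break
    else:
        head, rest = pinyin[:1], pinyin[1:]
    return INITIALS.get(head, head.upper()) + rest + str(tone)

def phonetic_similarity(code_a, code_b):
    return SequenceMatcher(None, code_a, code_b).ratio()

print(phonetic_code("zhang", 1))  # → Aang1
print(phonetic_code("zang", 1))   # → Aang1 (identical: zh/z merged)
print(round(phonetic_similarity(phonetic_code("zhang", 1), phonetic_code("chang", 1)), 2))
```

`SequenceMatcher.ratio` stands in here for the distance algorithm; any string-distance measure would serve the same role.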
306. Or, analyze the word structure of the named entity and the corresponding candidate entities, and determine the similarity between the named entity and the candidate entities based on the word structure; if the similarity is greater than a preset word-structure similarity threshold, gather the candidate entities to generate a correction set;
The word structure of the named entity and of the corresponding candidate entities is analyzed, and the word-structure similarity between them is calculated, where word structure includes word combination and word order. In this embodiment, when the word-combination similarity between the named entity and a candidate entity is greater than a preset word-combination similarity threshold, the corresponding candidate entities are extracted and gathered to generate the correction set; or, when the word-order similarity between the named entity and a candidate entity is greater than a preset word-order similarity threshold, the corresponding candidate entities are extracted and gathered to generate the correction set.
The word combinations of the named entity and of the candidate entity are analyzed separately to determine whether the candidate entity is formed from the named entity by adding or removing one character, i.e., whether the named entity is missing a character or has an extra one. The word-combination similarity between the named entity and the candidate entity is calculated and compared with the preset word-combination similarity threshold, and when it is greater than the threshold, the corresponding candidate entities are extracted and gathered to generate the correction set.
Specifically, the server encodes the words or characters in the named entity and the candidate entity according to preset encoding rules, i.e., preset encoding identifiers label each word or character, converting it into a corresponding encoding identifier and producing two encoded identifier strings; the encoded results are compared for similarity, i.e., the identifier composition of the two strings is compared, and when the identifiers of the two strings are arranged in the same order, it is judged whether their numbers are consistent; a distance algorithm compares the similarity of the two strings to obtain the word-combination similarity of the named entity and the candidate entity.
In addition, the word order of the named entity and of the candidate entity is analyzed separately to determine whether the candidate entity is a reordering of the named entity's characters; the word-order similarity between the named entity and the candidate entity is calculated and compared with the preset word-order similarity threshold, and when the word-order similarity is greater than the threshold, the corresponding candidate entities are extracted and gathered to generate the correction set.
Specifically, the server encodes the words or characters in the named entity and the candidate entity according to the preset encoding rules, i.e., preset encoding identifiers label each word or character, converting it into a corresponding encoding identifier and producing two encoded identifier strings; the encoded results are compared for similarity, i.e., the identifier composition of the two strings is compared, and when the two strings contain the same number of identifiers and the identifiers themselves match, it is judged whether their arrangement order is consistent; a distance algorithm compares the similarity of the two strings to obtain the word-order similarity of the named entity and the candidate entity.
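The check "same identifiers, same counts, different order" amounts to a multiset comparison; a minimal sketch (the example words are hypothetical):

```python
from collections import Counter

def is_reordering(entity, candidate):
    """True when candidate uses exactly the same characters as entity,
    only in a different order (a character-order error)."""
    return entity != candidate and Counter(entity) == Counter(candidate)

print(is_reordering("蜜蜂", "蜂蜜"))  # → True  (same characters, swapped)
print(is_reordering("蜂蜜", "蜂蜡"))  # → False (different characters)
```

When this test passes, a string-distance comparison of the two encoded identifier strings then quantifies how far apart the orderings are.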
307. Select candidate entities from the correction set, and correct the corresponding named entities in the text to be corrected to obtain corrected text.
The correction set is extracted, and it is judged whether it contains multiple candidate entities. When the correction set contains only one candidate entity, that candidate entity is the corrected named entity, i.e., the corresponding named entity in the text to be corrected is corrected according to it. When the correction set contains multiple candidate entities, the server calculates, according to a preset domain language model, the occurrence probability of each candidate entity in the text to be corrected.
After the occurrence probabilities of the candidate entities are obtained, their values are compared, the candidate entities are sorted by the comparison results to generate an occurrence sequence, and, according to the sorting result, the candidate entity with the highest occurrence probability is selected from the occurrence sequence to correct the named entity of the text to be corrected, yielding the corrected text.
In this embodiment, the named entities that need correction are treated as confusable words and gathered into a confusion dictionary; the confusion dictionary is invoked to traverse each word in the segmented text to be corrected, improving error correction efficiency and accuracy.
In this embodiment of the application, steps 301-303 are consistent with steps 101-103 of the first embodiment of the text error correction method described above, and are not repeated here.
In this embodiment of the application, glyph analysis, phonetic analysis, and word-structure analysis are performed on the named entities and candidate entities respectively, so that various kinds of errors in the text can be identified and the text to be corrected can be corrected in a targeted manner, improving the accuracy of text error correction.
Referring to FIG. 4, the fourth embodiment of the text error correction method in the embodiments of this application includes:
401. Obtain the text to be corrected, and perform word segmentation on it to obtain a named entity set;
402. Input the named entity set into a preset convolutional neural network for domain identification, and determine the vertical domain and type of each named entity in the set;
403. Select the domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs, and select candidate entities corresponding to the type from the domain knowledge graph;
404. Calculate the matching degree between the named entity and the corresponding candidate entities, and generate a correction set according to the matching degree;
405. Judge whether the correction set contains multiple candidate entities;
406. If the correction set contains multiple candidate entities, calculate the occurrence probability of each candidate entity in the text to be corrected according to a preset domain language model;
The correction set is extracted, and it is judged whether it contains multiple candidate entities. When the correction set contains only one candidate entity, that candidate entity is the corrected named entity, i.e., the named entity is corrected according to it. When the correction set contains multiple candidate entities, the server calculates, according to the preset domain language model, the occurrence probability of each candidate entity in the text to be corrected.
407. Sort the candidate entities by occurrence probability to obtain an occurrence sequence;
408. Select candidate entities from the correction set according to the occurrence sequence, and correct the corresponding named entities in the text to be corrected to obtain corrected text.
After the occurrence probabilities of the candidate entities are obtained, their values are compared, the candidate entities are sorted by the comparison results to generate an occurrence sequence, and, according to the sorting result, the candidate entity with the highest occurrence probability is selected from the occurrence sequence to correct the named entity of the text to be corrected, yielding the corrected text.
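The selection logic of steps 405-408 can be sketched as follows; the "language model" here is a hypothetical unigram lookup standing in for the preset domain language model:

```python
def correct_entity(text, entity, correction_set, lm_prob):
    """Rank candidates by their occurrence probability in the text
    (per a domain language model) and apply the top-ranked one."""
    if len(correction_set) == 1:
        best = correction_set[0]
    else:
        occurrence = sorted(correction_set, key=lambda c: lm_prob(text, c), reverse=True)
        best = occurrence[0]
    return text.replace(entity, best, 1)

# hypothetical unigram "language model" for the sketch
counts = {"支付宝": 9, "支付包": 1}
lm = lambda text, cand: counts.get(cand, 0)
print(correct_entity("打开支付包转账", "支付包", ["支付宝", "支付包"], lm))
# → 打开支付宝转账
```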
In this embodiment of the application, steps 401-404 are consistent with steps 101-104 of the first embodiment of the text error correction method described above, and are not repeated here.
In this embodiment of the application, the occurrence probability of each candidate entity in the text to be corrected is calculated, and the candidate entity with the highest occurrence probability is selected according to the occurrence probabilities to correct the text, improving the accuracy of correcting named entities in the text to be corrected.
The text error correction method in the embodiments of this application has been described above; the text error correction apparatus in the embodiments of this application is described below. Referring to FIG. 5, an embodiment of the text error correction apparatus in the embodiments of this application includes:
a word segmentation module 501 configured to obtain the text to be corrected and perform word segmentation on it to obtain a named entity set;
an identification module 502 configured to input the named entity set into a preset convolutional neural network for domain identification and determine the vertical domain and type of each named entity in the set;
a selection module 503 configured to select the domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs and select candidate entities corresponding to the type from the domain knowledge graph;
a calculation module 504 configured to calculate the matching degree between the named entity and the corresponding candidate entities and generate a correction set according to the matching degree;
a correction module 505 configured to select candidate entities from the correction set and correct the corresponding named entities in the text to be corrected to obtain corrected text.
In this embodiment of the application, the text error correction apparatus performs word segmentation on the text to be corrected to obtain named entities, selects from the domain knowledge graph corresponding to the text's vertical domain the candidate entities consistent with each named entity's type, calculates the matching degree between the named entities and the candidate entities to generate a correction set, and corrects the text according to the correction set. By invoking the domain knowledge graph to select candidate entities, this proposal corrects the errors appearing in the text in a targeted manner, improving error correction efficiency and accuracy.
Referring to FIG. 6, another embodiment of the text error correction apparatus in the embodiments of this application includes:
a word segmentation module 501 configured to obtain the text to be corrected and perform word segmentation on it to obtain a named entity set;
an identification module 502 configured to input the named entity set into a preset convolutional neural network for domain identification and determine the vertical domain and type of each named entity in the set;
a selection module 503 configured to select the domain knowledge graph corresponding to the vertical domain from a preset collection of domain knowledge graphs and select candidate entities corresponding to the type from the domain knowledge graph;
a calculation module 504 configured to calculate the matching degree between the named entity and the corresponding candidate entities and generate a correction set according to the matching degree;
a correction module 505 configured to select candidate entities from the correction set and correct the corresponding named entities in the text to be corrected to obtain corrected text.
The calculation module 504 includes:
a first calculation unit 5041 configured to calculate the glyph similarity between the named entity and the corresponding candidate entities and, if the glyph similarity is greater than a preset glyph similarity threshold, gather the candidate entities to generate a correction set;
a second calculation unit 5042 configured to calculate the phonetic similarity between the named entity and the corresponding candidate entities and, if the phonetic similarity is greater than a preset phonetic similarity threshold, gather the candidate entities to generate a correction set;
a third calculation unit 5043 configured to analyze the word structure of the named entity and the corresponding candidate entities and determine the similarity between the named entity and the candidate entities based on the word structure and, if the similarity is greater than a preset word-structure similarity threshold, gather the candidate entities to generate a correction set.
The calculation module 504 further includes a conversion unit 5044, which is specifically configured to:
judge whether the named entity is a mixed spelling of Pinyin and Chinese characters;
if the named entity is a mixed spelling of Pinyin and Chinese characters, convert the Chinese characters in the named entity into Pinyin based on a preset Pinyin conversion algorithm.
The third calculation unit 5043 is specifically configured to:
analyze the word combination between the named entity and the corresponding candidate entity, and calculate the word-combination similarity; if the word-combination similarity is greater than a preset word-combination similarity threshold, gather the candidate entities to generate a correction set;
analyze the word order between the named entity and the corresponding candidate entity, and calculate the word-order similarity; if the word-order similarity is greater than a preset word-order similarity threshold, gather the candidate entities to generate a correction set.
The correction module 505 includes:
a judging unit 5051 configured to judge whether the correction set contains multiple candidate entities;
a calculation unit 5052 configured to calculate, if the correction set contains multiple candidate entities, the occurrence probability of each candidate entity in the text to be corrected according to a preset domain language model;
a sorting unit 5053 configured to sort the candidate entities by occurrence probability to obtain an occurrence sequence;
a correction unit 5054 configured to select candidate entities from the correction set according to the occurrence sequence and correct the corresponding named entities in the text to be corrected to obtain corrected text.
The word segmentation module 501 is specifically configured to:
obtain the text to be corrected, and generate a prefix tree from the text according to a preset dictionary; perform word-graph scanning on the prefix tree to generate a directed acyclic graph;
invoke the preset dynamic-programming maximum-probability-path search algorithm to find the word-frequency-based maximum segmentation combination from the directed acyclic graph;
segment the text to be corrected according to the maximum segmentation combination to obtain a word sequence; input the word sequence into a preset word-sequence-based named entity recognition model, and output a named entity set.
The identification module 502 is specifically configured to:
input the named entity set into the preset convolutional neural network and invoke the network's logistic regression function to calculate the domain attribute values of the domains involved by each named entity in the set;
compare the domain attribute values, and take the domain with the largest domain attribute value as the vertical domain of each named entity in the set;
extract, based on the convolutional layer of the convolutional neural network, the type feature information of each named entity in the set;
calculate the matching degree between the type feature information and preset types, and determine the named entity's type according to the matching degree.
In this embodiment of the application, the text error correction apparatus performs a series of processing on the text to be corrected to generate a word sequence, which is then input into the named entity recognition model for word-sequence-based named entity recognition, so that errors can be located accurately and quickly from the named entities; glyph analysis, phonetic analysis, and word-structure analysis are performed on the named entities and candidate entities, so that various kinds of errors in the text can be identified and corrected in a targeted manner; and the occurrence probability of each candidate entity in the text is calculated, and the candidate entity with the highest occurrence probability is selected for correction, improving the accuracy of correcting named entities in the text to be corrected.
Referring to FIG. 7, an embodiment of the text error correction device in the embodiments of this application is described in detail below from the perspective of hardware processing.
FIG. 7 is a schematic structural diagram of a text error correction device provided by an embodiment of this application. The text error correction device 700 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 710 (e.g., one or more processors), a memory 720, and one or more storage media 730 (e.g., one or more mass storage devices) storing application programs 733 or data 732. The memory 720 and the storage medium 730 may be transient or persistent storage. The program stored in the storage medium 730 may include one or more modules (not shown), each of which may include a series of instruction operations on the text error correction device 700. Further, the processor 710 may be configured to communicate with the storage medium 730 and execute, on the text error correction device 700, the series of instruction operations in the storage medium 730.
The text error correction device 700 may also include one or more power supplies 740, one or more wired or wireless network interfaces 750, one or more input/output interfaces 760, and/or one or more operating systems 731, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. Those skilled in the art can understand that the device structure shown in FIG. 7 does not constitute a limitation on the text error correction device, which may include more or fewer components than shown, combine certain components, or use a different component arrangement.
The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of its information (anti-forgery) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
This application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium, storing instructions that, when run on a computer, cause the computer to execute the steps of the text error correction method.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above embodiments are only intended to illustrate the technical solution of this application, not to limit it; although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of the technical features with equivalents, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (20)
- 一种文本纠错方法,其中,所述文本纠错方法包括:获取待纠错文本,并对所述待纠错文本进行分词处理,得到命名实体集;将所述命名实体集输入至预设的卷积神经网络中进行领域识别,确定所述命名实体集中各命名实体的垂直领域及类型;从预设的领域知识图谱集中选取与所述垂直领域对应的领域知识图谱,并从所述领域知识图谱中选取与所述类型对应的候选实体;计算所述命名实体与对应的候选实体之间的匹配度,并根据所述匹配度生成修正集合;从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本。
- 根据权利要求1所述的文本纠错方法,其中,所述计算所述命名实体与对应的候选实体之间的匹配度,并根据所述匹配度生成修正集合包括:计算所述命名实体与对应的候选实体之间的字形相似度,若所述字形相似度大于预设的字形相似阈值,则汇集所述候选实体生成修正集合;或,计算所述命名实体与对应的候选实体之间的字音相似度,若所述字音相似度大于预设的字音相似阈值,则汇集所述候选实体生成修正集合;或,分析所述命名实体与对应的候选实体的字词结构,并基于所述字词结构确定所述命名实体与所述候选实体之间的相似度,若所述相似度大于预设的字词结构相似阈值,则汇集所述候选实体生成修正集合。
- 根据权利要求2所述的文本纠错方法,其中,在所述计算所述命名实体与对应的候选实体之间的字音相似度,若所述字音相似度大于预设的字音相似阈值,则汇集所述候选实体生成修正集合之前,还包括:判断所述命名实体是否为拼音汉字混合拼写;若是,则基于预设的拼音转化算法,将所述命名实体中的汉字对应转化为拼音。
- 根据权利要求2所述的文本纠错方法,其中,所述分析所述命名实体与对应的候选实体的字词结构,并基于所述字词结构确定所述命名实体与所述候选实体之间的相似度,若所述相似度大于预设的字词结构相似阈值,则汇集所述候选实体生成修正集合包括:分析所述命名实体与对应的候选实体的字词组合,计算所述字词组合的相似度;若所述字词组合的相似度大于预设的字词组合相似阈值,则汇集所述候选实体生成修正集合;或,分析所述命名实体与对应的候选实体的字序,计算所述字序的相似度;若所述字序的相似度大于预设的字序相似阈值,则汇集所述候选实体生成修正集合。
- 根据权利要求1-4中任一项所述的文本纠错方法,其中,所述从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本包括:判断所述修正集合是否包含多个所述候选实体;若是,则根据预设的领域语言模型,计算所述候选实体在所述待纠错文本中的出现概率;对所述候选实体按照所述出现概率的大小进行排序,得到出现序列;根据所述出现序列从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本。
- 根据权利要求5所述的文本纠错方法,其中,所述获取待纠错文本,并对所述待纠错文本进行分词处理,得到命名实体集包括:获取待纠错文本,并根据预设的词典,将所述待纠错文本生成前缀树;对所述前缀树进行词图扫描,生成有向无环图;调用预设的动态规划查找最大概率路径算法,从所述有向无环图中查找基于词频的最大切分组合;将所述待纠错文本按照所述最大切分组合进行分词处理,得到词序列;将所述词序列输入预设的基于词序列的命名实体识别模型,输出得到命名实体集。
- 根据权利要求5所述的文本纠错方法,其中,所述将所述命名实体集输入至预设的卷积神经网络中进行领域识别,确定所述命名实体集中各命名实体的垂直领域及类型包括:将所述命名实体集输入至预设的卷积神经网络中,并调用所述卷积神经网络的逻辑回归函数,计算所述命名实体集中各命名实体所涉及的各个领域的领域属性值;对各个所述领域属性值进行比较,将所述领域属性值最大的领域作为所述命名实体集中各命名实体的垂直领域;基于所述卷积神经网络中的卷积层,提取所述命名实体集中各命名实体的类型特征信息;计算所述类型特征信息与预设的类型之间的匹配度,根据所述匹配度确定所述命名实体的类型。
- 一种文本纠错设备,其中,所述文本纠错设备包括:存储器和至少一个处理器,所述存储器中存储有指令,所述存储器和所述至少一个处理器通过线路互连;所述至少一个处理器调用所述存储器中的所述指令,以使得所述网络访问探测设备执行如下所述的文本纠错方法的步骤:获取待纠错文本,并对所述待纠错文本进行分词处理,得到命名实体集;将所述命名实体集输入至预设的卷积神经网络中进行领域识别,确定所述命名实体集中各命名实体的垂直领域及类型;从预设的领域知识图谱集中选取与所述垂直领域对应的领域知识图谱,并从所述领域知识图谱中选取与所述类型对应的候选实体;计算所述命名实体与对应的候选实体之间的匹配度,并根据所述匹配度生成修正集合;从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本。
- 根据权利要求8所述的文本纠错设备,其中,所述文本纠错程序被所述处理器执行实现所述计算所述命名实体与对应的候选实体之间的匹配度,并根据所述匹配度生成修正集合的步骤时,还执行以下步骤:计算所述命名实体与对应的候选实体之间的字形相似度,若所述字形相似度大于预设的字形相似阈值,则汇集所述候选实体生成修正集合;或,计算所述命名实体与对应的候选实体之间的字音相似度,若所述字音相似度大于预设的字音相似阈值,则汇集所述候选实体生成修正集合;或,分析所述命名实体与对应的候选实体的字词结构,并基于所述字词结构确定所述命名实体与所述候选实体之间的相似度,若所述相似度大于预设的字词结构相似阈值,则汇集所述候选实体生成修正集合。
- 根据权利要求9所述的文本纠错设备,其中,所述文本纠错程序被所述处理器执行实现在所述计算所述命名实体与对应的候选实体之间的字音相似度,若所述字音相似度大于预设的字音相似阈值,则汇集所述候选实体生成修正集合的步骤之前,还执行以下步骤:判断所述命名实体是否为拼音汉字混合拼写;若是,则基于预设的拼音转化算法,将所述命名实体中的汉字对应转化为拼音。
- 根据权利要求9所述的文本纠错设备,其中,所述文本纠错程序被所述处理器执行实现所述分析所述命名实体与对应的候选实体的字词结构,并基于所述字词结构确定所述命名实体与所述候选实体之间的相似度,若所述相似度大于预设的字词结构相似阈值,则汇集所述候选实体生成修正集合的步骤时,还执行以下步骤:分析所述命名实体与对应的候选实体的字词组合,计算所述字词组合的相似度;若所述字词组合的相似度大于预设的字词组合相似阈值,则汇集所述候选实体生成修正集合;或,分析所述命名实体与对应的候选实体的字序,计算所述字序的相似度;若所述字序的相似度大于预设的字序相似阈值,则汇集所述候选实体生成修正集合。
- 根据权利要求8-11中任一项所述的文本纠错设备,其中,所述文本纠错程序被所述处理器执行实现所述从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本的步骤时,还执行以下步骤:判断所述修正集合是否包含多个所述候选实体;若是,则根据预设的领域语言模型,计算所述候选实体在所述待纠错文本中的出现概率;对所述候选实体按照所述出现概率的大小进行排序,得到出现序列;根据所述出现序列从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本。
- 根据权利要求12所述的文本纠错设备,其中,所述文本纠错程序被所述处理器执行实现所述获取待纠错文本,并对所述待纠错文本进行分词处理,得到命名实体集的步骤时,还执行以下步骤:获取待纠错文本,并根据预设的词典,将所述待纠错文本生成前缀树;对所述前缀树进行词图扫描,生成有向无环图;调用预设的动态规划查找最大概率路径算法,从所述有向无环图中查找基于词频的最大切分组合;将所述待纠错文本按照所述最大切分组合进行分词处理,得到词序列;将所述词序列输入预设的基于词序列的命名实体识别模型,输出得到命名实体集。
- 一种计算机可读存储介质,所述计算机可读存储介质上存储有指令,其中,所述指令被处理器执行时实现如下所述的文本纠错方法的步骤:获取待纠错文本,并对所述待纠错文本进行分词处理,得到命名实体集;将所述命名实体集输入至预设的卷积神经网络中进行领域识别,确定所述命名实体集中各命名实体的垂直领域及类型;从预设的领域知识图谱集中选取与所述垂直领域对应的领域知识图谱,并从所述领域知识图谱中选取与所述类型对应的候选实体;计算所述命名实体与对应的候选实体之间的匹配度,并根据所述匹配度生成修正集合;从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本。
- 根据权利要求14所述的计算机可读存储介质,其中,所述计算机程序被处理器执行所述计算所述命名实体与对应的候选实体之间的匹配度,并根据所述匹配度生成修正集合的步骤时,还执行如下步骤:计算所述命名实体与对应的候选实体之间的字形相似度,若所述字形相似度大于预设的字形相似阈值,则汇集所述候选实体生成修正集合;或,计算所述命名实体与对应的候选实体之间的字音相似度,若所述字音相似度大于预设的字音相似阈值,则汇集所述候选实体生成修正集合;或,分析所述命名实体与对应的候选实体的字词结构,并基于所述字词结构确定所述命名实体与所述候选实体之间的相似度,若所述相似度大于预设的字词结构相似阈值,则汇集所述候选实体生成修正集合。
- 根据权利要求15所述的计算机可读存储介质,其中,所述计算机程序被处理器执行在所述计算所述命名实体与对应的候选实体之间的字音相似度,若所述字音相似度大于预设的字音相似阈值,则汇集所述候选实体生成修正集合的步骤之前,还执行如下步骤:判断所述命名实体是否为拼音汉字混合拼写;若是,则基于预设的拼音转化算法,将所述命名实体中的汉字对应转化为拼音。
- 根据权利要求15所述的计算机可读存储介质,其中,所述计算机程序被处理器执行所述分析所述命名实体与对应的候选实体的字词结构,并基于所述字词结构确定所述命名实体与所述候选实体之间的相似度,若所述相似度大于预设的字词结构相似阈值,则汇集所述候选实体生成修正集合的步骤时,还执行如下步骤:分析所述命名实体与对应的候选实体的字词组合,计算所述字词组合的相似度;若所述字词组合的相似度大于预设的字词组合相似阈值,则汇集所述候选实体生成修正集合;或,分析所述命名实体与对应的候选实体的字序,计算所述字序的相似度;若所述字序的相似度大于预设的字序相似阈值,则汇集所述候选实体生成修正集合。
- 根据权利要求14-17中任一项所述的计算机可读存储介质,其中,所述计算机程序被处理器执行所述从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本的步骤时,还执行如下步骤:判断所述修正集合是否包含多个所述候选实体;若是,则根据预设的领域语言模型,计算所述候选实体在所述待纠错文本中的出现概率;对所述候选实体按照所述出现概率的大小进行排序,得到出现序列;根据所述出现序列从所述修正集合中选取候选实体,对所述待纠错文本中对应的命名实体进行修正,得到修正文本。
- 根据权利要求18所述的计算机可读存储介质,其中,所述计算机程序被处理器执行所述获取待纠错文本,并对所述待纠错文本进行分词处理,得到命名实体集的步骤时,还执行如下步骤:获取待纠错文本,并根据预设的词典,将所述待纠错文本生成前缀树;对所述前缀树进行词图扫描,生成有向无环图;调用预设的动态规划查找最大概率路径算法,从所述有向无环图中查找基于词频的最大切分组合;将所述待纠错文本按照所述最大切分组合进行分词处理,得到词序列;将所述词序列输入预设的基于词序列的命名实体识别模型,输出得到命名实体集。
- A text error correction apparatus, wherein the text error correction apparatus comprises: a word segmentation module, configured to obtain a text to be corrected and perform word segmentation on the text to be corrected to obtain a named entity set; a recognition module, configured to input the named entity set into a preset convolutional neural network for domain recognition and determine the vertical domain and type of each named entity in the named entity set; a selection module, configured to select, from a preset set of domain knowledge graphs, a domain knowledge graph corresponding to the vertical domain, and select candidate entities corresponding to the type from the domain knowledge graph; a calculation module, configured to calculate a degree of match between the named entity and the corresponding candidate entities, and generate a correction set according to the degree of match; and a correction module, configured to select a candidate entity from the correction set, and correct the corresponding named entity in the text to be corrected to obtain a corrected text.
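The segmentation step recited in the claims (generating a prefix tree from a preset dictionary, scanning it into a directed acyclic graph, and finding the word-frequency-based maximum segmentation combination via a dynamic-programming maximum-probability-path search) can be illustrated with a minimal sketch. The tiny dictionary, frequency counts, and example sentence below are hypothetical stand-ins, not the patent's actual lexicon or model:

```python
import math

# Hypothetical word-frequency dictionary standing in for the claimed preset dictionary.
FREQ = {"南京": 10, "市": 4, "长江": 8, "大桥": 6, "南京市": 9, "市长": 5, "江大桥": 1}
TOTAL = sum(FREQ.values())

def build_dag(text):
    """Word-graph scan: map each start index to end indices of dictionary words."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # fall back to a single character
    return dag

def max_prob_segment(text):
    """Right-to-left dynamic programming over the DAG for the max-probability path."""
    dag = build_dag(text)
    n = len(text)
    route = {n: (0.0, 0)}  # position -> (best log-probability to sentence end, next cut)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(text[i:j], 1) / TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j])
        i = j
    return words

print(max_prob_segment("南京市长江大桥"))  # → ['南京市', '长江', '大桥']
```

The word sequence produced this way would then be fed to the claimed named entity recognition model; that model is not sketched here.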
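The claimed matching step compares a recognized named entity against knowledge-graph candidates by glyph similarity and phonetic similarity, each against a preset threshold. A minimal sketch follows; the Levenshtein-based scores, the per-character pinyin table, and the threshold values are illustrative assumptions, not the patent's disclosed formulas:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[m][n]

def similarity(a: str, b: str) -> float:
    """Normalize edit distance into a [0, 1] similarity score."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b), 1)

# Hypothetical per-character pinyin table; a real system would use a full lexicon.
PINYIN = {"平": "ping", "苹": "ping", "安": "an", "果": "guo"}

def pinyin_similarity(a: str, b: str) -> float:
    """Compare the concatenated pinyin renderings of two entities."""
    pa = "".join(PINYIN.get(ch, ch) for ch in a)
    pb = "".join(PINYIN.get(ch, ch) for ch in b)
    return similarity(pa, pb)

def correction_set(entity, candidates, glyph_th=0.5, sound_th=0.8):
    """Keep candidates whose glyph OR phonetic similarity exceeds its threshold."""
    return [c for c in candidates
            if similarity(entity, c) > glyph_th or pinyin_similarity(entity, c) > sound_th]

print(correction_set("平果", ["苹果", "安果"]))  # → ['苹果']
```

Here "平果" and "苹果" share identical pinyin ("pingguo"), so the homophone candidate survives the phonetic threshold even though the glyph score alone would not admit it.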
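The pre-processing step recited above, detecting whether an entity is a mixed spelling of pinyin and Chinese characters and converting the Chinese characters to pinyin before the phonetic comparison, could look like this sketch; the two-character pinyin table is a hypothetical stand-in for a full preset pinyin conversion algorithm:

```python
import re

# Hypothetical pinyin table; a real converter would cover the full character set.
PINYIN = {"苹": "ping", "果": "guo"}

def is_mixed(entity: str) -> bool:
    """True when the entity contains both Latin letters and CJK characters."""
    has_latin = bool(re.search(r"[a-zA-Z]", entity))
    has_cjk = bool(re.search(r"[\u4e00-\u9fff]", entity))
    return has_latin and has_cjk

def normalize(entity: str) -> str:
    """Convert the Chinese characters of a mixed-spelling entity to pinyin."""
    if not is_mixed(entity):
        return entity
    return "".join(PINYIN.get(ch, ch) for ch in entity)

print(normalize("ping果"))  # → "pingguo"
```

After normalization, the entity is a pure pinyin string, so it can be scored against candidates with the same phonetic similarity measure used for fully Chinese entities.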
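When the correction set contains multiple candidate entities, the claims rank them by their occurrence probability under a preset domain language model. A minimal sketch, with an add-one-smoothed unigram model standing in for that (unspecified) domain model and a toy corpus as assumed training data:

```python
from collections import Counter

# Hypothetical domain corpus; the claims assume a pre-built domain language model.
DOMAIN_CORPUS = "苹果 手机 发布 苹果 价格 上涨 凤梨 进口".split()
COUNTS = Counter(DOMAIN_CORPUS)
TOTAL = sum(COUNTS.values())

def occurrence_prob(candidate: str) -> float:
    """Unigram probability with add-one smoothing as a minimal domain LM."""
    return (COUNTS[candidate] + 1) / (TOTAL + len(COUNTS))

def rank_candidates(correction_set):
    """Sort candidates by descending occurrence probability (the occurrence sequence)."""
    return sorted(correction_set, key=occurrence_prob, reverse=True)

print(rank_candidates(["凤梨", "苹果"]))  # → ['苹果', '凤梨']
```

The top-ranked candidate is then substituted for the misspelled named entity in the text to be corrected; a production system would condition on the surrounding sentence rather than on unigram counts alone.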
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110873540.0 | 2021-07-30 | ||
CN202110873540.0A CN113591457B (zh) | 2021-07-30 | 2021-07-30 | Text error correction method, apparatus, device, and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023005293A1 true WO2023005293A1 (zh) | 2023-02-02 |
Family
ID=78252803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2022/088892 WO2023005293A1 (zh) | 2021-07-30 | 2022-04-25 | Text error correction method, apparatus, device, and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113591457B (zh) |
WO (1) | WO2023005293A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113591457B (zh) * | 2021-07-30 | 2023-10-24 | 平安科技(深圳)有限公司 | Text error correction method, apparatus, device, and storage medium |
CN114186022A (zh) * | 2021-12-02 | 2022-03-15 | 国网山东省电力公司信息通信公司 | Dispatching instruction quality inspection method and system based on speech transcription and knowledge graph |
CN114817465A (zh) * | 2022-04-14 | 2022-07-29 | 海信电子科技(武汉)有限公司 | Entity error correction method for multilingual semantic understanding, and intelligent device |
CN116227479B (zh) * | 2022-12-29 | 2024-05-17 | 易方达基金管理有限公司 | Entity recognition method and apparatus, computer device, and readable storage medium |
CN116010626B (zh) * | 2023-03-24 | 2023-06-27 | 南方电网数字电网研究院有限公司 | Knowledge graph analysis method and apparatus for power consumers, and computer device |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109800407A (zh) * | 2017-11-15 | 2019-05-24 | 腾讯科技(深圳)有限公司 | Intent recognition method, apparatus, computer device, and storage medium |
CN110147549A (zh) * | 2019-04-19 | 2019-08-20 | 阿里巴巴集团控股有限公司 | Method and system for performing text error correction |
CN110597992A (zh) * | 2019-09-10 | 2019-12-20 | 腾讯科技(深圳)有限公司 | Knowledge-graph-based semantic reasoning method and apparatus, and electronic device |
CN110750993A (zh) * | 2019-10-15 | 2020-02-04 | 成都数联铭品科技有限公司 | Word segmentation method and word segmenter, and named entity recognition method and system |
CN111291571A (zh) * | 2020-01-17 | 2020-06-16 | 华为技术有限公司 | Semantic error correction method, electronic device, and storage medium |
US20210050017A1 (en) * | 2019-08-13 | 2021-02-18 | Samsung Electronics Co., Ltd. | System and method for modifying speech recognition result |
CN113591457A (zh) * | 2021-07-30 | 2021-11-02 | 平安科技(深圳)有限公司 | Text error correction method, apparatus, device, and storage medium |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104217717B (zh) * | 2013-05-29 | 2016-11-23 | 腾讯科技(深圳)有限公司 | Method and apparatus for building a language model |
WO2019182974A2 (en) * | 2018-03-21 | 2019-09-26 | Nvidia Corporation | Stereo depth estimation using deep neural networks |
CN109885660B (zh) * | 2019-02-22 | 2020-10-02 | 上海乐言信息科技有限公司 | Knowledge-graph-empowered question answering system and method based on information retrieval |
CN111191051B (zh) * | 2020-04-09 | 2020-07-28 | 速度时空信息科技股份有限公司 | Method and system for constructing an emergency knowledge graph based on Chinese word segmentation technology |
CN112528663B (zh) * | 2020-12-18 | 2024-02-20 | 中国南方电网有限责任公司 | Text error correction method and system for dispatching scenarios in the power grid domain |
CN112685550B (zh) * | 2021-01-12 | 2023-08-04 | 腾讯科技(深圳)有限公司 | Intelligent question answering method and apparatus, server, and computer-readable storage medium |
- 2021
  - 2021-07-30 CN CN202110873540.0A patent/CN113591457B/zh active Active
- 2022
  - 2022-04-25 WO PCT/CN2022/088892 patent/WO2023005293A1/zh active Application Filing
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116306598A (zh) * | 2023-05-22 | 2023-06-23 | 上海蜜度信息技术有限公司 | Customized error correction method, system, device, and medium for words in different domains |
CN116306598B (zh) * | 2023-05-22 | 2023-09-08 | 上海蜜度信息技术有限公司 | Customized error correction method, system, device, and medium for words in different domains |
CN116341543A (zh) * | 2023-05-31 | 2023-06-27 | 安徽商信政通信息技术股份有限公司 | Person-name recognition and error correction method, system, device, and storage medium |
CN116341543B (zh) * | 2023-05-31 | 2023-09-19 | 安徽商信政通信息技术股份有限公司 | Person-name recognition and error correction method, system, device, and storage medium |
CN117454884A (zh) * | 2023-12-20 | 2024-01-26 | 上海蜜度科技股份有限公司 | Historical figure information error correction method, system, electronic device, and storage medium |
CN117454884B (zh) * | 2023-12-20 | 2024-04-09 | 上海蜜度科技股份有限公司 | Historical figure information error correction method, system, electronic device, and storage medium |
CN118152381A (zh) * | 2023-12-20 | 2024-06-07 | 深圳计算科学研究院 | Entity error correction method, apparatus, device, and medium for structured data |
CN117556363A (zh) * | 2024-01-11 | 2024-02-13 | 中电科大数据研究院有限公司 | Dataset anomaly identification method based on multi-source data joint detection |
CN117556363B (zh) * | 2024-01-11 | 2024-04-09 | 中电科大数据研究院有限公司 | Dataset anomaly identification method based on multi-source data joint detection |
CN118072761A (zh) * | 2024-01-31 | 2024-05-24 | 北京语言大学 | Large-model-based pronunciation error detection and articulation image feedback method and apparatus |
Also Published As
Publication number | Publication date |
---|---|
CN113591457B (zh) | 2023-10-24 |
CN113591457A (zh) | 2021-11-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2023005293A1 (zh) | Text error correction method, apparatus, device, and storage medium | |
CN108416058B (zh) | Relation extraction method based on Bi-LSTM with enhanced input information | |
CN109960728B (zh) | Named entity recognition method and system for open-domain conference information | |
CN113901797B (zh) | Text error correction method, apparatus, device, and storage medium | |
CN110750993A (zh) | Word segmentation method and word segmenter, and named entity recognition method and system | |
CN116151132B (zh) | Intelligent code completion method, system, and storage medium for programming-learning scenarios | |
CN110782892B (zh) | Speech-to-text error correction method | |
US20200104635A1 (en) | Invertible text embedding for lexicon-free offline handwriting recognition | |
CN111460793A (zh) | Error correction method, apparatus, device, and storage medium | |
CN110245349B (zh) | Syntactic dependency parsing method and apparatus, and electronic device | |
CN112784576B (zh) | Text dependency syntactic parsing method | |
CN111882462B (zh) | Chinese trademark similarity detection method oriented to multi-factor review criteria | |
CN111782892B (zh) | Prefix-tree-based similar character recognition method, device, apparatus, and storage medium | |
CN115759119B (zh) | Financial text sentiment analysis method, system, medium, and device | |
CN114996467A (zh) | Knowledge graph entity attribute alignment algorithm based on semantic similarity | |
Rama et al. | Fast and unsupervised methods for multilingual cognate clustering | |
Araujo | Part-of-speech tagging with evolutionary algorithms | |
CN113779992B (zh) | Implementation method of the BcBERT-SW-BiLSTM-CRF model based on lexicon enhancement and pre-training | |
CN112528003B (zh) | Multiple-choice question answering method based on semantic ranking and knowledge correction | |
CN114528368A (zh) | Spatial relation extraction method fusing a pre-trained language model with text features | |
CN113536776B (зh) | Confusion sentence generation method, terminal device, and computer-readable storage medium | |
CN115329783A (зh) | Tibetan-Chinese neural machine translation method based on a cross-lingual pre-trained model | |
CN110245331A (зh) | Sentence conversion method, apparatus, server, and computer storage medium | |
CN114579763A (зh) | Character-level adversarial sample generation method for Chinese text classification tasks | |
CN114417816A (зh) | Text scoring method, text scoring model, text scoring device, and storage medium | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22847911 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 22847911 Country of ref document: EP Kind code of ref document: A1 |