CN110134970A - Header error correction method and apparatus - Google Patents

Header error correction method and apparatus

Info

Publication number
CN110134970A
CN110134970A (application CN201910617118.1A)
Authority
CN
China
Prior art keywords
word
title
error correction
recalls
segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910617118.1A
Other languages
Chinese (zh)
Other versions
CN110134970B (en)
Inventor
邓卓彬
罗希意
赖佳伟
付志宏
何径舟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201910617118.1A priority Critical patent/CN110134970B/en
Publication of CN110134970A publication Critical patent/CN110134970A/en
Application granted granted Critical
Publication of CN110134970B publication Critical patent/CN110134970B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/31 - Indexing; Data structures therefor; Storage structures
    • G06F 16/316 - Indexing structures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 - Querying
    • G06F 16/3331 - Query processing
    • G06F 16/3332 - Query translation
    • G06F 16/3334 - Selection or weighting of terms from queries, including natural language queries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present invention provide a header error correction method and apparatus. The method includes: obtaining, based on a corpus, first recall words of each word segment in a title to be corrected; sending a retrieval request to a title search library according to the title to be corrected, and obtaining, from the title search library, similar titles of the title to be corrected, where the title search library performs a data update when it receives the retrieval request; obtaining second recall words of each word segment according to the similar titles; performing feature calculation on the first recall words and the second recall words of each word segment; determining a candidate recall word of each word segment; and correcting the title to be corrected based on the candidate recall word of each word segment. Because the second recall words are obtained from the similar titles in the preset title search library, they can effectively compensate for the possible insufficient recall of the first recall words obtained from the corpus. Combining the first recall words and the second recall words makes it possible to correct the title to be corrected comprehensively, from different dimensions.

Description

Header error correction method and apparatus
Technical field
The present invention relates to the field of text recognition technology, and in particular to a header error correction method and apparatus.
Background
Current text error correction solutions handle general typo correction well, but they do not solve the problem of correcting text that contains knowledge-category information. For example, when the text contains a celebrity's name and one character of the name is miswritten, a general typo-correction approach may fail to detect that the name is wrong, because a miswritten name does not necessarily fall within the scope of ordinary typos. As a result, the text is not fully corrected. Existing error correction schemes therefore cannot solve this problem well.
Summary of the invention
Embodiments of the present invention provide a header error correction method and apparatus to solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a header error correction method, including:
obtaining, based on a corpus, first recall words of each word segment in a title to be corrected;
sending a retrieval request to a title search library according to the title to be corrected, and obtaining, from the title search library, similar titles of the title to be corrected, where the title search library performs a data update when it receives the retrieval request;
obtaining second recall words of each word segment according to the similar titles;
performing feature calculation on the first recall words and the second recall words of each word segment, respectively;
determining a candidate recall word of each word segment according to the result of the feature calculation; and
correcting the title to be corrected based on the candidate recall word of each word segment.
In one embodiment, constructing the title search library includes:
constructing the title search library based on an inverted index according to existing article resources, where an existing article includes a title and text content;
obtaining the existing article resources available when the title search library receives the retrieval request; and
updating the title search library according to the existing article resources available when the retrieval request is received.
In one embodiment, obtaining the similar titles of the title to be corrected from the title search library includes:
obtaining, from the title search library, at least one similar title that is similar to the title to be corrected, according to the text content included in the title to be corrected.
In one embodiment, obtaining the second recall words of each word segment according to the similar titles includes:
obtaining alignment information according to the title to be corrected; and
obtaining, from the similar titles, the second recall words of each word segment according to the alignment information.
In one embodiment, performing feature calculation on the first recall words and the second recall words of each word segment respectively includes:
performing feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the first recall words of each word segment; and
performing feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the second recall words of each word segment.
In one embodiment, determining the candidate recall word of each word segment according to the result of the feature calculation includes:
scoring and ranking the first recall words and the second recall words by using a tree model and the GBRank ranking algorithm, according to the results of the feature calculation on the first recall words and the second recall words; and
selecting the candidate recall word from the first recall words and the second recall words according to the scoring and ranking results.
In one embodiment, correcting the title to be corrected based on the candidate recall word of each word segment includes:
replacing a word segment with its candidate recall word in a case where the candidate recall word corresponding to the word segment in the title to be corrected is inconsistent with the word segment.
In a second aspect, an embodiment of the present invention provides a header error correction device, including:
a first obtaining module, configured to obtain, based on a corpus, first recall words of each word segment in a title to be corrected;
a second obtaining module, configured to send a retrieval request to a title search library according to the title to be corrected and to obtain, from the title search library, similar titles of the title to be corrected, where the title search library performs a data update when it receives the retrieval request;
a third obtaining module, configured to obtain second recall words of each word segment according to the similar titles;
a computing module, configured to perform feature calculation on the first recall words and the second recall words of each word segment, respectively;
a determining module, configured to determine a candidate recall word of each word segment according to the result of the feature calculation; and
a correction module, configured to correct the title to be corrected based on the candidate recall word of each word segment.
In one embodiment, the device further includes:
a construction module, configured to construct the title search library based on an inverted index according to existing article resources, where an existing article includes a title and text content;
a fourth obtaining module, configured to obtain the existing article resources available when the title search library receives the retrieval request; and
an update module, configured to update the title search library according to the existing article resources available when the retrieval request is received.
In one embodiment, the second obtaining module includes:
a similar title obtaining submodule, configured to obtain, from the title search library, at least one similar title that is similar to the title to be corrected, according to the text content included in the title to be corrected.
In one embodiment, the third obtaining module includes:
an alignment information obtaining submodule, configured to obtain alignment information according to the title to be corrected; and
a second recall word obtaining submodule, configured to obtain, from the similar titles, the second recall words of each word segment according to the alignment information.
In one embodiment, the computing module includes:
a first computing submodule, configured to perform feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the first recall words of each word segment; and
a second computing submodule, configured to perform feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the second recall words of each word segment.
In one embodiment, the determining module includes:
a ranking submodule, configured to score and rank the first recall words and the second recall words by using a tree model and the GBRank ranking algorithm, according to the results of the feature calculation on the first recall words and the second recall words; and
a selection submodule, configured to select the candidate recall word from the first recall words and the second recall words according to the scoring and ranking results.
In one embodiment, the correction module includes:
an error correction submodule, configured to replace a word segment with its candidate recall word in a case where the candidate recall word corresponding to the word segment in the title to be corrected is inconsistent with the word segment.
In a third aspect, an embodiment of the present invention provides a header error correction terminal. The functions of the header error correction terminal may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In one possible design, the structure of the header error correction terminal includes a processor and a memory. The memory is used to store a program that supports the header error correction terminal in executing the above header error correction method, and the processor is configured to execute the program stored in the memory. The header error correction terminal may further include a communication interface for communicating with other devices or a communication network.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing the computer software instructions used by the header error correction terminal, including a program for executing the above header error correction method.
One of the above technical solutions has the following advantage or beneficial effect: the second recall words obtained from the similar titles in the preset title search library can effectively compensate for the possible insufficient recall of the first recall words obtained from the corpus. Combining the first recall words and the second recall words makes it possible to correct the title to be corrected comprehensively, from different dimensions.
The above summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments and features described above, further aspects, embodiments and features of the present invention will be readily apparent from the drawings and the following detailed description.
Brief description of the drawings
In the drawings, unless otherwise specified, identical reference numerals denote the same or similar parts or elements throughout the several figures. The drawings are not necessarily drawn to scale. It should be understood that these drawings depict only some embodiments disclosed in accordance with the present invention and should not be regarded as limiting the scope of the present invention.
Fig. 1 shows the flow chart of header error correction method according to an embodiment of the present invention.
Fig. 2 shows a flow chart of constructing the title search library in the header error correction method according to an embodiment of the present invention.
Fig. 3 shows the flow chart of header error correction method according to another embodiment of the present invention.
Fig. 4 shows the specific flow chart of the step S300 of header error correction method according to an embodiment of the present invention.
Fig. 5 shows the specific flow chart of the step S400 of header error correction method according to an embodiment of the present invention.
Fig. 6 shows the specific flow chart of the step S500 of header error correction method according to an embodiment of the present invention.
Fig. 7 shows the flow chart of application example according to an embodiment of the present invention.
Fig. 8 shows the structural block diagram of header error correction device according to an embodiment of the present invention.
Fig. 9 shows the structural block diagram of header error correction device according to another embodiment of the present invention.
Figure 10 shows the structural block diagram of the second acquisition module of header error correction device according to an embodiment of the present invention.
Figure 11 shows a structural block diagram of the third obtaining module of the header error correction device according to an embodiment of the present invention.
Figure 12 shows the structural block diagram of the computing module of header error correction device according to an embodiment of the present invention.
Figure 13 shows the structural block diagram of the determining module of header error correction device according to an embodiment of the present invention.
Figure 14 shows the structural block diagram of the correction module of header error correction device according to an embodiment of the present invention.
Figure 15 shows the structural schematic diagram of header error correction terminal according to an embodiment of the present invention.
Detailed description of the embodiments
Hereinafter, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments can be modified in various different ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and the description are to be regarded as illustrative in nature and not restrictive.
Fig. 1 shows a flow chart of the header error correction method according to an embodiment of the present invention. As shown in Fig. 1, the header error correction method includes:
S100: obtaining, based on a corpus, first recall words of each word segment in a title to be corrected. The corpus may be any corpus used in existing text error correction techniques. In this step, global statistics and overall correction can be applied to the title to be corrected following an existing error correction process. One or more first recall words may be obtained for each word segment based on the corpus, and the number of candidates is not necessarily the same for every word segment: some word segments have multiple first recall words as candidates, while others may have only one. Note that the first recall words may include the word segment itself.
In one example, the first recall words of each word segment may be obtained, based on the corpus, according to the contextual and logical relationship between the words in the title to be corrected, or according to whether each word segment in the title to be corrected is a suspected typo.
Note that the corpus referred to in this embodiment can be of several types, for example: (1) a heterogeneous corpus, which has no specific selection rules and collects and stores all kinds of corpora extensively and as they are; (2) a homogeneous corpus, which collects only corpora of the same type of content; (3) a systematic corpus, which collects corpora according to predetermined principles and proportions, is balanced and systematic, and can represent the linguistic facts within a certain range; and (4) a special-purpose corpus, which collects only corpora for a specific purpose. In addition, corpora can be divided by language into monolingual, bilingual and multilingual corpora, and by acquisition unit into text-level, sentence-level and phrase-level corpora. Bilingual and multilingual corpora can further be divided, according to how they are organized, into parallel (aligned) corpora and comparable corpora.
S200: sending a retrieval request to a title search library according to the title to be corrected, to obtain similar titles of the title to be corrected from the title search library. The title search library performs a data update when it receives the retrieval request. Because the title search library updates its data every time it receives a retrieval request for a title to be corrected, the data resources in the library are always up to date, and the similar titles of the title to be corrected can therefore be found more accurately. Whether a title is similar to the title to be corrected can be judged on several aspects, such as sentence structure, text content and semantics.
In one example, the title search library may collect titles from any existing field, such as news report titles, article titles and book titles. The article content corresponding to the stored titles may be stored in the title search library together with the titles, or it may be fetched from the cloud.
In another example, whether the title to be corrected is similar to a title stored in the title search library can be determined through similarity calculation with a vector space model. The basic process is: count the basic language units of a title (words, phrases and the like) and assign them weights; each title then forms a vector that represents the title, and the similarity between titles is expressed by the distance between their vectors.
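As an illustration of this vector space comparison, the sketch below builds term-frequency vectors from whitespace-separated words and compares them with cosine similarity. The tokenization and the unweighted term frequencies are simplifying assumptions, since the patent does not fix a particular weighting scheme or segmenter.

```python
from collections import Counter
import math

def title_vector(title: str) -> Counter:
    # Basic language units here are whitespace-separated words; a Chinese
    # title would first go through a word segmenter instead.
    return Counter(title.split())

def cosine_similarity(a: Counter, b: Counter) -> float:
    # Similarity between titles is expressed by the angle between their vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

query = title_vector("Deng Lun quits program XX because of scheduling conflicts")
stored = title_vector("Deng Lun quits program XX early because of scheduling conflicts")
print(cosine_similarity(query, stored))  # close to 1.0 for near-duplicate titles
```

In practice, weights such as TF-IDF could replace the raw counts without changing the overall structure of the comparison.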
S300: obtaining second recall words of each word segment according to the similar titles. One or more second recall words may be obtained for each word segment, and the number of second recall words may differ from segment to segment: some word segments have multiple second recall words as candidates, while others may have only one. The second recall words may include the word segment itself and may also include the first recall words of that segment. Because the second recall words are obtained from similar titles, and the logic and dimension of their acquisition differ from those of the first recall words, the second recall words can enrich and supplement the first recall words, so that more parallel corpora can be mined for each word segment.
S400: performing feature calculation on the first recall words and the second recall words of each word segment, respectively. The specific features used in the feature calculation can be selected as required; they can be understood as features associated with the first recall words and the second recall words.
In one example, the same features are used when performing feature calculation on the first recall words and on the second recall words.
S500: determining a candidate recall word of each word segment according to the result of the feature calculation. The candidate recall word is chosen from the first recall words or the second recall words. When a word segment corresponds to only one first recall word and one second recall word, one of these two recall words is selected as the candidate recall word. When a word segment corresponds to multiple first recall words and multiple second recall words, the best one among all recall words is selected as the candidate recall word.
S600: correcting the title to be corrected based on the candidate recall word of each word segment. The candidate recall word corresponding to each word segment determines whether the word segment in the title to be corrected needs to be replaced and, if so, with which word. For example, if a word segment in the title to be corrected is identical to its candidate recall word, the word segment is considered correct. If a word segment is not identical to its candidate recall word, the word segment is considered miswritten, and the candidate recall word is substituted into the title to be corrected to complete the correction.
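A minimal sketch of the replacement in S600, assuming the title has already been segmented and that a single candidate recall word has been chosen for each segment that needs one; the example segments and candidate are hypothetical.

```python
def correct_title(segments: list[str], candidates: dict[str, str]) -> str:
    # Replace a segment only when its candidate recall word differs from the segment itself.
    corrected = [candidates.get(seg, seg) for seg in segments]
    return "".join(corrected)  # Chinese titles are joined without spaces

# Hypothetical example: the first segment (a miswritten name) has a corrected candidate.
segments = ["邓论", "因", "档期", "问题", "退出", "XX节目"]
candidates = {"邓论": "邓伦"}
print(correct_title(segments, candidates))
```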
In one embodiment, as shown in Fig. 2, the header error correction method further includes a process of constructing the title search library, which includes:
S700: constructing the title search library based on an inverted index, according to existing article resources, where an existing article includes a title and text content. Existing articles may be any articles with a title, such as news reports, journal articles, papers and posts. Existing article resources can be obtained and updated in various ways, such as from servers, the cloud, databases and big data.
In one example, the text content included in an existing article is either the complete text corresponding to the title, or text content index information that points to the complete text. For example, the title search library may simply count and index the title data of articles, while the article content corresponding to each title is associated as text content index information and fetched quickly from the cloud. This reduces the difficulty of building the title search library and lowers its maintenance cost.
In one example, the title search library can be constructed from the titles and article content of existing articles based on an inverted index and full-text search technology. Although article content data is stored in the title search library, the maintenance cost and retrieval workload are still far lower than those of an existing large knowledge base.
An inverted index, also called a reverse index, postings file or inverted file, is an indexing method used in full-text search to store a mapping from a word to its storage locations in a document or a set of documents. It is the most common data structure in document retrieval systems. With an inverted index, the list of documents that contain a given word can be obtained quickly. An inverted index mainly consists of two parts: the word lexicon and the inverted (postings) file.
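Purely as an illustration of this structure, the toy library below keeps a word lexicon with postings (the inverted file), the stored titles, and per-title text content index information; all identifiers and the ranking rule are assumptions, not the layout actually used in the patent.

```python
from collections import defaultdict

class TitleSearchLibrary:
    """Toy inverted index: word -> set of document ids (word lexicon + postings)."""

    def __init__(self):
        self.postings = defaultdict(set)   # inverted file
        self.titles = {}                   # doc id -> title text
        self.content_refs = {}             # doc id -> text content index info (e.g. a cloud URL)

    def add_article(self, doc_id: int, title: str, content_ref: str) -> None:
        self.titles[doc_id] = title
        self.content_refs[doc_id] = content_ref
        for word in title.split():         # a real system would use a word segmenter
            self.postings[word].add(doc_id)

    def retrieve(self, query_title: str) -> list[int]:
        # Return ids of stored titles sharing at least one word with the query,
        # ranked by the number of shared words.
        hits = defaultdict(int)
        for word in set(query_title.split()):
            for doc_id in self.postings.get(word, ()):
                hits[doc_id] += 1
        return sorted(hits, key=hits.get, reverse=True)

lib = TitleSearchLibrary()
lib.add_article(1, "Deng Lun quits program XX early because of scheduling conflicts", "cloud://article/1")
print(lib.retrieve("Deng Lun quits program XX because of scheduling conflicts"))  # [1]
```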
In one example, the title search library can be constructed from the article resources obtained within a certain period of time before the library is built.
S800: obtaining the existing article resources available when the title search library receives the retrieval request. Because the time at which a retrieval request is received is later than the time at which the search library was built, new article resources may have been generated in the interval between building the library and receiving the request. Likewise, when the title search library receives a new retrieval request, there is a time gap between the new request and the previous one in which new article resources may also be generated, and the updated article resources may contain similar titles associated with the title to be corrected.
S900: updating the title search library according to the existing article resources available when the retrieval request is received. By updating the title search library, the new article resources generated in the period since the last update can be added to the library, so that the resources in the title search library always stay up to date. This improves the timeliness of the title search library and allows the similar titles associated with the title to be corrected to be obtained more accurately from the library according to the retrieval request.
In one embodiment, as shown in Fig. 3, obtaining the similar titles of the title to be corrected from the title search library includes:
S210: obtaining, from the title search library, at least one similar title that is similar to the title to be corrected, according to the text content included in the title to be corrected. Because text content is stored in the title search library together with the titles, the corresponding similar titles can be found quickly from the text content of the title to be corrected, without complex analysis and comparison of sentence constituents, morphology and part of speech.
In one example, the title to be corrected is the headline "Deng Lun quits program XX because of scheduling conflicts", in which the actor's name 邓伦 (Deng Lun) is miswritten as the homophone 邓论. The similar headlines retrieved from the title search library may include "Deng Lun quits program XX early because of scheduling conflicts", "The post-production of program XX is really impressive; it even fooled Deng Lun", "Why did Deng Lun quit program XX" and "Who did Deng Lun bring along in the third episode of program XX".
In one embodiment, as shown in Fig. 4, obtaining the second recall words of each word segment according to the similar titles includes:
S310: obtaining alignment information according to the title to be corrected.
S320: obtaining, from the similar titles, the second recall words of each word segment according to the alignment information.
In one example, the alignment information may include the position information and semantic information of each word segment in the text of the title to be corrected. Using the alignment information, approximate word segments can be extracted from the text of other titles. For example, when the title to be corrected is "Deng Lun (miswritten as 邓论) quits program XX because of scheduling conflicts" and the similar title is "Deng Lun quits program XX early because of scheduling conflicts", the correctly written 邓伦 in the similar title can, according to the alignment information, form a replacement pair with the miswritten name in the title to be corrected, so that 邓伦 in the similar title becomes a second recall word. Because the miswritten name in the title to be corrected is not an ordinary typo, the first recall words do not necessarily contain 邓伦; obtaining second recall words thus effectively compensates for and expands the range of recalled words, mining more recall words.
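One way to realize this alignment step is sketched below, assuming both titles are already word-segmented and are aligned token by token with a sequence matcher; the patent does not prescribe a specific alignment algorithm, so this is only one possible reading.

```python
from difflib import SequenceMatcher

def second_recall_words(query_segments: list[str], similar_segments: list[str]) -> dict[str, set[str]]:
    """Collect replacement pairs: query segment -> segments aligned to it in a similar title."""
    recalls: dict[str, set[str]] = {}
    matcher = SequenceMatcher(a=query_segments, b=similar_segments, autojunk=False)
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op == "replace" and (a1 - a0) == (b1 - b0):
            for seg, cand in zip(query_segments[a0:a1], similar_segments[b0:b1]):
                recalls.setdefault(seg, set()).add(cand)
    return recalls

# Hypothetical segmentation of the titles from the example above.
query = ["邓论", "因", "档期", "问题", "退出", "XX节目"]
similar = ["邓伦", "因", "档期", "问题", "提前", "退出", "XX节目"]
print(second_recall_words(query, similar))  # {'邓论': {'邓伦'}}
```

Segments aligned as replacements of each other (here the two spellings of the name) become second recall words, while insertions such as the extra word for "early" are ignored.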
In one embodiment, as shown in Fig. 5, performing feature calculation on the first recall words and the second recall words of each word segment respectively includes:
S410: performing feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the first recall words of each word segment;
S420: performing feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the second recall words of each word segment.
In one example, the behavioral features may include the frequency with which past users searched for and clicked each word segment and its corresponding first recall words and second recall words. The behavioral features may also include the probability with which past users replaced the word segment with a first recall word or a second recall word.
In one example, the semantic features are features of the literal text itself. The semantic features may include the pinyin-level edit distance between a word segment and its corresponding first recall words and second recall words, the length difference between a word segment and its corresponding first recall words and second recall words, and the word segmentation features of a word segment and its corresponding first recall words and second recall words.
In one example, the language model features may include the difference between the language model feature of the word segment and that of the first recall word, and the difference between the language model feature of the word segment and that of the second recall word.
For example, feature calculation on a second recall word proceeds as follows: according to the article content corresponding to the similar title, the word frequency information of the second recall word of each word segment is obtained with a language model over second recall words; the word frequency information may include the number of times the word appears in the article, and the feature calculation on the second recall word is based on this word frequency information. Feature calculation on a word segment proceeds in the same way: according to the article content corresponding to the word segment, the word frequency information of each word segment is obtained with a language model over word segments; the word frequency information may include the number of times the word appears in the article, and the feature calculation on the word segment is based on it. Because word frequency information is introduced, the computed feature vectors of the second recall words and the word segments contain richer information.
In one example, the word vector features may include the similarity between a word segment and a first recall word, and between a word segment and a second recall word. The word vector features may also include the similarity between a word segment and its context, between a first recall word and the context, and between a second recall word and the context.
In one example, the attribute features may include the source of a word segment, first recall word or second recall word, for example whether it comes from a Baidu encyclopedia entry; whether the word segment, first recall word or second recall word is a proper noun; and whether the word segment, first recall word and second recall word are synonyms of one another.
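To make the five feature groups concrete, the sketch below assembles one feature vector for a (word segment, recall word) pair. The pinyin conversion is left as a placeholder, the behavioral, language model, word vector and attribute signals are assumed to be precomputed, and the helper names and feature keys are illustrative assumptions rather than the patent's actual feature set.

```python
def pinyin_of(text: str) -> str:
    # Placeholder: a real system would convert the text to pinyin here.
    return text

def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def build_features(segment: str, recall: str, stats: dict) -> dict:
    """Assemble one feature vector for a (word segment, recall word) pair.
    `stats` holds precomputed behavioral / language model / word vector / attribute signals."""
    return {
        # semantic features
        "pinyin_edit_distance": edit_distance(pinyin_of(segment), pinyin_of(recall)),
        "length_diff": abs(len(segment) - len(recall)),
        # behavioral features (search/click/replacement statistics, assumed precomputed)
        "click_freq": stats.get("click_freq", 0.0),
        "replace_prob": stats.get("replace_prob", 0.0),
        # language model features (word frequency signals from the article text)
        "lm_freq_diff": stats.get("segment_freq", 0.0) - stats.get("recall_freq", 0.0),
        # word vector features (embedding similarity, assumed precomputed)
        "embedding_sim": stats.get("embedding_sim", 0.0),
        # attribute features
        "recall_is_entry": float(stats.get("recall_is_entry", False)),
        "recall_is_proper_noun": float(stats.get("recall_is_proper_noun", False)),
    }

print(build_features("邓论", "邓伦", {"embedding_sim": 0.83, "recall_is_proper_noun": True}))
```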
In one embodiment, as shown in Fig. 6, determining the candidate recall word of each word segment according to the result of the feature calculation includes:
S510: scoring and ranking the first recall words and the second recall words by using a tree model and the GBRank ranking algorithm (a gradient-boosting-based learning-to-rank algorithm), according to the results of the feature calculation on the first recall words and the second recall words.
S520: selecting the candidate recall word from the first recall words and the second recall words according to the scoring and ranking results.
In one embodiment, scoring and ranking the first recall words and the second recall words with the tree model includes:
selecting the first recall words and second recall words whose scores are higher than a preset threshold; and
sorting the selected first recall words and second recall words in descending order of score.
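The sketch below illustrates the scoring, thresholding and sorting of this embodiment. A generic gradient-boosted regression tree from scikit-learn stands in for the GBRank ranker (an assumption; GBRank itself is trained pairwise), and the training data, feature values and threshold are made up for the example.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Assumed training data: feature vectors for (segment, recall word) pairs
# and relevance labels (1.0 = correct replacement, 0.0 = wrong).
X_train = np.array([[0.0, 0, 0.9], [1.0, 0, 0.8], [3.0, 2, 0.1], [4.0, 1, 0.2]])
y_train = np.array([1.0, 1.0, 0.0, 0.0])

ranker = GradientBoostingRegressor(n_estimators=50, max_depth=3)
ranker.fit(X_train, y_train)

def select_candidate(recalls: list[str], features: np.ndarray, threshold: float = 0.5):
    """Score all first/second recall words, keep those above the threshold,
    sort them by descending score and return the best one (or None)."""
    scores = ranker.predict(features)
    passed = [(s, w) for s, w in zip(scores, recalls) if s > threshold]
    passed.sort(reverse=True)
    return passed[0][1] if passed else None

recalls = ["邓伦", "邓论"]
feats = np.array([[0.0, 0, 0.9], [3.0, 2, 0.1]])  # illustrative feature values only
print(select_candidate(recalls, feats))
```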
In one embodiment, correcting the title to be corrected based on the candidate recall word of each word segment includes:
replacing a word segment with its candidate recall word in a case where the candidate recall word corresponding to the word segment in the title to be corrected is inconsistent with the word segment.
In one embodiment, obtaining, based on a corpus, the first recall words of each word segment in the title to be corrected includes:
segmenting the title to be corrected to obtain the word segments of the title to be corrected; and
obtaining multiple first recall words of each word segment based on the corpus and the word segments of the title to be corrected.
Note that the title to be corrected can be segmented with a variety of word segmentation algorithms, for example statistics-based machine learning algorithms. Commonly used algorithms include HMM (Hidden Markov Model), CRF (conditional random field), SVM (support vector machine) and deep learning algorithms. Taking CRF as an example, the basic idea is to label the training characters; it considers not only the frequency of word occurrences but also the context, and has good learning ability.
The first recall words can be obtained through a phrase substitution table, or recalled through pinyin edit distance. In the first approach, aligned segments are mined from parallel corpora, and candidates are recalled according to the mined aligned segments. In the second approach, the segment is converted to pinyin and candidates with the same or a similar pronunciation are recalled; for example, the pinyin of 档期 ("schedule") in the earlier example is "dangqi", so homophones such as the words meaning "current" or "swinging up" may be recalled.
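A sketch of the pinyin-based recall, assuming the third-party pypinyin package for pronunciation lookup and a small hand-written vocabulary; both the package choice and the vocabulary are illustrative assumptions.

```python
from pypinyin import lazy_pinyin  # third-party package, assumed available

def pinyin_key(word: str) -> str:
    return "".join(lazy_pinyin(word))   # e.g. "档期" -> "dangqi"

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pinyin_recall(segment: str, vocabulary: list[str], max_dist: int = 1) -> list[str]:
    """Recall vocabulary words whose pinyin is identical or close to the segment's pinyin."""
    key = pinyin_key(segment)
    return [w for w in vocabulary if w != segment and edit_distance(key, pinyin_key(w)) <= max_dist]

# Hypothetical vocabulary; homophones and near-homophones of the segment are recalled.
print(pinyin_recall("档期", ["当期", "荡起", "问题", "节目"]))
```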
In one example, as shown in Fig. 7, when a news headline needs to be corrected, the specific error correction process is as follows:
Collect a large number of news media titles (the "text" in Fig. 7), and perform global statistics and language model calculation on part of the text data following the conventional error correction process (the arrow to the right of "text" in Fig. 7). These are the globally statistical features and are consistent with the original error correction process; the globally statistical features include the first recall words.
Build a search library from the collected news media titles to support retrieval.
Retrieve the input title to be corrected in the search library to obtain a large number of similar news headlines that resemble the title to be corrected. From them, the relevant local knowledge is computed (parallel corpora are mined to supplement the recall candidates, and local word frequency, language model and other features are counted), forming precise local knowledge. The precise local knowledge includes the second recall words.
Combine the globally statistical features of the conventional error correction process with the precise local knowledge generated by retrieval to rank the correction candidates, producing the final candidate recall words.
In the embodiments of the present invention, the second recall words obtained from the similar titles in the preset title search library can effectively compensate for the possible insufficient recall of the first recall words obtained from the corpus. Combining the first recall words and the second recall words makes it possible to correct the title to be corrected comprehensively, from different dimensions.
Based on retrieval and context memory, title data similar to the title to be corrected can be obtained by retrieval and mined for the corresponding candidates, which generates dynamically aligned corpora and solves the possible problem of insufficient recall. With the retrieval and context memory approach, only the title data of articles needs to be simply counted and indexed, and there is no need to maintain a huge knowledge base, so the workload drops. Precise local knowledge of the title to be corrected is obtained through retrieval, without performing complex operations on the title to be corrected (such as constituent analysis and morphological analysis); only the whole title needs to be retrieved, and the effect improves significantly.
Fig. 8 shows a structural block diagram of the header error correction device according to an embodiment of the present invention. As shown in Fig. 8, the header error correction device includes:
a first obtaining module 10, configured to obtain, based on a corpus, first recall words of each word segment in a title to be corrected;
a second obtaining module 20, configured to send a retrieval request to a title search library according to the title to be corrected and to obtain, from the title search library, similar titles of the title to be corrected, where the title search library performs a data update when it receives the retrieval request;
a third obtaining module 30, configured to obtain second recall words of each word segment according to the similar titles;
a computing module 40, configured to perform feature calculation on the first recall words and the second recall words of each word segment, respectively;
a determining module 50, configured to determine a candidate recall word of each word segment according to the result of the feature calculation; and
a correction module 60, configured to correct the title to be corrected based on the candidate recall word of each word segment.
In one embodiment, as shown in Fig. 9, the header error correction device further includes:
a construction module 70, configured to construct the title search library based on an inverted index according to existing article resources, where an existing article includes a title and text content;
a fourth obtaining module 80, configured to obtain the existing article resources available when the title search library receives the retrieval request; and
an update module 90, configured to update the title search library according to the existing article resources available when the retrieval request is received.
In one embodiment, as shown in Fig. 10, the second obtaining module 20 includes:
a similar title obtaining submodule 21, configured to obtain, from the title search library, at least one similar title that is similar to the title to be corrected, according to the text content included in the title to be corrected.
In one embodiment, as shown in Fig. 11, the third obtaining module 30 includes:
an alignment information obtaining submodule 31, configured to obtain alignment information according to the title to be corrected; and
a second recall word obtaining submodule 32, configured to obtain, from the similar titles, the second recall words of each word segment according to the alignment information.
In one embodiment, as shown in Fig. 12, the computing module 40 includes:
a first computing submodule 41, configured to perform feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the first recall words of each word segment; and
a second computing submodule 42, configured to perform feature calculation on the behavioral features, semantic features, language model features, word vector features and attribute features of the second recall words of each word segment.
In one embodiment, as shown in Fig. 13, the determining module 50 includes:
a ranking submodule 51, configured to score and rank the first recall words and the second recall words by using a tree model and the GBRank ranking algorithm, according to the results of the feature calculation on the first recall words and the second recall words; and
a selection submodule 52, configured to select the candidate recall word from the first recall words and the second recall words according to the scoring and ranking results.
In one embodiment, as shown in Fig. 14, the correction module 60 includes:
an error correction submodule 61, configured to replace a word segment with its candidate recall word in a case where the candidate recall word corresponding to the word segment in the title to be corrected is inconsistent with the word segment.
For the functions of the modules in the devices of the embodiments of the present invention, reference may be made to the corresponding description in the method above; details are not repeated here.
Figure 15 shows a structural block diagram of the header error correction terminal according to an embodiment of the present invention. As shown in Figure 15, the terminal includes a memory 910 and a processor 920, where the memory 910 stores a computer program that can run on the processor 920. When the processor 920 executes the computer program, the header error correction method in the above embodiments is implemented. There may be one or more memories 910 and processors 920.
The terminal further includes:
a communication interface 930, used to communicate with external devices for data transmission.
The memory 910 may include a high-speed RAM memory and may also include a non-volatile memory, for example at least one magnetic disk memory.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, they can be connected to one another through a bus and communicate with one another. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Figure 15, but this does not mean that there is only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on one chip, the memory 910, the processor 920 and the communication interface 930 can communicate with one another through internal interfaces.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program; when the program is executed by a processor, any of the methods in the above embodiments is implemented.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "an example", "a specific example" or "some examples" means that a particular feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the particular features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples described in this specification, and features of different embodiments or examples, provided they do not contradict each other.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means two or more, unless specifically and clearly defined otherwise.
Any process or method description in a flow chart or otherwise described herein can be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention also includes other implementations, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in a flow chart or otherwise described herein, for example an ordered list of executable instructions considered to implement logical functions, can be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, device or apparatus (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, device or apparatus and execute them). For the purposes of this specification, a "computer-readable medium" can be any means that can contain, store, communicate, propagate or transmit a program for use by, or in combination with, an instruction execution system, device or apparatus. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) with one or more wires, a portable computer disk case (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber-optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium can even be paper or another suitable medium on which the program can be printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one of the following technologies known in the art, or a combination thereof, can be used: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGA), field programmable gate arrays (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and when executed, it performs one of or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present invention can be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. The storage medium can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various changes or substitutions within the technical scope disclosed by the present invention, and these should all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (16)

1. A header error correction method, comprising:
obtaining, based on a corpus, first recall words of each word segment in a title to be corrected;
sending a retrieval request to a title search library according to the title to be corrected, and obtaining, from the title search library, similar titles of the title to be corrected, wherein the title search library performs a data update when receiving the retrieval request;
obtaining second recall words of each word segment according to the similar titles;
performing feature calculation on the first recall words and the second recall words of each word segment, respectively;
determining a candidate recall word of each word segment according to a result of the feature calculation; and
correcting the title to be corrected based on the candidate recall word of each word segment.
2. The method according to claim 1, wherein constructing the title search library comprises:
constructing the title search library based on an inverted index according to existing article resources, wherein the existing articles comprise titles and text content;
obtaining the existing article resources available when the title search library receives the retrieval request; and
updating the title search library according to the existing article resources available when the retrieval request is received.
3. The method according to claim 1, wherein obtaining the similar titles of the title to be corrected from the title search library comprises:
obtaining, from the title search library, at least one similar title that is similar to the title to be corrected, according to text content included in the title to be corrected.
4. The method according to claim 1, wherein obtaining the second recall words of each word segment according to the similar titles comprises:
obtaining alignment information according to the title to be corrected; and
obtaining, from the similar titles, the second recall words of each word segment according to the alignment information.
5. The method according to claim 1, wherein performing feature calculation on the first recall words and the second recall words of each word segment respectively comprises:
performing feature calculation on behavioral features, semantic features, language model features, word vector features and attribute features of the first recall words of each word segment; and
performing feature calculation on behavioral features, semantic features, language model features, word vector features and attribute features of the second recall words of each word segment.
6. The method according to claim 5, wherein determining the candidate recall word of each word segment according to the result of the feature calculation comprises:
scoring and ranking the first recall words and the second recall words by using a tree model and the GBRank ranking algorithm, according to the results of the feature calculation on the first recall words and the second recall words; and
selecting the candidate recall word from the first recall words and the second recall words according to the scoring and ranking results.
7. The method according to claim 6, wherein correcting the title to be corrected based on the candidate recall word of each word segment comprises:
replacing a word segment in the title to be corrected with its candidate recall word in a case where the candidate recall word corresponding to the word segment is inconsistent with the word segment.
8. A header error correction device, comprising:
a first obtaining module configured to obtain, based on a corpus, a first recalled word for each word segment in a title to be corrected;
a second obtaining module configured to send a retrieval request to a title search library according to the title to be corrected, so as to obtain a similar title of the title to be corrected from the title search library, the title search library performing a data update when it receives the retrieval request;
a third obtaining module configured to obtain a second recalled word of each word segment according to the similar title;
a calculation module configured to perform feature calculation on the first recalled word and the second recalled word of each word segment respectively;
a determination module configured to determine a candidate recalled word of each word segment according to the feature calculation results;
a correction module configured to correct the title to be corrected based on the candidate recalled word of each word segment.
9. The device according to claim 8, further comprising:
a construction module configured to construct the title search library based on an inverted index according to existing article resources, wherein each existing article includes a title and body text;
a fourth obtaining module configured to obtain the existing article resources when the title search library receives the retrieval request;
an update module configured to update the title search library according to the existing article resources obtained when the retrieval request is received.
10. The device according to claim 8, wherein the second obtaining module comprises:
a similar title obtaining submodule configured to obtain, from the title search library, at least one similar title approximate to the title to be corrected according to the text content contained in the title to be corrected.
11. The device according to claim 8, wherein the third obtaining module comprises:
an alignment information obtaining submodule configured to obtain alignment information according to the title to be corrected;
a second recalled word obtaining submodule configured to obtain the second recalled word of each word segment from the similar title according to the alignment information.
12. The device according to claim 8, wherein the calculation module comprises:
a first calculation submodule configured to perform feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the first recalled word of each word segment;
a second calculation submodule configured to perform feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the second recalled word of each word segment.
13. The device according to claim 12, wherein the determination module comprises:
a ranking submodule configured to score and rank the first recalled word and the second recalled word with a tree model and the standard ranking algorithm GBRank according to the feature calculation results of the first recalled word and the second recalled word;
a selection submodule configured to select the candidate recalled word from among the first recalled word and the second recalled word according to the scoring and ranking result.
14. The device according to claim 13, wherein the correction module comprises:
an error correction submodule configured to replace the word segment with the candidate recalled word when the candidate recalled word is inconsistent with the corresponding word segment in the title to be corrected.
15. A header error correction terminal, comprising:
one or more processors; and
a storage device configured to store one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 7.
16. A computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method according to any one of claims 1 to 7.
CN201910617118.1A 2019-07-10 2019-07-10 Header error correction method and apparatus Active CN110134970B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910617118.1A CN110134970B (en) 2019-07-10 2019-07-10 Header error correction method and apparatus

Publications (2)

Publication Number Publication Date
CN110134970A (en) 2019-08-16
CN110134970B CN110134970B (en) 2019-10-22

Family

ID=67566876

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910617118.1A Active CN110134970B (en) 2019-07-10 2019-07-10 Header error correction method and apparatus

Country Status (1)

Country Link
CN (1) CN110134970B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040049498A1 (en) * 2002-07-03 2004-03-11 Dehlinger Peter J. Text-classification code, system and method
CN101650742A (en) * 2009-08-27 2010-02-17 中兴通讯股份有限公司 System and method for prompting search condition during English search
CN104462085A (en) * 2013-09-12 2015-03-25 腾讯科技(深圳)有限公司 Method and device for correcting search keywords
CN106503033A (en) * 2016-09-14 2017-03-15 国网山东省电力公司青岛供电公司 A kind of single address search method of power distribution network work and device
CN109543022A (en) * 2018-12-17 2019-03-29 北京百度网讯科技有限公司 Text error correction method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEIXIN_34199405: "Baidu Chinese Error Correction Technology" (《百度中文纠错技术》), https://blog.csdn.net/weixin_34199405/article/details/89952654 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160013A (en) * 2019-12-30 2020-05-15 北京百度网讯科技有限公司 Text error correction method and device
CN111160013B (en) * 2019-12-30 2023-11-24 北京百度网讯科技有限公司 Text error correction method and device
CN111931495A (en) * 2020-07-13 2020-11-13 上海德拓信息技术股份有限公司 Corpus fast matching method and error correction method based on dichotomy and editing distance
CN112000767A (en) * 2020-07-31 2020-11-27 深思考人工智能科技(上海)有限公司 Text-based information extraction method and electronic equipment
CN112416929A (en) * 2020-11-17 2021-02-26 四川长虹电器股份有限公司 Retrieval library management and data retrieval method based on mysql and java

Also Published As

Publication number Publication date
CN110134970B (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110134970B (en) Header error correction method and apparatus
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
US10102191B2 (en) Propagation of changes in master content to variant content
US8078645B2 (en) Operations on multi-level nested data structure
US8332434B2 (en) Method and system for finding appropriate semantic web ontology terms from words
KR101339103B1 (en) Document classifying system and method using semantic feature
CN106940726B (en) Creative automatic generation method and terminal based on knowledge network
US8078638B2 (en) Operations of multi-level nested data structure
CN107704512B (en) Financial product recommendation method based on social data, electronic device and medium
US20130110839A1 (en) Constructing an analysis of a document
CN106844341B (en) Artificial intelligence-based news abstract extraction method and device
US20100094835A1 (en) Automatic query concepts identification and drifting for web search
US11977589B2 (en) Information search method, device, apparatus and computer-readable medium
US20110295857A1 (en) System and method for aligning and indexing multilingual documents
EP3203383A1 (en) Text generation system
CN108304375A (en) A kind of information identifying method and its equipment, storage medium, terminal
US8983965B2 (en) Document rating calculation system, document rating calculation method and program
US20150006528A1 (en) Hierarchical data structure of documents
CN106663117A (en) Constructing a graph that facilitates provision of exploratory suggestions
CN109726289A (en) Event detecting method and device
CN110162768B (en) Method and device for acquiring entity relationship, computer readable medium and electronic equipment
US10678820B2 (en) System and method for computerized semantic indexing and searching
CN110674365A (en) Searching method, device, equipment and storage medium
CN111753534B (en) Identifying sequence titles in a document
CN114141384A (en) Method, apparatus and medium for retrieving medical data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant