CN110134970A - Header error correction method and apparatus - Google Patents
- Publication number
- CN110134970A (application CN201910617118.1A)
- Authority
- CN
- China
- Prior art keywords
- word
- title
- error correction
- recalls
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
An embodiment of the present invention provides a header error correction method and apparatus. The method includes: obtaining, based on a corpus, first recall words for each word segment in a title to be corrected; sending a retrieval request to a title search library according to the title to be corrected, and obtaining from the title search library a similar title to the title to be corrected, the title search library performing a data update upon receiving the retrieval request; obtaining second recall words for each word segment according to the similar title; performing feature calculation on the first recall words and the second recall words of each word segment; determining a candidate recall word for each word segment; and correcting the title to be corrected based on the candidate recall word of each word segment. Because the second recall words are obtained from similar titles retrieved from a preset title search library, they can effectively compensate for possible recall deficiencies of the first recall words obtained from the corpus. By using both the first recall words and the second recall words, the title to be corrected can be corrected comprehensively from different dimensions.
Description
Technical field
The present invention relates to the field of text recognition technology, and in particular to a header error correction method and apparatus.
Background art
Current text error correction solutions handle common typo correction well, but do not adequately solve error correction for text containing knowledge-category information. For example, when the text contains a celebrity's name and one character of the name is written incorrectly, typo correction aimed at the common case may fail to detect that the name is misspelled, because a misspelled name does not necessarily fall within the scope of ordinary typos. As a result, the text is not fully corrected. Existing error correction schemes therefore cannot solve the above problem well.
Summary of the invention
Embodiments of the present invention provide a header error correction method and apparatus to solve one or more technical problems in the prior art.
In a first aspect, an embodiment of the present invention provides a header error correction method, including:
obtaining, based on a corpus, first recall words for each word segment in a title to be corrected;
sending a retrieval request to a title search library according to the title to be corrected, and obtaining from the title search library a similar title to the title to be corrected, the title search library performing a data update upon receiving the retrieval request;
obtaining second recall words for each word segment according to the similar title;
performing feature calculation on the first recall words and the second recall words of each word segment respectively;
determining a candidate recall word for each word segment according to the feature calculation result; and
correcting the title to be corrected based on the candidate recall word of each word segment.
In one embodiment, constructing the title search library includes:
constructing the title search library based on an inverted index according to existing article resources, where an existing article includes a title and text content;
obtaining the existing article resources available when the title search library receives the retrieval request; and
updating the title search library according to the existing article resources available when the retrieval request is received.
In one embodiment, obtaining the similar title to the title to be corrected from the title search library includes:
obtaining, from the title search library according to the text content of the title to be corrected, at least one similar title that approximates the title to be corrected.
In one embodiment, obtaining the second recall words of each word segment according to the similar title includes:
obtaining alignment information according to the title to be corrected; and
obtaining the second recall words of each word segment from the similar title according to the alignment information.
In one embodiment, performing feature calculation on the first recall words and the second recall words of each word segment respectively includes:
performing feature calculation on behavioral features, semantic features, language model features, word vector features, and attribute features of the first recall words of each word segment; and
performing feature calculation on behavioral features, semantic features, language model features, word vector features, and attribute features of the second recall words of each word segment.
In one embodiment, determining the candidate recall word of each word segment according to the feature calculation result includes:
scoring and ranking the first recall words and the second recall words using a tree model and the ranking algorithm GBRank, according to the feature calculation results of the first recall words and the second recall words; and
selecting the candidate recall word from the first recall words and the second recall words according to the scoring and ranking result.
In one embodiment, correcting the title to be corrected based on the candidate recall word of each word segment includes:
replacing a word segment in the title to be corrected with its candidate recall word when the word segment is inconsistent with the corresponding candidate recall word.
In a second aspect, an embodiment of the present invention provides a header error correction apparatus, including:
a first acquisition module, configured to obtain, based on a corpus, first recall words for each word segment in a title to be corrected;
a second acquisition module, configured to send a retrieval request to a title search library according to the title to be corrected, so as to obtain from the title search library a similar title to the title to be corrected, the title search library performing a data update upon receiving the retrieval request;
a third acquisition module, configured to obtain second recall words for each word segment according to the similar title;
a computing module, configured to perform feature calculation on the first recall words and the second recall words of each word segment respectively;
a determining module, configured to determine a candidate recall word for each word segment according to the feature calculation result; and
a correction module, configured to correct the title to be corrected based on the candidate recall word of each word segment.
In one embodiment, the apparatus further includes:
a construction module, configured to construct the title search library based on an inverted index according to existing article resources, where an existing article includes a title and text content;
a fourth acquisition module, configured to obtain the existing article resources available when the title search library receives the retrieval request; and
an update module, configured to update the title search library according to the existing article resources available when the retrieval request is received.
In one embodiment, the second acquisition module includes:
a similar title acquisition submodule, configured to obtain, from the title search library according to the text content of the title to be corrected, at least one similar title that approximates the title to be corrected.
In one embodiment, the third acquisition module includes:
an alignment information acquisition submodule, configured to obtain alignment information according to the title to be corrected; and
a second recall word acquisition submodule, configured to obtain the second recall words of each word segment from the similar title according to the alignment information.
In one embodiment, the computing module includes:
a first computational submodule, configured to perform feature calculation on behavioral features, semantic features, language model features, word vector features, and attribute features of the first recall words of each word segment; and
a second computational submodule, configured to perform feature calculation on behavioral features, semantic features, language model features, word vector features, and attribute features of the second recall words of each word segment.
In one embodiment, the determining module includes:
a ranking submodule, configured to score and rank the first recall words and the second recall words using a tree model and the ranking algorithm GBRank, according to the feature calculation results of the first recall words and the second recall words; and
a selection submodule, configured to select the candidate recall word from the first recall words and the second recall words according to the scoring and ranking result.
In one embodiment, the correction module includes:
an error correction submodule, configured to replace a word segment in the title to be corrected with its candidate recall word when the word segment is inconsistent with the corresponding candidate recall word.
In a third aspect, an embodiment of the present invention provides a header error correction terminal. The functions of the header error correction terminal may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above.
In one possible design, the structure of the header error correction terminal includes a processor and a memory, the memory being used to store a program that supports the header error correction terminal in executing the header error correction method described above, and the processor being configured to execute the program stored in the memory. The header error correction terminal may further include a communication interface for communicating with other devices or communication networks.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium for storing the computer software instructions used by the header error correction terminal, including a program involved in executing the header error correction method described above.
One of the above technical solutions has the following advantage or beneficial effect: in the embodiments of the present invention, second recall words are obtained from similar titles retrieved from a preset title search library, which can effectively compensate for possible recall deficiencies of the first recall words obtained from the corpus. By using both the first recall words and the second recall words, the title to be corrected can be corrected comprehensively from different dimensions.
The above summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present invention will be readily apparent from the drawings and the following detailed description.
Detailed description of the invention
In the drawings, unless otherwise specified, the same reference numerals denote the same or similar parts or elements throughout the several figures. The drawings are not necessarily drawn to scale. It should be understood that the drawings depict only some embodiments disclosed in accordance with the present invention and should not be regarded as limiting the scope of the invention.
Fig. 1 shows the flow chart of header error correction method according to an embodiment of the present invention.
Fig. 2 shows the flow chart of constructing the title search library in a header error correction method according to an embodiment of the present invention.
Fig. 3 shows the flow chart of header error correction method according to another embodiment of the present invention.
Fig. 4 shows the specific flow chart of the step S300 of header error correction method according to an embodiment of the present invention.
Fig. 5 shows the specific flow chart of the step S400 of header error correction method according to an embodiment of the present invention.
Fig. 6 shows the specific flow chart of the step S500 of header error correction method according to an embodiment of the present invention.
Fig. 7 shows the flow chart of application example according to an embodiment of the present invention.
Fig. 8 shows the structural block diagram of header error correction device according to an embodiment of the present invention.
Fig. 9 shows the structural block diagram of header error correction device according to another embodiment of the present invention.
Figure 10 shows the structural block diagram of the second acquisition module of header error correction device according to an embodiment of the present invention.
Figure 11 shows the structural block diagram of the third acquisition module of a header error correction device according to an embodiment of the present invention.
Figure 12 shows the structural block diagram of the computing module of header error correction device according to an embodiment of the present invention.
Figure 13 shows the structural block diagram of the determining module of header error correction device according to an embodiment of the present invention.
Figure 14 shows the structural block diagram of the correction module of header error correction device according to an embodiment of the present invention.
Figure 15 shows the structural schematic diagram of header error correction terminal according to an embodiment of the present invention.
Specific embodiment
Hereinafter, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present invention. Accordingly, the drawings and description are to be regarded as illustrative in nature rather than restrictive.
Fig. 1 shows the flow chart of a header error correction method according to an embodiment of the present invention. As shown in Fig. 1, the header error correction method includes:
S100: obtaining, based on a corpus, first recall words for each word segment in a title to be corrected. The corpus may be any corpus used in existing text error correction techniques. This step can implement global statistics and overall error correction of the title to be corrected according to an existing error correction process. Based on the corpus, one or more first recall words may be obtained for each word segment, and the number of candidate first recall words may differ between word segments. For example, some word segments may have multiple first recall words as candidates, while other word segments may have only one. It should be noted that the first recall words may include the word segment itself.
In one example, the first recall words of each word segment may be obtained based on the corpus according to the contextual logical relationship of each word in the title to be corrected, or according to the logical relationship of whether each word segment in the title to be corrected is a typo.
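As a rough sketch of this step, the following toy example gathers first recall words per segment from a confusion-set style corpus lookup. The patent does not specify a corpus structure; the `CONFUSION_SETS` mapping and all words in it are assumptions for illustration only.

```python
# A toy "corpus" of confusion sets: each word maps to plausible corrections
# observed in the corpus (assumed structure; the patent does not specify one).
CONFUSION_SETS = {
    "recieve": {"receive"},
    "teh": {"the"},
}

def first_recall_words(segments):
    """First recall words per segment: corpus candidates plus the segment itself."""
    return {seg: CONFUSION_SETS.get(seg, set()) | {seg} for seg in segments}

# Each segment maps to its corpus candidates together with itself.
print(first_recall_words(["teh", "program"]))
```

Note that, as the text states, the segment itself is always among its own first recall words, so a correct segment can "win" later during ranking.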
It should be noted that the corpus referred to in this embodiment can be divided into several types, for example: (1) a heterogeneous corpus, which has no specific corpus selection rules and broadly collects and stores various corpora as they are; (2) a homogeneous corpus, which collects only corpora of the same kind of content; (3) a systematic corpus, which collects corpora according to predetermined principles and proportions, so that the corpus is balanced and systematic and can represent the linguistic facts within a certain range; and (4) a dedicated corpus, which collects only corpora for a certain specific purpose. In addition, according to the languages involved, corpora can be divided into monolingual, bilingual, and multilingual corpora. According to the acquisition unit, corpora can be divided into text corpora, sentence corpora, and phrase corpora. According to their organizational form, bilingual and multilingual corpora can be further divided into parallel (aligned) corpora and comparable corpora.
S200: sending a retrieval request to a title search library according to the title to be corrected, so as to obtain from the title search library a similar title to the title to be corrected. The title search library performs a data update when it receives the retrieval request. Since the title search library performs a data update upon receiving the retrieval request for the title to be corrected, the data resources in the title search library are always the latest data resources, so that the similar title to the title to be corrected can be found more accurately in the title search library. A similar title may resemble the title to be corrected in many respects, such as sentence structure, text content, and semantics, which serve as the similarity evaluation criteria.
In one example, titles from any existing field, such as news headlines, article titles, and book titles, may be collected in the title search library. The article content corresponding to the titles stored in the title search library may be stored in the title search library together with the titles, or may be obtained from the cloud.
In another example, whether the title to be corrected is similar to a title stored in the title search library can be judged by similarity calculation using a vector space model. The general process is as follows: count the basic linguistic units of a title (words, phrases, and the like) and assign certain weights; each title then forms a vector representing that title, and the similarity between titles is expressed by the distance between their vectors.
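The vector space model described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes character bigrams as the basic linguistic units, raw counts as the weights, and cosine similarity as the distance measure; the example titles are invented.

```python
from collections import Counter
from math import sqrt

def title_vector(title):
    """Represent a title as a bag-of-units vector (here: character bigrams)."""
    units = [title[i:i + 2] for i in range(len(title) - 1)]
    return Counter(units)

def cosine_similarity(v1, v2):
    """Similarity between two title vectors (1.0 = identical direction)."""
    dot = sum(v1[k] * v2[k] for k in v1)
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

a = title_vector("star exits the program early")
b = title_vector("star exits the program")
c = title_vector("stock market closes higher")
print(cosine_similarity(a, b) > cosine_similarity(a, c))  # True
```

In practice the weights would typically be TF-IDF rather than raw counts, and the units would be the word segments produced by segmentation rather than character bigrams.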
S300: obtaining second recall words for each word segment according to the similar title. One or more second recall words may be obtained for each word segment, and the number of second recall words may differ between word segments. For example, some word segments may have multiple second recall words as candidates, while other word segments may have only one. The second recall words may include the word segment itself, and may also include the first recall words of that word segment. Since the second recall words are obtained from similar titles, and their acquisition logic and dimension differ from those of the first recall words, the second recall words can enrich and supplement the first recall words, so that more parallel corpora of each word segment can be mined.
S400: performing feature calculation on the first recall words and the second recall words of each word segment respectively. The specific features used in the feature calculation of the first recall words and the second recall words can be selected as needed, and can be understood as features associated with the first recall words and the second recall words.
In one example, the same specific features are used in the feature calculation of the first recall words and of the second recall words.
S500: determining the candidate recall word of each word segment according to the feature calculation result. The candidate recall word is determined from among the first recall words and the second recall words. When a word segment corresponds to only one first recall word and one second recall word, one of the two is selected as the candidate recall word. When a word segment corresponds to multiple first recall words and multiple second recall words, the best one among all the recall words is selected as the candidate recall word.
S600: correcting the title to be corrected based on the candidate recall word of each word segment. The candidate recall word corresponding to each word segment determines whether that word segment in the title to be corrected needs to be replaced, and which word it should be replaced with. For example, if a word segment in the title to be corrected is identical to its corresponding candidate recall word, the word segment is considered correct. If a word segment in the title to be corrected differs from its corresponding candidate recall word, the word segment is considered a typo, and the corresponding candidate recall word is substituted into the title to be corrected, thereby completing the error correction.
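The replacement rule in S600 can be sketched as below. The segmentation of the headline and the candidate mapping are hypothetical; the point is only the replace-when-different behavior.

```python
def correct_title(segments, candidates):
    """Replace each word segment with its candidate recall word when they differ.

    segments   -- the word segments of the title to be corrected, in order
    candidates -- mapping from a word segment to its chosen candidate recall word;
                  segments without an entry (or whose candidate is identical)
                  are kept unchanged
    """
    return "".join(candidates.get(seg, seg) for seg in segments)

# Hypothetical segmentation of a headline in which the star's name is misspelled.
segments = ["Deng Luen", " exits ", "the show"]
candidates = {"Deng Luen": "Deng Lun", "the show": "the show"}
print(correct_title(segments, candidates))  # Deng Lun exits the show
```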
In one embodiment, as shown in Fig. 2, the header error correction method further includes a process of constructing the title search library, including:
S700: constructing the title search library based on an inverted index according to existing article resources, where an existing article includes a title and text content. The existing articles may include any article with a title, such as news reports, periodicals, papers, and posts, and may be obtained and updated through various channels such as servers, the cloud, databases, and big data.
In one example, an existing article includes a title and text content, where the text content may be the complete text corresponding to the title, or may be text content index information that represents the complete text. For example, the title search library may simply count and index only the title data of the articles, while the article content corresponding to a title is quickly obtained from the cloud through an association established as text content index information. This reduces the difficulty of constructing the title search library and lowers its maintenance cost.
In one example, the title search library may be constructed from the titles and article contents of existing articles, based on an inverted index and full-text search technology. Although article content data is stored in the title search library, its maintenance cost and retrieval workload are still far lower than those of existing huge knowledge bases.
An inverted index, also commonly known as a reverse index or inverted file, is an indexing method used to store, for full-text search, a mapping from a word to its storage locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. Through an inverted index, the list of documents containing a word can be quickly obtained from the word. An inverted index mainly consists of two parts: the word dictionary and the postings (inverted) file.
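A minimal inverted index over titles can be sketched as follows. This is an assumption-laden toy (whitespace tokenization, in-memory dict, invented titles), not the full word-dictionary/postings-file layout a production search library would use.

```python
from collections import defaultdict

def build_inverted_index(titles):
    """Map each word to the set of title ids (postings) that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for word in title.lower().split():
            index[word].add(doc_id)
    return index

titles = [
    "star exits the program early",
    "why the star exits the program",
    "stock market closes higher",
]
index = build_inverted_index(titles)
# Titles matching several query words: intersect their postings sets.
hits = index["star"] & index["exits"]
print(sorted(hits))  # [0, 1]
```

Updating the library on each retrieval request, as S800/S900 describe, would amount to re-running `build_inverted_index` (or incrementally adding postings) over any articles that appeared since the last update.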
In one example, the title search library may be constructed from the article resources obtained within a certain period before the search library is built.
S800: obtaining the existing article resources available when the title search library receives the retrieval request. Since the time at which the retrieval request is received lags behind the time at which the search library was built, new existing article resources may have been generated in the period between building the search library and receiving the retrieval request. Moreover, when the title search library receives a new retrieval request, a certain amount of time has also elapsed since the previous retrieval request, so new existing article resources may have been generated in that interval as well. The updated existing article resources may contain similar titles associated with the title to be corrected.
S900: updating the title search library according to the existing article resources available when the retrieval request is received. By updating the title search library, the new existing article resources generated in the period since the last update can be added to the title search library, so that the resources in the title search library are always kept up to date. This improves the timeliness of the title search library, and similar titles associated with the title to be corrected can be obtained from the title search library more accurately according to the retrieval request.
In one embodiment, as shown in Fig. 3, obtaining the similar title to the title to be corrected from the title search library according to the title to be corrected includes:
S210: obtaining, from the title search library according to the text content of the title to be corrected, at least one similar title similar to the title to be corrected. Since titled text content is stored in the title search library, the corresponding similar titles can be found quickly through the text content of the title to be corrected, without performing complicated analysis and comparison of sentence constituents, morphology, part of speech, and the like.
In one example, suppose the title to be corrected is the headline "Deng Lun exits program XX due to a scheduling problem", in which the star's name "Deng Lun" is written with a wrong character in the original Chinese. The similar headlines retrieved from the title search library may include "Deng Lun exits program XX early due to a scheduling problem", "The post-production of program XX is really harsh; Deng Lun in particular was badly treated", "Why did Deng Lun exit program XX", "Whom does Deng Lun bring in the third episode of program XX", and so on.
In one embodiment, as shown in Fig. 4, obtaining the second recall words of each word segment according to the similar title includes:
S310: obtaining alignment information according to the title to be corrected.
S320: obtaining the second recall words of each word segment from the similar title according to the alignment information.
In one example, the alignment information may include the position information and semantic information of each word segment in the text content of the title to be corrected. Using the alignment information, approximate word segments can be obtained from the text content of other titles. For example, when the title to be corrected is "Deng Lun exits program XX due to a scheduling problem" (with the name misspelled in the original) and the similar title is "Deng Lun exits program XX early due to a scheduling problem", "Deng Lun" in the similar title can, according to the alignment information, form a word replacement pair with the misspelled name in the title to be corrected, so that "Deng Lun" in the similar title becomes a second recall word. Since the misspelled name in the title to be corrected is not an ordinary typo, "Deng Lun" will not necessarily be obtained among the first recall words; obtaining the second recall words thus effectively compensates for and expands the range of recall words, mining more recall words.
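One way to sketch this alignment step is to align the two segment sequences and collect the replacement pairs. The sketch below is an assumption: the patent does not specify the alignment algorithm, so it uses Python's `difflib.SequenceMatcher` over hypothetical (romanized) segmentations purely for illustration.

```python
from difflib import SequenceMatcher

def second_recall_words(segments, similar_segments):
    """Align the segments of the title to be corrected with those of a similar
    title; one-to-one replaced spans yield second recall words."""
    recalls = {}
    matcher = SequenceMatcher(a=segments, b=similar_segments, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            # Aligned position by position: the similar title's segment
            # becomes a second recall word for the corresponding segment.
            for seg, cand in zip(segments[i1:i2], similar_segments[j1:j2]):
                recalls.setdefault(seg, set()).add(cand)
    return recalls

# Hypothetical segmentations of the two headlines from the example above
# ("Deng Luen" stands for the misspelled name).
title = ["Deng Luen", "exits", "program XX", "due to", "scheduling problem"]
similar = ["Deng Lun", "exits", "program XX", "early", "due to", "scheduling problem"]
print(second_recall_words(title, similar))
```

The extra segment "early" in the similar title is reported as an insertion, not a replacement, so it does not generate a spurious recall word.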
In one embodiment, as shown in Fig. 5, performing feature calculation on the first recall words and the second recall words of each word segment respectively includes:
S410: performing feature calculation on the behavioral features, semantic features, language model features, word vector features, and attribute features of the first recall words of each word segment; and
S420: performing feature calculation on the behavioral features, semantic features, language model features, word vector features, and attribute features of the second recall words of each word segment.
In one example, the behavioral features may include the frequency with which previous users searched for and clicked on each word segment and the corresponding first and second recall words. The behavioral features may also include the probability with which previous users replaced the word segment with a first recall word or a second recall word.
In one example, the semantic features are literal features of the words themselves. The semantic features may include a pinyin-level edit distance feature between the word segment and the corresponding first and second recall words. The semantic features may include a length difference feature between the word segment and the corresponding first and second recall words. The semantic features may also include a word segmentation feature of the word segment and the corresponding first and second recall words, among others.
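The edit distance and length difference features mentioned above can be sketched as follows. This is a plain Levenshtein distance over romanized strings; computing it at the pinyin level, as the text describes, would additionally require converting the Chinese characters to pinyin first (e.g. with a pinyin library), which is omitted here. The example strings are invented.

```python
def edit_distance(a, b):
    """Levenshtein edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Comparing romanized (pinyin-like) forms of a word segment and a recall word.
print(edit_distance("denglun", "denglen"))  # 1
print(abs(len("denglun") - len("deng")))    # length difference feature: 3
```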
In one example, the language model feature may include the difference between the language model feature of a word segment and that of the first recall word, and the difference between the language model feature of the word segment and that of the second recall word.
For example, feature calculation is performed on the second recall word as follows: according to the article content corresponding to the similar title, word frequency information of the second recall word of each word segment is obtained using a language model for second recall words. The word frequency information may include the number of times the word occurs in the article. Feature calculation is then performed on the second recall word based on this word frequency information. Feature calculation is performed on a word segment as follows: according to the article content corresponding to the word segment, word frequency information of each word segment is obtained using a language model for word segments. The word frequency information may include the number of times the word occurs in the article. Feature calculation is then performed on the word segment based on this word frequency information. Because word frequency information is introduced, the calculated feature vectors of the second recall words and the word segments contain richer information.
In one example, the word vector feature may include similarity features between a word segment and the first recall word, and between the word segment and the second recall word. The word vector feature may also include similarity features between the word segment and its context, between the first recall word and the context, and between the second recall word and the context.
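The word vector similarity features can be sketched with plain cosine similarity. The three-dimensional vectors below are toy values; a real system would use vectors from a trained embedding model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy embeddings for a word segment, a recall word and the title context.
segment_vec = [0.9, 0.1, 0.3]
recall_vec = [0.8, 0.2, 0.4]
context_vec = [0.7, 0.3, 0.3]

# Word-vector features: segment vs. recall word, and each vs. context.
seg_recall_sim = cosine_similarity(segment_vec, recall_vec)
seg_context_sim = cosine_similarity(segment_vec, context_vec)
recall_context_sim = cosine_similarity(recall_vec, context_vec)
```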
In one example, the attribute feature may include the source of the word segment, the first recall word and the second recall word, for example whether it comes from a Baidu encyclopedia entry. The attribute feature may include whether the word segment, the first recall word or the second recall word is a proper noun. The attribute feature may also include whether the word segment, the first recall word and the second recall word are synonyms of one another, and so on.
In one embodiment, as shown in Fig. 6, determining the candidate recall word of each word segment according to the feature calculation result includes:
S510: scoring and ranking the first recall words and the second recall words using a tree model and the GBRank ranking algorithm, according to the feature calculation results of the first recall words and the second recall words;
S520: selecting a candidate recall word from the first recall words and the second recall words according to the scoring and ranking result.
In one embodiment, scoring and ranking the first recall words and the second recall words using the tree model includes:
selecting the first recall words and second recall words whose scores are higher than a preset threshold;
ranking the selected first recall words and second recall words in descending order of score.
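The select-then-rank step of this embodiment can be sketched as below. The scores would come from the tree model (the patent names GBRank); here they are given directly, and the threshold value and word scores are placeholders.

```python
def select_and_rank(scored_words, threshold):
    """Keep recall words scoring above the threshold, highest score first."""
    kept = [(word, score) for word, score in scored_words if score > threshold]
    return sorted(kept, key=lambda ws: ws[1], reverse=True)

# Scores for first and second recall words, as the tree model might
# produce them (values here are illustrative placeholders).
scored = [("current", 0.92), ("swung", 0.35), ("period", 0.61), ("era", 0.18)]
ranked = select_and_rank(scored, threshold=0.5)

# The top-ranked word would be taken as the candidate recall word.
candidate = ranked[0][0] if ranked else None
```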
In one embodiment, performing error correction on the title to be corrected based on the candidate recall word of each word segment includes:
when a word segment in the title to be corrected is inconsistent with its corresponding candidate recall word, replacing the word segment with the candidate recall word.
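A minimal sketch of this replacement step, assuming the title has already been segmented and each correctable segment paired with its candidate recall word (segments and candidates below are hypothetical):

```python
def correct_title(segments, candidates):
    """Replace each segment whose candidate recall word differs from it."""
    corrected = []
    for seg in segments:
        cand = candidates.get(seg)
        # Only replace when a candidate exists and is inconsistent with
        # the segment, as described in the embodiment above.
        corrected.append(cand if cand is not None and cand != seg else seg)
    return corrected

segments = ["stock", "prices", "serge", "today"]
candidates = {"serge": "surge", "today": "today"}  # "today" is already correct
corrected = correct_title(segments, candidates)
```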
In one embodiment, obtaining the first recall word of each word segment in the title to be corrected based on the corpus includes:
segmenting the title to be corrected to obtain each word segment of the title to be corrected;
obtaining a plurality of first recall words of each word segment based on the corpus and the word segments of the title to be corrected.
It should be noted that various word segmentation algorithms can be used to segment the title to be corrected, for example machine learning algorithms based on statistics. Commonly used algorithms include HMM (Hidden Markov Model), CRF (Conditional Random Field) and SVM (Support Vector Machine), as well as deep learning algorithms. Taking CRF as an example, the basic idea is to treat segmentation as a labeling task over Chinese characters; it considers not only the frequency of word occurrence but also the context, and therefore has good learning ability.
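Purely to illustrate what segmentation produces, the sketch below uses simple forward maximum matching against a toy dictionary; this is a much simpler technique than the statistical models (HMM, CRF, SVM) named above, and the dictionary entries are placeholders.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary match."""
    segments, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            # Fall back to a single character when nothing matches.
            if length == 1 or piece in dictionary:
                segments.append(piece)
                i += length
                break
    return segments

# Toy dictionary; real systems learn segmentation from labelled data.
dictionary = {"newyork", "stock", "exchange", "stockexchange"}
segments = forward_max_match("newyorkstockexchange", dictionary, max_len=13)
```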
The first recall words can be obtained through a phrase substitution table, or recalled through pinyin edit distance. In the first approach, aligned segments are mined from parallel corpora, and candidates are recalled according to the mined aligned segments. In the second approach, the segment is converted to pinyin, and candidates with the same or similar pronunciation are then recalled. For example, in the earlier example, the pinyin of the segment glossed as "current period" is "dangqi", so words with the same or similar pronunciation, glossed as "current" and "swung", can be recalled.
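The second, pronunciation-based recall route can be sketched as follows. The pinyin table is a hypothetical stand-in for a real phonetic annotation step, the English keys stand in for the Chinese words of the example, and the vocabulary is illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance, used here on pinyin strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def recall_by_pinyin(segment, pinyin_of, vocabulary, max_dist=2):
    """Recall words whose pinyin equals or is close to the segment's pinyin."""
    target = pinyin_of[segment]
    return [w for w in vocabulary
            if w != segment and edit_distance(pinyin_of[w], target) <= max_dist]

# Hypothetical pinyin annotations mirroring the patent's "dangqi" example.
pinyin_of = {"current_period": "dangqi", "current": "dangqian",
             "swung": "dangqi", "market": "shichang"}
vocabulary = ["current", "swung", "market"]
recalled = recall_by_pinyin("current_period", pinyin_of, vocabulary)
```

The homophone ("swung") and the near-homophone ("current") are recalled, while the phonetically distant word is not.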
In one example, as shown in Fig. 7, when error correction needs to be performed on a news title, the specific error correction process is as follows:
A large number of news media titles (the "text" in Fig. 7) are collected. Part of the text data undergoes global statistics and language model calculation according to the method of the conventional error correction process (the arrow to the right of "text" in Fig. 7); these globally counted features are consistent with the original error correction process. The globally counted features include the first recall words.
A search library is established over the collected news media titles to support retrieval.
The input title to be corrected is retrieved in the search library, yielding a large number of similar news titles resembling the title to be corrected. Relevant local knowledge is then obtained through calculation (mining parallel corpora to supplement the recall candidates, and counting local features such as word frequency and language model features), forming accurate local knowledge. The accurate local knowledge includes the second recall words.
Error correction ranking is performed on the error correction candidates by combining the globally counted features of the conventional error correction process with the accurate local knowledge generated by retrieval, thereby producing the final candidate recall words.
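The retrieval step in this flow, finding titles in the search library that resemble the title to be corrected, might be sketched with a simple token-overlap score. The titles below are placeholders, and a production system would retrieve through an index rather than scanning every title.

```python
def jaccard(a_tokens, b_tokens):
    """Token-overlap similarity between two tokenised titles."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def retrieve_similar(query_tokens, library, top_k=2):
    """Return the top_k most similar titles from the search library."""
    scored = [(jaccard(query_tokens, title), title) for title in library]
    scored.sort(key=lambda st: st[0], reverse=True)
    return [title for score, title in scored[:top_k] if score > 0]

library = [["stocks", "surge", "in", "morning", "trade"],
           ["rain", "expected", "this", "weekend"],
           ["stocks", "serge", "after", "morning", "report"]]
query = ["stocks", "serge", "in", "morning", "trade"]  # contains a typo
similar = retrieve_similar(query, library)
```

The retrieved titles would then supply the aligned segments from which the second recall words are mined.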
In the embodiments of the present invention, the second recall words are obtained from the similar titles retrieved from the preset title search library, which effectively remedies the possible problem of insufficient recall among the first recall words obtained from the corpus. With both the first recall words and the second recall words, the title to be corrected can be corrected comprehensively from different dimensions.
In the approach based on retrieval and context memory, title data similar to the title to be corrected can be obtained by retrieval, and the corresponding candidates are mined from it to generate a dynamically aligned corpus, which solves the possible problem of insufficient recall. The approach based on retrieval and context memory only requires simple statistics and index building over the title data of articles; there is no need to maintain a huge knowledge base, so the workload decreases. Accurate local knowledge of the title to be corrected is obtained through retrieval, without complex operations on the title to be corrected (such as principal component analysis and morphological analysis); only the whole title needs to be retrieved, and the effect is significantly improved.
Fig. 8 shows a structural block diagram of a title error correction device according to an embodiment of the present invention. As shown in Fig. 8, the title error correction device includes:
a first obtaining module 10, configured to obtain, based on a corpus, the first recall word of each word segment in a title to be corrected;
a second obtaining module 20, configured to send a retrieval request to a title search library according to the title to be corrected, so as to obtain from the title search library a similar title resembling the title to be corrected, the title search library performing a data update when receiving the retrieval request;
a third obtaining module 30, configured to obtain the second recall word of each word segment according to the similar title;
a calculation module 40, configured to perform feature calculation separately on the first recall word and the second recall word of each word segment;
a determination module 50, configured to determine the candidate recall word of each word segment according to the feature calculation result;
an error correction module 60, configured to perform error correction on the title to be corrected based on the candidate recall word of each word segment.
In one embodiment, as shown in Fig. 9, the title error correction device further includes:
a construction module 70, configured to construct the title search library based on an inverted index according to existing article resources, an existing article including a title and text content;
a fourth obtaining module 80, configured to obtain the existing article resources when the title search library receives a retrieval request;
an update module 90, configured to update the title search library according to the existing article resources obtained when the title search library receives the retrieval request.
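The construction module's inverted index can be sketched as below: each token maps to the ids of the titles containing it, and the data update performed on a retrieval request simply indexes any newly collected titles. All titles shown are placeholders.

```python
from collections import defaultdict

class TitleSearchLibrary:
    """Toy inverted index over article titles."""

    def __init__(self):
        self.titles = []               # id -> token list
        self.index = defaultdict(set)  # token -> set of title ids

    def add(self, tokens):
        doc_id = len(self.titles)
        self.titles.append(tokens)
        for tok in set(tokens):
            self.index[tok].add(doc_id)

    def update(self, new_titles):
        """Data update performed when a retrieval request arrives."""
        for tokens in new_titles:
            self.add(tokens)

    def retrieve(self, query_tokens):
        """Ids of titles sharing at least one token with the query."""
        hits = set()
        for tok in query_tokens:
            hits |= self.index.get(tok, set())
        return sorted(hits)

lib = TitleSearchLibrary()
lib.update([["stocks", "surge", "today"], ["weather", "warning", "issued"]])
hit_ids = lib.retrieve(["stocks", "serge", "today"])
```

The hits would then be ranked by similarity to yield the similar titles used by the second obtaining module.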
In one embodiment, as shown in Fig. 10, the second obtaining module 20 includes:
a similar title obtaining submodule 21, configured to obtain from the title search library at least one similar title resembling the title to be corrected, according to the text content included in the title to be corrected.
In one embodiment, as shown in Fig. 11, the third obtaining module 30 includes:
an alignment information obtaining submodule 31, configured to obtain alignment information according to the title to be corrected;
a second recall word obtaining submodule 32, configured to obtain the second recall word of each word segment from the similar title according to the alignment information.
In one embodiment, as shown in Fig. 12, the calculation module 40 includes:
a first calculation submodule 41, configured to perform feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the first recall word of each word segment;
a second calculation submodule 42, configured to perform feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the second recall word of each word segment.
In one embodiment, as shown in Fig. 13, the determination module 50 includes:
a ranking submodule 51, configured to score and rank the first recall words and the second recall words using a tree model and the GBRank ranking algorithm, according to the feature calculation results of the first recall words and the second recall words;
a selection submodule 52, configured to select a candidate recall word from the first recall words and the second recall words according to the scoring and ranking result.
In one embodiment, as shown in Fig. 14, the error correction module 60 includes:
an error correction submodule 61, configured to replace a word segment in the title to be corrected with its candidate recall word when the word segment is inconsistent with the candidate recall word.
For the functions of the modules in the devices of the embodiments of the present invention, reference may be made to the corresponding descriptions in the above method, which are not repeated here.
Fig. 15 shows a structural block diagram of a title error correction terminal according to an embodiment of the present invention. As shown in Fig. 15, the terminal includes a memory 910 and a processor 920, the memory 910 storing a computer program executable on the processor 920. The processor 920 implements the title error correction method of the above embodiments when executing the computer program. The number of memories 910 and of processors 920 may each be one or more.
The terminal further includes:
a communication interface 930, configured to communicate with external devices for data transmission.
The memory 910 may include a high-speed RAM memory, and may also include a non-volatile memory, for example at least one magnetic disk memory.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, they can be connected to each other through a bus and communicate with one another. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in Fig. 15, but this does not mean that there is only one bus or only one type of bus.
Optionally, in a specific implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on one chip, they can communicate with one another through an internal interface.
An embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the methods of the above embodiments.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples" and the like means that a specific feature, structure, material or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, provided they do not conflict with each other, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of the different embodiments or examples.
In addition, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance, or as implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means two or more, unless otherwise clearly and specifically defined.
Any process or method description in the flowcharts, or otherwise described herein, can be understood as representing a module, segment or portion of code including one or more executable instructions for implementing a specific logical function or step of the process, and the scope of the preferred embodiments of the present invention includes other implementations in which functions may be executed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.
The logic and/or steps represented in the flowcharts, or otherwise described herein, can for example be considered an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in combination with, an instruction execution system, device or apparatus (such as a computer-based system, a system including a processor, or another system that can fetch and execute instructions from an instruction execution system, device or apparatus). For the purposes of this specification, a "computer-readable medium" can be any apparatus that may contain, store, communicate, propagate or transmit a program for use by, or in combination with, an instruction execution system, device or apparatus. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection portion (electronic device) having one or more wires, a portable computer disk cartridge (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a fiber optic device, and a portable compact disc read-only memory (CDROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program can be printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or, if necessary, otherwise processing it in a suitable manner, and then stored in a computer memory.
It should be understood that each part of the present invention can be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods can be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art can be used: a discrete logic circuit having logic gate circuits for implementing logic functions on data signals, an application-specific integrated circuit with suitable combinational logic gate circuits, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.
Those of ordinary skill in the art can understand that all or part of the steps carried by the above method embodiments can be completed by instructing the relevant hardware through a program, which can be stored in a computer-readable storage medium, and which, when executed, includes one of the steps of the method embodiments or a combination thereof.
In addition, each functional unit in each embodiment of the present invention can be integrated into one processing module; alternatively, each unit can exist physically alone, or two or more units can be integrated into one module. The above integrated module can be implemented in the form of hardware, or in the form of a software functional module. If the integrated module is implemented in the form of a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium. The storage medium can be a read-only memory, a magnetic disk, an optical disc, or the like.
The above description covers only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person familiar with the technical field can readily conceive of various changes or replacements within the technical scope disclosed by the present invention, and these should all be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (16)
1. A title error correction method, characterized by comprising:
obtaining, based on a corpus, a first recall word of each word segment in a title to be corrected;
sending a retrieval request to a title search library according to the title to be corrected, so as to obtain from the title search library a similar title resembling the title to be corrected, the title search library performing a data update when receiving the retrieval request;
obtaining a second recall word of each word segment according to the similar title;
performing feature calculation separately on the first recall word and the second recall word of each word segment;
determining a candidate recall word of each word segment according to the feature calculation result;
performing error correction on the title to be corrected based on the candidate recall word of each word segment.
2. The method according to claim 1, characterized in that constructing the title search library comprises:
constructing the title search library based on an inverted index according to existing article resources, the existing articles including titles and text content;
obtaining the existing article resources when the title search library receives the retrieval request;
updating the title search library according to the existing article resources obtained when the title search library receives the retrieval request.
3. The method according to claim 1, characterized in that obtaining the similar title resembling the title to be corrected from the title search library comprises:
obtaining, from the title search library, at least one similar title resembling the title to be corrected, according to the text content included in the title to be corrected.
4. The method according to claim 1, characterized in that obtaining the second recall word of each word segment according to the similar title comprises:
obtaining alignment information according to the title to be corrected;
obtaining the second recall word of each word segment from the similar title according to the alignment information.
5. The method according to claim 1, characterized in that performing feature calculation separately on the first recall word and the second recall word of each word segment comprises:
performing feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the first recall word of each word segment;
performing feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the second recall word of each word segment.
6. The method according to claim 5, characterized in that determining the candidate recall word of each word segment according to the feature calculation result comprises:
scoring and ranking the first recall words and the second recall words using a tree model and the GBRank ranking algorithm, according to the feature calculation results of the first recall words and the second recall words;
selecting the candidate recall word from the first recall words and the second recall words according to the scoring and ranking result.
7. The method according to claim 6, characterized in that performing error correction on the title to be corrected based on the candidate recall word of each word segment comprises:
replacing a word segment in the title to be corrected with its candidate recall word when the word segment is inconsistent with the candidate recall word.
8. A title error correction device, characterized by comprising:
a first obtaining module, configured to obtain, based on a corpus, a first recall word of each word segment in a title to be corrected;
a second obtaining module, configured to send a retrieval request to a title search library according to the title to be corrected, so as to obtain from the title search library a similar title resembling the title to be corrected, the title search library performing a data update when receiving the retrieval request;
a third obtaining module, configured to obtain a second recall word of each word segment according to the similar title;
a calculation module, configured to perform feature calculation separately on the first recall word and the second recall word of each word segment;
a determination module, configured to determine a candidate recall word of each word segment according to the feature calculation result;
an error correction module, configured to perform error correction on the title to be corrected based on the candidate recall word of each word segment.
9. The device according to claim 8, characterized by further comprising:
a construction module, configured to construct the title search library based on an inverted index according to existing article resources, the existing articles including titles and text content;
a fourth obtaining module, configured to obtain the existing article resources when the title search library receives the retrieval request;
an update module, configured to update the title search library according to the existing article resources obtained when the title search library receives the retrieval request.
10. The device according to claim 8, characterized in that the second obtaining module comprises:
a similar title obtaining submodule, configured to obtain from the title search library at least one similar title resembling the title to be corrected, according to the text content included in the title to be corrected.
11. The device according to claim 8, characterized in that the third obtaining module comprises:
an alignment information obtaining submodule, configured to obtain alignment information according to the title to be corrected;
a second recall word obtaining submodule, configured to obtain the second recall word of each word segment from the similar title according to the alignment information.
12. The device according to claim 8, characterized in that the calculation module comprises:
a first calculation submodule, configured to perform feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the first recall word of each word segment;
a second calculation submodule, configured to perform feature calculation on the behavioral feature, semantic feature, language model feature, word vector feature and attribute feature of the second recall word of each word segment.
13. The device according to claim 12, characterized in that the determination module comprises:
a ranking submodule, configured to score and rank the first recall words and the second recall words using a tree model and the GBRank ranking algorithm, according to the feature calculation results of the first recall words and the second recall words;
a selection submodule, configured to select a candidate recall word from the first recall words and the second recall words according to the scoring and ranking result.
14. The device according to claim 13, characterized in that the error correction module comprises:
an error correction submodule, configured to replace a word segment in the title to be corrected with its candidate recall word when the word segment is inconsistent with the candidate recall word.
15. A title error correction terminal, characterized by comprising:
one or more processors;
a storage device, configured to store one or more programs;
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 7.
16. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910617118.1A CN110134970B (en) | 2019-07-10 | 2019-07-10 | Header error correction method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910617118.1A CN110134970B (en) | 2019-07-10 | 2019-07-10 | Header error correction method and apparatus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110134970A true CN110134970A (en) | 2019-08-16 |
CN110134970B CN110134970B (en) | 2019-10-22 |
Family
ID=67566876
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910617118.1A Active CN110134970B (en) | 2019-07-10 | 2019-07-10 | Header error correction method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110134970B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160013A (en) * | 2019-12-30 | 2020-05-15 | 北京百度网讯科技有限公司 | Text error correction method and device |
CN111931495A (en) * | 2020-07-13 | 2020-11-13 | 上海德拓信息技术股份有限公司 | Corpus fast matching method and error correction method based on dichotomy and editing distance |
CN112000767A (en) * | 2020-07-31 | 2020-11-27 | 深思考人工智能科技(上海)有限公司 | Text-based information extraction method and electronic equipment |
CN112416929A (en) * | 2020-11-17 | 2021-02-26 | 四川长虹电器股份有限公司 | Retrieval library management and data retrieval method based on mysql and java |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040049498A1 (en) * | 2002-07-03 | 2004-03-11 | Dehlinger Peter J. | Text-classification code, system and method |
CN101650742A (en) * | 2009-08-27 | 2010-02-17 | 中兴通讯股份有限公司 | System and method for prompting search condition during English search |
CN104462085A (en) * | 2013-09-12 | 2015-03-25 | 腾讯科技(深圳)有限公司 | Method and device for correcting search keywords |
CN106503033A (en) * | 2016-09-14 | 2017-03-15 | 国网山东省电力公司青岛供电公司 | A kind of single address search method of power distribution network work and device |
CN109543022A (en) * | 2018-12-17 | 2019-03-29 | 北京百度网讯科技有限公司 | Text error correction method and device |
- 2019-07-10: CN application CN201910617118.1A granted as CN110134970B; status: Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040049498A1 (en) * | 2002-07-03 | 2004-03-11 | Dehlinger Peter J. | Text-classification code, system and method |
CN101650742A (en) * | 2009-08-27 | 2010-02-17 | 中兴通讯股份有限公司 | System and method for prompting search condition during English search |
CN104462085A (en) * | 2013-09-12 | 2015-03-25 | 腾讯科技(深圳)有限公司 | Method and device for correcting search keywords |
CN106503033A (en) * | 2016-09-14 | 2017-03-15 | 国网山东省电力公司青岛供电公司 | A kind of single address search method of power distribution network work and device |
CN109543022A (en) * | 2018-12-17 | 2019-03-29 | 北京百度网讯科技有限公司 | Text error correction method and device |
Non-Patent Citations (1)
Title |
---|
WEIXIN_34199405: "Baidu Chinese Error Correction Technology" (百度中文纠错技术), https://blog.csdn.net/weixin_34199405/article/details/89952654 *
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111160013A (en) * | 2019-12-30 | 2020-05-15 | 北京百度网讯科技有限公司 | Text error correction method and device |
CN111160013B (en) * | 2019-12-30 | 2023-11-24 | 北京百度网讯科技有限公司 | Text error correction method and device |
CN111931495A (en) * | 2020-07-13 | 2020-11-13 | 上海德拓信息技术股份有限公司 | Corpus fast matching method and error correction method based on dichotomy and editing distance |
CN112000767A (en) * | 2020-07-31 | 2020-11-27 | 深思考人工智能科技(上海)有限公司 | Text-based information extraction method and electronic equipment |
CN112416929A (en) * | 2020-11-17 | 2021-02-26 | 四川长虹电器股份有限公司 | Retrieval library management and data retrieval method based on mysql and java |
Also Published As
Publication number | Publication date |
---|---|
CN110134970B (en) | 2019-10-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110134970B (en) | Header error correction method and apparatus | |
CN107180045B (en) | Method for extracting geographic entity relation contained in internet text | |
US10102191B2 (en) | Propagation of changes in master content to variant content | |
US8078645B2 (en) | Operations on multi-level nested data structure | |
US8332434B2 (en) | Method and system for finding appropriate semantic web ontology terms from words | |
KR101339103B1 (en) | Document classifying system and method using semantic feature | |
CN106940726B (en) | Creative automatic generation method and terminal based on knowledge network | |
US8078638B2 (en) | Operations of multi-level nested data structure | |
CN107704512B (en) | Financial product recommendation method based on social data, electronic device and medium | |
US20130110839A1 (en) | Constructing an analysis of a document | |
CN106844341B (en) | Artificial intelligence-based news abstract extraction method and device | |
US20100094835A1 (en) | Automatic query concepts identification and drifting for web search | |
US11977589B2 (en) | Information search method, device, apparatus and computer-readable medium | |
US20110295857A1 (en) | System and method for aligning and indexing multilingual documents | |
EP3203383A1 (en) | Text generation system | |
CN108304375A (en) | A kind of information identifying method and its equipment, storage medium, terminal | |
US8983965B2 (en) | Document rating calculation system, document rating calculation method and program | |
US20150006528A1 (en) | Hierarchical data structure of documents | |
CN106663117A (en) | Constructing a graph that facilitates provision of exploratory suggestions | |
CN109726289A (en) | Event detecting method and device | |
CN110162768B (en) | Method and device for acquiring entity relationship, computer readable medium and electronic equipment | |
US10678820B2 (en) | System and method for computerized semantic indexing and searching | |
CN110674365A (en) | Searching method, device, equipment and storage medium | |
CN111753534B (en) | Identifying sequence titles in a document | |
CN114141384A (en) | Method, apparatus and medium for retrieving medical data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||