CN103116752A - Picture auditing method and system - Google Patents

Picture auditing method and system

Info

Publication number
CN103116752A
Authority
CN
China
Prior art keywords
candidate word
character
probability
text
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013100587586A
Other languages
Chinese (zh)
Inventor
郝双
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sina Technology China Co Ltd
Original Assignee
Sina Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sina Technology China Co Ltd filed Critical Sina Technology China Co Ltd
Priority to CN2013100587586A priority Critical patent/CN103116752A/en
Publication of CN103116752A publication Critical patent/CN103116752A/en
Pending legal-status Critical Current

Abstract

The invention discloses a picture auditing method and system. The method comprises the steps of performing optical character recognition (OCR) processing on text pictures, and extracting text information in the text pictures; performing keyword matching on the extracted text information, and judging whether the text information contains keywords needing filtering; and if the text information contains the keywords needing filtering, performing filtering processing on the text pictures. Due to the fact that the text information in the text pictures is extracted and auditing of the keywords needing filtering is performed on the text pictures according to the extracted text information, the goal of auditing the text pictures can be achieved.

Description

Picture auditing method and system
Technical field
The present invention relates to image processing technology, and in particular to a picture auditing method and system.
Background art
The rise of the Internet has given people access to far more information: it has brought together information from around the world, broadened the channels through which people obtain information, and widened the range over which they can search. At present, pictures on the domestic Internet are generally audited manually. Manual auditing, however, involves a heavy workload, low efficiency, and high cost, and its accuracy is affected by uncertain factors such as lighting and auditor fatigue.
There are also some dedicated picture auditing systems. These mainly use image matching to compare the similarity between a pending picture and the pictures in an audit library, and reject (filter out) pictures whose similarity is high. The main flow of this method, shown in Fig. 1, comprises the following steps:
S101: Extract features from the pending picture.
S102: Compare the extracted features with the features in the audit picture feature library.
S103: Reject (filter out) any pending picture whose similarity in the comparison exceeds a threshold.
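As a rough illustration of steps S101 to S103, the sketch below compares a feature vector against a library and rejects on a threshold. The source does not say which features or which similarity measure the prior-art systems use; the cosine similarity, the function names, and the 0.9 threshold are assumptions made only for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def should_reject(features, library, threshold=0.9):
    """S102/S103: reject if any library entry is more similar than the threshold."""
    return any(cosine(features, ref) > threshold for ref in library)
```

A library entry that is nearly parallel to the pending picture's feature vector triggers rejection; an orthogonal one does not.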
The prior-art picture auditing method concentrates on graphic images made up of elements such as color blocks, lines, and shapes. Different pictures differ considerably in these elements, so image features computed from them can be used to tell pictures apart.
For a text picture, however, the image elements are arranged differently: globally they form full-width rows of stripes, and locally they differ only as dense dot patterns at the pixel level. Different text pictures therefore show no obvious visual difference, and computing image features cannot distinguish them.
Moreover, auditing a text picture mainly means auditing the information carried by its text content. If one still built an audit picture library and filtered by similarity comparison, the number of pictures the library would need is enormous and could hardly be exhaustive. This is another reason why text pictures are unsuitable for auditing by image feature matching.
Therefore, the prior-art picture auditing method is not suitable for auditing text pictures.
Summary of the invention
Embodiments of the present invention provide a picture auditing method and system for auditing text pictures.
According to one aspect of the present invention, a picture auditing method is provided, comprising:
performing OCR processing on a text picture to extract the text information in the text picture;
performing keyword matching on the extracted text information to judge whether it contains keywords to be filtered; and, if so, performing filtering processing on the text picture.
Further, before performing OCR processing on the text picture to extract the text information in the text picture, the method also comprises:
performing binarization processing on the text picture.
Further, before performing binarization processing on the text picture, the method also comprises:
performing grayscale processing on the text picture.
Further, before performing OCR processing on the text picture to extract the text information in the text picture, the method also comprises: performing noise removal processing on the text picture.
Wherein performing OCR processing on the text picture to extract the text information in the text picture specifically comprises:
performing character segmentation on the image of the text picture;
dividing the characters segmented from the text picture according to a setting unit, and identifying the characters in each setting unit as follows:
after performing feature extraction and feature matching on each character in the setting unit, determining the candidate words of each character;
for each character in the setting unit, determining the similarity of each candidate word of the character, and the transition probability between each candidate word of the character and the candidate words of the characters adjacent to the character;
determining the recognition result of the characters in the setting unit according to the determined similarities and transition probabilities;
determining the text information in the text picture according to the recognition results of the characters in each setting unit.
Wherein determining the recognition result of the characters in the setting unit according to the determined similarities and transition probabilities specifically comprises:
taking the similarity of each candidate word of the 1st character in the setting unit as the Viterbi probability of that candidate word;
starting from the 2nd character in the setting unit, for each candidate word of the current character, determining the Viterbi probability between that candidate word and each candidate word of the preceding character, according to the similarity of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character, the preceding character being one of the characters adjacent to the current character;
after determining the Viterbi probabilities between the current candidate word and each preceding candidate word, comparing these Viterbi probabilities and selecting the largest as the Viterbi probability of the current candidate word; wherein the current candidate word is one of the candidate words of the current character, and a preceding candidate word is one of the candidate words of the preceding character;
taking the current candidate word as the current node, and selecting the preceding candidate word with the largest Viterbi probability with respect to the current candidate word as the preceding node adjacent to the current candidate word;
determining candidate paths; wherein the nodes in a candidate path are candidate words selected for the respective characters in the setting unit, and adjacent nodes in the same candidate path are determined from the preceding node of each candidate word;
comparing the Viterbi probabilities of the final nodes of the candidate paths, and taking the candidate path whose final node has the largest Viterbi probability as the recognition result.
Wherein the Viterbi probability between each candidate word of the current character and each candidate word of the preceding character is determined, according to the similarity of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character, specifically by the following Formula 5 or Formula 4:
$P_v = P_2 \times R \times P_v'$ (Formula 5)
$\log P_v = b \log P_2 + c \log R + d \log P_v'$ (Formula 4)
where $P_v$ is the Viterbi probability between the current candidate word and the preceding candidate word; $P_1$ is the probability of occurrence of the current candidate word; $P_2$ is the transition probability between the preceding candidate word and the current candidate word; $R$ is the similarity of the current candidate word; $P_v'$ is the Viterbi probability of the preceding candidate word; $\log P_v$, $\log P_1$, $\log P_2$, $\log R$, $\log P_v'$ are the logarithms of $P_v$, $P_1$, $P_2$, $R$, $P_v'$ respectively; and $b$, $c$, $d$ are preset weights.
Wherein the recognition result is also determined according to the probability of occurrence of each candidate word of each character in the setting unit; and
determining the recognition result of the characters in the setting unit according to the determined similarities and transition probabilities, and according to the probability of occurrence of each candidate word of each character in the setting unit, specifically comprises:
for each candidate word of the 1st character in the setting unit, determining its Viterbi probability according to the similarity of the candidate word and/or the probability of occurrence of the candidate word;
starting from the 2nd character in the setting unit, for each candidate word of the current character, determining the Viterbi probability between that candidate word and each candidate word of the preceding character, according to the similarity and probability of occurrence of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character, the preceding character being one of the characters adjacent to the current character;
after determining the Viterbi probabilities between the current candidate word and each preceding candidate word, comparing these Viterbi probabilities and selecting the largest as the Viterbi probability of the current candidate word; wherein the current candidate word is one of the candidate words of the current character, and a preceding candidate word is one of the candidate words of the preceding character;
taking the current candidate word as the current node, and selecting the preceding candidate word with the largest Viterbi probability with respect to the current candidate word as the preceding node adjacent to the current candidate word;
determining candidate paths; wherein the nodes in a candidate path are candidate words selected for the respective characters in the setting unit, and adjacent nodes in the same candidate path are determined from the preceding node of each candidate word;
comparing the Viterbi probabilities of the final nodes of the candidate paths, and taking the candidate path whose final node has the largest Viterbi probability as the recognition result.
According to another aspect of the present invention, a picture auditing system is also provided, comprising:
a text information extraction module, configured to perform OCR processing on a text picture and extract the text information in the text picture;
a filtering module, configured to perform keyword matching on the text information extracted by the text information extraction module and judge whether it contains keywords to be filtered; and, if so, to perform filtering processing on the text picture.
Further, the system also comprises a preprocessing module;
the preprocessing module is configured to preprocess the text picture and output the preprocessed text picture to the text information extraction module; wherein
the preprocessing module specifically comprises: a binarization unit configured to perform binarization processing on the text picture; or
the preprocessing module specifically comprises: a grayscale unit configured to perform grayscale processing on the text picture and output the result, and a binarization unit configured to perform binarization processing on the text picture output by the grayscale unit; or
the preprocessing module specifically comprises: a grayscale unit configured to perform grayscale processing on the text picture and output the result; a binarization unit configured to perform binarization processing on the text picture output by the grayscale unit and output the result; and a noise removal unit configured to perform noise removal processing on the text picture output by the binarization unit.
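The module arrangement described above can be sketched as a small class that chains an optional preprocessing step, a text extraction step, and a filtering step. This is only an illustration under stated assumptions: the class name `PictureAuditSystem`, the `Picture` type, and the idea of passing the OCR step in as a callable are hypothetical and do not come from the source.

```python
from typing import Callable, Iterable, List, Optional

Picture = List[List[int]]  # hypothetical picture type: rows of pixel values

class PictureAuditSystem:
    """Chains preprocessing, text extraction, and keyword filtering."""

    def __init__(
        self,
        extract_text: Callable[[Picture], str],        # text information extraction module
        keywords: Iterable[str],                       # keywords to be filtered
        preprocess: Optional[Callable[[Picture], Picture]] = None,  # optional preprocessing module
    ):
        self.extract_text = extract_text
        self.keywords = list(keywords)
        self.preprocess = preprocess

    def should_filter(self, picture: Picture) -> bool:
        """Return True if the picture contains a keyword to be filtered."""
        if self.preprocess is not None:
            picture = self.preprocess(picture)
        text = self.extract_text(picture)
        # Filtering module: plain substring matching against the keyword list.
        return any(keyword in text for keyword in self.keywords)
```

Keeping the OCR step behind a callable means any extraction method can be plugged in without changing the filtering logic.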
Because the embodiments of the present invention extract the text information in a text picture and audit the picture for keywords to be filtered according to the extracted text information, the purpose of auditing text pictures can be achieved.
In addition, during character recognition, for the multiple candidate words of a character, the embodiments of the present invention select one candidate word as the recognition result of the character not only according to the similarity of the candidate words (that is, glyph information) but also according to the transition probability between adjacent candidate words (that is, semantic information). Thus, besides the similarity between a candidate word and the character, the degree of association between the candidate word and the adjacent characters is also taken into account, and considering both factors can greatly improve the accuracy of character recognition.
Further, the probability of occurrence of the candidate words can also be used in deciding the recognition result, further ensuring the accuracy of character recognition.
Further, determining multiple candidate paths by calculating Viterbi probabilities, as in the present invention, is a preferred way of using the association between characters as a reference in deciding the recognition result, further ensuring the accuracy of character recognition.
Description of drawings
Fig. 1 is a flow chart of the prior-art picture auditing method;
Fig. 2a is a flow chart of the picture auditing method of an embodiment of the present invention;
Fig. 2b is a flow chart of the character recognition method of an embodiment of the present invention;
Fig. 3 is a flow chart of the method for determining the recognition result of the characters in a setting unit according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of the characters obtained with a line of text as the setting unit, and the candidate words of each character, according to an embodiment of the present invention;
Fig. 5 is a flow chart of the method for determining the recognition result of the characters in a setting unit according to the determined similarities and transition probabilities, according to an embodiment of the present invention;
Fig. 6 is a block diagram of the internal structure of the picture auditing system of an embodiment of the present invention.
Detailed description
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below with reference to the accompanying drawings and preferred embodiments. It should be noted, however, that many of the details listed in the specification are given only to give the reader a thorough understanding of one or more aspects of the present invention; these aspects can also be realized without these specific details.
Terms such as "module" and "system" used in this application are intended to include computer-related entities, such as but not limited to hardware, firmware, a combination thereof, software, or software in execution. For example, a module can be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself can be modules. One or more modules can reside within a process and/or thread of execution, and a module can be located on one computer and/or distributed between two or more computers.
The inventor of the present invention considers that, for a text picture, auditing focuses on the text content of the picture. The text information in the picture can therefore be extracted and examined in order to audit the text picture. The present invention accordingly provides a picture auditing method and system based on text information extraction for auditing text pictures.
The technical solutions of the embodiments of the present invention are described in detail below with reference to the accompanying drawings. In these technical solutions, the method flow for auditing a text picture, shown in Fig. 2a, comprises the following steps:
S211: Perform OCR processing on the text picture to extract the text information in the text picture.
Preferably, before OCR (Optical Character Recognition) processing is performed on the text picture to extract its text information, the text picture may also be preprocessed so that the text information can be extracted more reliably. The preprocessing includes grayscale processing and binarization processing of the text picture, and may also include noise removal. Noise removal may be performed before the grayscale processing, or before or after the binarization processing; it may be performed once or several times, depending on the picture quality and the specific circumstances.
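A minimal sketch of the three preprocessing operations follows, written in plain Python on nested lists so that it needs no image library. The source names the operations but not their algorithms: the luminance weights (the common 0.299/0.587/0.114 weighting), the fixed threshold of 128, and the "drop isolated ink pixels" noise rule are all assumptions chosen for illustration.

```python
def to_gray(rgb):
    """Grayscale processing: weighted sum of the R, G, B channels."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb]

def binarize(gray, threshold=128):
    """Binarization: 1 marks ink (dark pixels), 0 marks background."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]

def remove_isolated(binary):
    """Noise removal: clear ink pixels that have no ink among their 8 neighbours."""
    height, width = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for y in range(height):
        for x in range(width):
            if not binary[y][x]:
                continue
            # Sum the 3x3 window (clipped at the borders), minus the pixel itself.
            neighbours = sum(
                binary[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ) - 1
            if neighbours == 0:
                out[y][x] = 0
    return out
```

Here noise removal runs after binarization, which is one of the orderings the text allows; the same function could be applied more than once for noisier pictures.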
S212: Perform keyword matching on the extracted text information and judge whether it contains keywords to be filtered; if so, execute step S213 and perform filtering processing on the text picture; otherwise, execute step S214 and retain the text picture.
Specifically, the keywords to be filtered (also called sensitive words) are stored in a filter keyword dictionary. The extracted text information is matched against the keywords in the filter keyword dictionary to judge whether the extracted text information contains any keyword to be filtered; if it does, step S213 is executed and the text picture is filtered; otherwise, step S214 is executed and the text picture is retained.
S213: Perform filtering processing on the text picture.
Specifically, if the extracted text information contains a keyword to be filtered, the picture fails the audit and filtering processing is performed on it.
S214: Retain the text picture.
Specifically, if the extracted text information contains no keyword to be filtered, the picture passes the audit and is retained.
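Steps S212 to S214 amount to a lookup against the filter keyword dictionary followed by a two-way decision. The sketch below assumes plain substring matching and returns the matched keywords alongside the decision; the function name and the "filter"/"keep" labels are hypothetical.

```python
def audit_text(text, filter_keywords):
    """S212: match the extracted text against the keywords to be filtered.

    Returns ("filter", hits) for S213 or ("keep", []) for S214.
    """
    hits = [keyword for keyword in filter_keywords if keyword in text]
    return ("filter", hits) if hits else ("keep", [])
```

Returning the matched keywords is a design choice for traceability; the method itself only needs the yes/no decision.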
In step S211 above, those skilled in the art may use various methods to perform OCR processing on the text picture and extract its text information. One specific method provided in the embodiments of the present invention has the flow shown in Fig. 2b and comprises the following steps:
S200: Perform character segmentation on the image in the text picture.
S201: Divide the characters segmented from the text picture according to a setting unit.
The input text picture may contain multiple paragraphs and multiple lines of text. In the present invention, the characters in the text picture are divided according to the setting unit and processed in batches; that is, each round of processing identifies the characters within one setting unit.
Those skilled in the art can choose the setting unit according to the actual situation. For example, the setting unit may be a line of text, so that the characters in the same line of the text picture belong to the same setting unit;
or the setting unit may be a paragraph, so that the characters in the same paragraph of the text picture belong to the same setting unit;
or the setting unit may be a fixed number of characters; for example, with a setting unit of 10 characters, every 10 characters in the text picture are grouped into one setting unit.
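Two of the division schemes above can be sketched in a few lines. The segmented characters are represented here as ordinary Python strings and lists purely for illustration; in practice each element would be a segmented character image. The function names are hypothetical.

```python
def units_by_line(lines):
    """One setting unit per line of text (empty lines are skipped)."""
    return [list(line) for line in lines if line]

def units_by_count(chars, size=10):
    """One setting unit per fixed number of characters (the last may be shorter)."""
    return [chars[i:i + size] for i in range(0, len(chars), size)]
```

With 23 characters and a size of 10, the second function yields units of 10, 10, and 3 characters.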
S202: Identify the characters in each setting unit.
The setting units are processed one after another in order, and the characters in each setting unit are identified. Fig. 3 shows the method flow for determining the recognition result of the characters in one setting unit, which comprises the following steps:
S301: After performing feature extraction and feature matching on each character in the setting unit, determine the candidate words of each character.
Feature extraction and feature matching on a character, and the determination of several candidate words for the character, can use methods commonly used in the prior art; these are well known to those skilled in the art and are not repeated here.
S302: For each character in the setting unit, determine the similarity of each candidate word of the character, and the transition probability between each candidate word of the character and the candidate words of the characters adjacent to the character.
After the candidate words of a character are determined, the similarity of each candidate word, that is, the degree of similarity between the candidate word and the character, can also be determined.
After the candidate words of a character are determined, the transition probability between each candidate word of the character and the candidate words of the adjacent characters can also be determined. For ease of description, a candidate word of an adjacent character is called an adjacent candidate word here, so the transition probability above is the transition probability between adjacent candidate words. It refers to the probability that the adjacent candidate words occur together.
For example, as shown in Fig. 4, taking a line of text as the setting unit yields 9 characters, numbered 1 to 9. Each candidate word is a single Chinese character, given here by its English gloss. The candidate words of characters 1 to 9 and the similarity of each candidate word (the value in parentheses) are as follows:
Candidate words of the 1st character: "in" (0.9);
Candidate words of the 2nd character: "state" (0.8), "group" (0.6);
Candidate words of the 3rd character: "fortune" (0.9);
Candidate words of the 4th character: "moving" (0.8), "strength" (0.8);
Candidate words of the 5th character: "member" (0.8);
Candidate words of the 6th character: "become" (0.8);
Candidate words of the 7th character: "achievement" (0.9);
Candidate words of the 8th character: "happiness" (0.9);
Candidate words of the 9th character: "people" (0.9), "enter" (0.9).
The transition probabilities between each candidate word and the candidate words of the adjacent preceding character, that is, the transition probabilities between adjacent candidate words, are as follows after taking logarithms (each label is the gloss of the two-character combination):
China: -0.5644877; Middle group: -5.6734289; State's fortune: -2.864447; Group's fortune: -3.303452; Motion: -0.7526801; Fortune strength: -3.527933; Mobilize: -1.370795; Strength unit: -2.221847; The member becomes: -2.667307; Achievement: -1.386276; Achievement happiness: -2.938662; Gratifying: -1.630958; Happiness enters: -3.583296.
It can be seen that, for the candidate word "state", the logarithm of the transition probability between it and the adjacent preceding candidate word "in" is -0.5644877; for the candidate word "group", the logarithm of the transition probability between it and the adjacent preceding candidate word "in" is -5.6734289. The transition probability between "in" and "state" is greater than that between "in" and "group", which means that "China" ("in" followed by "state") is more likely to occur than "Middle group" ("in" followed by "group").
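The source does not say how the transition probabilities are obtained. One common way to get such values, shown here purely as an assumption, is to count adjacent character pairs in a reference corpus and take the logarithm of the conditional frequency of the second character given the first. The same pass can also yield a per-character frequency. The function name and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def estimate_log_probs(corpus):
    """Return (log occurrence probabilities, log transition probabilities)."""
    chars = Counter()   # how often each character occurs
    left = Counter()    # how often each character is followed by another
    pairs = Counter()   # how often two characters occur side by side
    for sentence in corpus:
        chars.update(sentence)
        for prev, cur in zip(sentence, sentence[1:]):
            left[prev] += 1
            pairs[(prev, cur)] += 1
    total = sum(chars.values())
    log_occ = {ch: math.log(n / total) for ch, n in chars.items()}
    # Conditional estimate: count(prev, cur) / count(prev as a left neighbour).
    log_trans = {pair: math.log(n / left[pair[0]]) for pair, n in pairs.items()}
    return log_occ, log_trans

log_occ, log_trans = estimate_log_probs(["abab", "abc"])
```

Pairs that never occur in the corpus are simply absent from the result; a real system would need some smoothing, which is outside what the source describes.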
S303: Determine the recognition result of the characters in the setting unit according to the determined similarities and transition probabilities.
More preferably, in this step the recognition result of the characters in the setting unit can also be determined according to the probability of occurrence of each candidate word of each character in the setting unit. The probability of occurrence of a candidate word refers to the statistically obtained probability that the candidate word is used.
After the recognition result of the characters in each setting unit is determined, the text information of the text picture is determined from these recognition results.
The specific flow for determining the recognition result of the characters in the setting unit according to the determined similarities and transition probabilities is shown in Fig. 5 and comprises the following steps:
S501: Calculate the Viterbi probability of each candidate word of each character in the setting unit.
The Viterbi probability of a candidate word of the 1st character in the setting unit can be determined in any of the following ways:
taking the probability of occurrence of the candidate word as its Viterbi probability;
or taking the similarity of the candidate word as its Viterbi probability;
or determining the Viterbi probability from both the similarity and the probability of occurrence of the candidate word, for example by taking the product of the two as the Viterbi probability.
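The three options for the 1st character can be written as one small function. The function name and the `mode` labels are hypothetical; the three branches correspond directly to the three options listed above.

```python
def initial_viterbi(similarity, occurrence=None, mode="similarity"):
    """Viterbi probability of a candidate word of the 1st character."""
    if mode == "occurrence":
        return occurrence               # probability of occurrence only
    if mode == "similarity":
        return similarity               # similarity only
    if mode == "product":
        return similarity * occurrence  # similarity times probability of occurrence
    raise ValueError(f"unknown mode: {mode}")
```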
Starting from the 2nd character in the setting unit, for each candidate word of the current character, the Viterbi probability between that candidate word and each candidate word of an adjacent character is determined from the similarity of the candidate word and the transition probability between the candidate word and the candidate words of the adjacent character. The characters adjacent to the current character include the preceding character and the following character, so the Viterbi probability may be calculated between the candidate words of the current character and those of the preceding character, or between the candidate words of the current character and those of the following character.
The embodiments of the present invention describe the scheme in detail using the Viterbi probability between the candidate words of the current character and the candidate words of the preceding character as an example.
Starting from the 2nd character in the setting unit, for each candidate word of the current character, the Viterbi probability between that candidate word and each candidate word of the preceding character is determined from the similarity and probability of occurrence of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character. It can be calculated according to Formula 1, Formula 2, or Formula 3 below:
$P_v = P_1 \times P_2 \times R \times P_v'$ (Formula 1)
In Formula 1, $P_v$ is the Viterbi probability between the current candidate word and the preceding candidate word, where the current candidate word is one of the candidate words of the current character and the preceding candidate word is one of the candidate words of the preceding character; $P_1$ is the probability of occurrence of the current candidate word; $P_2$ is the transition probability between the preceding candidate word and the current candidate word; $R$ is the similarity of the current candidate word; $P_v'$ is the Viterbi probability of the preceding candidate word.
$\log P_v = \log P_1 + \log P_2 + \log R + \log P_v'$ (Formula 2)
In Formula 2, $\log P_v$, $\log P_1$, $\log P_2$, $\log R$, $\log P_v'$ are the logarithms of $P_v$, $P_1$, $P_2$, $R$, $P_v'$ respectively.
$\log P_v = a \log P_1 + b \log P_2 + c \log R + d \log P_v'$ (Formula 3)
In Formula 3, $a$, $b$, $c$, $d$ are preset weights that those skilled in the art can set according to the actual situation. In particular, if $a = 0$, Formula 3 reduces to Formula 4:
$\log P_v = b \log P_2 + c \log R + d \log P_v'$ (Formula 4)
Formula 4 shows that the Viterbi probability between the current candidate word and the preceding candidate word can be determined from only the similarity of the current candidate word and the transition probability between the current candidate word and the preceding candidate word; that is, the Viterbi probability calculated by Formula 4 does not take the probability of occurrence of the current candidate word into account.
If $b = 1$, $c = 1$, $d = 1$ are set in Formula 4, Formula 4 is equivalent to Formula 5:
$P_v = P_2 \times R \times P_v'$ (Formula 5)
That is, starting from the 2nd character in the setting unit, for each candidate word of the current character, the Viterbi probability between that candidate word and each candidate word of the preceding character, determined from the similarity of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character, can be calculated according to Formula 4 or Formula 5 above.
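The relationship between the five formulas can be checked numerically. The sketch below implements Formula 3 as a single function; with all weights equal to 1 it gives Formula 2 (the logarithm of Formula 1), with `a=0` it gives Formula 4, and with `a=0` and the other weights equal to 1 it gives the logarithm of Formula 5. The function name and the sample values are hypothetical, and the natural logarithm is an assumption, since the source does not state the base.

```python
import math

def viterbi_log_score(p1, p2, r, prev_log_score, a=1.0, b=1.0, c=1.0, d=1.0):
    """Formula 3: log Pv = a*log P1 + b*log P2 + c*log R + d*log Pv'."""
    return (a * math.log(p1) + b * math.log(p2)
            + c * math.log(r) + d * prev_log_score)

# Hypothetical sample values: occurrence, transition, similarity, previous Pv'.
p1, p2, r, prev = 0.01, 0.5, 0.8, 0.9

formula2 = viterbi_log_score(p1, p2, r, math.log(prev))        # a = b = c = d = 1
formula4 = viterbi_log_score(p1, p2, r, math.log(prev), a=0.0)  # occurrence ignored
```

Working in log space turns the products of Formula 1 and Formula 5 into sums, which avoids numerical underflow when a setting unit contains many characters.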
After the Viterbi probability between front candidate word, relatively each Viterbi probability, therefrom select maximum Viterbi probability as the Viterbi probability of current candidate word at definite current candidate word and each; And with current candidate word as present node, select with current candidate word between the Viterbi maximum probability front candidate word conduct adjacent with this current candidate word at front nodal point.
S502: Determine candidate paths according to the computed Viterbi probability of each candidate word.
Several candidate paths are determined according to the preceding node determined for each candidate word. Each node in a candidate path is a candidate word selected for one character in the set unit, and adjacent nodes in the same candidate path are determined according to the preceding node of each candidate word.
For example, for the candidate words of the characters shown in Fig. 4, two candidate paths can be determined by the above method (each listed from the final node back to the first node):
Candidate path one: 人-喜-绩-成-员-动-运-国-中;
Candidate path two: 入-喜-绩-成-员-动-运-国-中.
S503: Select one candidate path as the recognition result.
In this step, the Viterbi probabilities of the final nodes of the candidate paths are compared, and the candidate path whose final node has the largest Viterbi probability is taken as the recognition result.
For example, for candidate path one and candidate path two above, the Viterbi probability of the final node "人" (person) of candidate path one is greater than the Viterbi probability of the final node "入" (enter) of candidate path two. Candidate path one is therefore chosen as the recognition result, so the recognition result for the characters numbered 1-9 in Fig. 4 is: 中国运动员成绩喜人 ("The results of the Chinese athletes are gratifying").
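Steps S501 to S503 can be sketched as a small Viterbi decoder that scores each candidate with Formula 5, records the preceding node of every candidate, and backtracks from the final node with the largest Viterbi probability. This is only an illustrative sketch: the function name, the data layout, the fallback probability for unseen transitions, the candidate "续", and all numeric values are assumptions and do not come from the patent or from Fig. 4.

```python
def viterbi_decode(similarity, transition, unseen=1e-6):
    """Pick the best candidate path through one set unit.

    similarity -- list with one dict per character: candidate word -> R
    transition -- dict: (preceding candidate, current candidate) -> P_2
    unseen     -- hypothetical fallback for transitions missing from the table
    """
    # 1st character: the Viterbi probability is the similarity itself.
    scores = dict(similarity[0])
    back_pointers = []
    for candidates in similarity[1:]:
        new_scores, pointers = {}, {}
        for cur, r in candidates.items():
            best_prev, best = None, -1.0
            for prev, prev_viterbi in scores.items():
                # Formula 5: P_v = P_2 * R * P_v'
                p_v = transition.get((prev, cur), unseen) * r * prev_viterbi
                if p_v > best:
                    best_prev, best = prev, p_v
            new_scores[cur] = best      # largest value becomes the node's score
            pointers[cur] = best_prev   # preceding node of this candidate
        back_pointers.append(pointers)
        scores = new_scores
    # S503: choose the final node with the largest Viterbi probability.
    last = max(scores, key=scores.get)
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return "".join(path), scores[last]

# Hypothetical three-character lattice: "入" looks more similar than "人",
# but the transition probability from "喜" favours "人".
similarity = [{"绩": 0.8, "续": 0.7}, {"喜": 0.9}, {"人": 0.6, "入": 0.7}]
transition = {("绩", "喜"): 0.05, ("续", "喜"): 0.01,
              ("喜", "人"): 0.3, ("喜", "入"): 0.01}
text, score = viterbi_decode(similarity, transition)
print(text, score)
```

With these made-up numbers the decoder returns 绩喜人: the transition probability outweighs the slightly higher similarity of "入", which is the effect the description attributes to combining font information with semantic information.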
The path candidate of selecting has thus considered font information (similarity) and semantic information (transition probability), and synthesis result is maximal value, has higher accuracy rate than the prior art of only considering font information (similarity).
A picture auditing system provided by an embodiment of the present invention, as shown in Fig. 6, comprises a text information extraction module 601 and a filtering module 602.
The text information extraction module 601 is configured to perform OCR processing on a text picture and extract the text information in the text picture. One concrete method by which the text information extraction module 601 extracts the text information is described in detail in the steps of Fig. 2b, Fig. 3, and Fig. 5 above and is not repeated here; those skilled in the art may also use other methods to extract the text information from the text picture.
The filtering module 602 is configured to perform keyword matching on the text information extracted by the text information extraction module 601 and to judge whether it contains keywords to be filtered. If so, the text picture is filtered; otherwise, the text picture is retained.
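The decision made by the filtering module can be sketched as follows. The patent does not specify how keywords are matched, so plain substring matching is used here purely as an assumption; the function name, the return values "filter" and "keep", and the sample keyword are likewise hypothetical.

```python
def audit_text(extracted_text, keywords):
    """Return "filter" if the OCR text contains any keyword, else "keep".

    Substring matching is an assumption; empty keywords are ignored so that
    an empty string never matches every picture.
    """
    hit = any(kw and kw in extracted_text for kw in keywords)
    return "filter" if hit else "keep"

# Hypothetical usage with a made-up keyword list.
print(audit_text("limited offer: cheap pills", ["cheap pills"]))
print(audit_text("results of the athletes", ["cheap pills"]))
```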
Further, the picture auditing system may also comprise a preprocessing module 603.
The preprocessing module 603 is configured to preprocess the text picture and output the preprocessed text picture to the text information extraction module 601. After receiving the text picture output by the preprocessing module 603, the text information extraction module 601 performs OCR processing on the received text picture and extracts the text information in it.
The preprocessing module 603 specifically comprises: a binarization unit configured to binarize the text picture. The binarization unit outputs the binarized text picture to the text information extraction module 601.
Alternatively, the preprocessing module 603 specifically comprises: a grayscale unit configured to convert the text picture to grayscale and output it; and a binarization unit configured to binarize the text picture output by the grayscale unit. The binarization unit outputs the binarized text picture to the text information extraction module 601.
Alternatively, the preprocessing module 603 specifically comprises: a grayscale unit configured to convert the text picture to grayscale and output it; a binarization unit configured to binarize the text picture output by the grayscale unit and output the result; and a noise removal unit configured to remove noise from the text picture output by the binarization unit. The noise removal unit outputs the denoised text picture to the text information extraction module 601.
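The third variant of the preprocessing module (grayscale, then binarization, then noise removal) can be sketched in plain Python. The specific choices below are assumptions that this passage does not specify: standard luminance weights for the grayscale conversion, a fixed threshold of 128 for binarization, and removal of isolated foreground pixels as the noise filter. All function names are hypothetical.

```python
def to_gray(rgb_image):
    """Grayscale unit: luminance-weighted average of each (r, g, b) pixel."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
            for row in rgb_image]

def binarize(gray_image, threshold=128):
    """Binarization unit: 1 marks dark (text) pixels, 0 marks background."""
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_image]

def remove_noise(binary_image):
    """Noise removal unit: clear foreground pixels with no foreground
    neighbour in the surrounding 3x3 window."""
    height, width = len(binary_image), len(binary_image[0])
    cleaned = [row[:] for row in binary_image]
    for y in range(height):
        for x in range(width):
            if binary_image[y][x] != 1:
                continue
            window_sum = sum(binary_image[ny][nx]
                             for ny in range(max(0, y - 1), min(height, y + 2))
                             for nx in range(max(0, x - 1), min(width, x + 2)))
            if window_sum - 1 == 0:  # subtract the pixel itself
                cleaned[y][x] = 0
    return cleaned

def preprocess(rgb_image):
    """Chain the three units in the order described for module 603."""
    return remove_noise(binarize(to_gray(rgb_image)))

# Hypothetical 3x4 picture: a two-pixel stroke and one isolated speck.
W, K = (255, 255, 255), (0, 0, 0)
picture = [[K, K, W, W],
           [W, W, W, W],
           [W, W, W, K]]
print(preprocess(picture))
```

The stroke in the top-left corner survives because its two pixels are neighbours of each other, while the isolated dark pixel in the bottom-right corner is treated as noise and cleared.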
Because the embodiment of the present invention extracts the text information from the text picture and audits the text picture for keywords to be filtered according to the extracted text information, it achieves the purpose of auditing text pictures.
In addition, during character recognition, for the multiple candidate words of a character, the embodiment of the present invention selects one candidate word as the recognition result of the character not only according to the similarity of the candidate words (i.e., font information) but also according to the transition probability between adjacent candidate words (i.e., semantic information). Thus, besides the similarity between a candidate word and the character, the degree of association between the candidate word and the adjacent characters is also considered, and taking both factors into account can greatly improve the accuracy of character recognition.
Further, the probability of occurrence of the candidate words can also be used in deciding the recognition result, which further ensures the accuracy of character recognition.
Further, determining multiple candidate paths by computing Viterbi probabilities, as in the present invention, is a preferred way of using the associations between characters as a reference for deciding the recognition result, and further ensures the accuracy of character recognition.
Those of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium such as ROM/RAM, a magnetic disk, or an optical disc.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims (10)

1. A picture auditing method, characterized by comprising:
performing optical character recognition (OCR) processing on a text picture, and extracting the text information in the text picture;
performing keyword matching on the extracted text information, and judging whether it contains keywords to be filtered; if so, performing filtering processing on the text picture.
2. the method for claim 1, is characterized in that, described, the text picture carried out the OCR processing, before extracting the text message in text picture, also comprises:
Described text picture is carried out binary conversion treatment.
3. method as claimed in claim 2, is characterized in that, described described text picture is carried out binary conversion treatment before, also comprise:
Described text picture is carried out gray processing to be processed.
4. method as claimed in claim 3, is characterized in that, described, the text picture carried out the OCR processing, before extracting the text message in text picture, also comprises: described text picture is removed noise processed.
5. The method of any one of claims 1-4, characterized in that performing OCR processing on the text picture and extracting the text information in the text picture specifically comprises:
performing character segmentation on the image of the text picture;
dividing the characters segmented from the text picture into set units, and recognizing the characters in each set unit as follows:
after performing feature extraction and feature matching on each character in the set unit, determining the candidate words of each character;
for each character in the set unit, determining the similarity of each candidate word of the character, and the transition probability between each candidate word of the character and the candidate words of the character adjacent to that character;
determining the recognition result of the characters in the set unit according to the determined similarities and transition probabilities;
determining the text information in the text picture according to the recognition results of the characters in each set unit.
6. The method of claim 5, characterized in that determining the recognition result of the characters in the set unit according to the determined similarities and transition probabilities specifically comprises:
determining the Viterbi probability of each candidate word of the 1st character in the set unit to be the similarity of that candidate word;
starting from the 2nd character in the set unit, for each candidate word of the current character, determining the Viterbi probability between that candidate word of the current character and each candidate word of the preceding character, according to the similarity of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character adjacent to the current character;
after determining the Viterbi probability between the current candidate word and each preceding candidate word, comparing the Viterbi probabilities and selecting the largest one as the Viterbi probability of the current candidate word; wherein the current candidate word is one of the candidate words of the current character, and a preceding candidate word is one of the candidate words of the preceding character;
taking the current candidate word as the current node, and selecting the preceding candidate word that has the largest Viterbi probability with the current candidate word as the preceding node adjacent to the current candidate word;
determining candidate paths; wherein each node in a candidate path is a candidate word selected for one character in the set unit, and adjacent nodes in the same candidate path are determined according to the preceding node of each candidate word;
comparing the Viterbi probabilities of the final nodes of the candidate paths, and taking the candidate path whose final node has the largest Viterbi probability as the recognition result.
7. The method of claim 6, characterized in that determining the Viterbi probability between each candidate word of the current character and each candidate word of the preceding character, according to the similarity of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character adjacent to the current character, is specifically performed according to Formula 5 or Formula 4 below:
$P_v = P_2 \times R \times P_v'$ (Formula 5)
$\log P_v = b \log P_2 + c \log R + d \log P_v'$ (Formula 4)
wherein $P_v$ is the Viterbi probability between the current candidate word and the preceding candidate word; $P_1$ is the probability of occurrence of the current candidate word; $P_2$ is the transition probability between the preceding candidate word and the current candidate word; $R$ is the similarity of the current candidate word; $P_v'$ is the Viterbi probability of the preceding candidate word; $\log P_v$, $\log P_1$, $\log P_2$, $\log R$, and $\log P_v'$ are the logarithms of $P_v$, $P_1$, $P_2$, $R$, and $P_v'$, respectively; and b, c, and d are preset weights.
8. The method of claim 5, characterized in that the recognition result is also determined according to the probability of occurrence of each candidate word of each character in the set unit; and
determining the recognition result of the characters in the set unit according to the determined similarities and transition probabilities, and according to the probability of occurrence of each candidate word of each character in the set unit, specifically comprises:
for each candidate word of the 1st character in the set unit, determining its Viterbi probability according to the similarity of the candidate word and/or the probability of occurrence of the candidate word;
starting from the 2nd character in the set unit, for each candidate word of the current character, determining the Viterbi probability between that candidate word of the current character and each candidate word of the preceding character, according to the similarity and the probability of occurrence of the candidate word and the transition probability between the candidate word and the candidate words of the preceding character adjacent to the current character;
after determining the Viterbi probability between the current candidate word and each preceding candidate word, comparing the Viterbi probabilities and selecting the largest one as the Viterbi probability of the current candidate word; wherein the current candidate word is one of the candidate words of the current character, and a preceding candidate word is one of the candidate words of the preceding character;
taking the current candidate word as the current node, and selecting the preceding candidate word that has the largest Viterbi probability with the current candidate word as the preceding node adjacent to the current candidate word;
determining candidate paths; wherein each node in a candidate path is a candidate word selected for one character in the set unit, and adjacent nodes in the same candidate path are determined according to the preceding node of each candidate word;
comparing the Viterbi probabilities of the final nodes of the candidate paths, and taking the candidate path whose final node has the largest Viterbi probability as the recognition result.
9. A picture auditing system, characterized by comprising:
a text information extraction module, configured to perform optical character recognition (OCR) processing on a text picture and extract the text information in the text picture;
a filtering module, configured to perform keyword matching on the text information extracted by the text information extraction module and judge whether it contains keywords to be filtered; if so, to perform filtering processing on the text picture.
10. The system of claim 9, characterized by further comprising a preprocessing module;
the preprocessing module is configured to preprocess the text picture and output the preprocessed text picture to the text information extraction module; wherein
the preprocessing module specifically comprises: a binarization unit configured to binarize the text picture; or
the preprocessing module specifically comprises: a grayscale unit configured to convert the text picture to grayscale and output it, and a binarization unit configured to binarize the text picture output by the grayscale unit; or
the preprocessing module specifically comprises: a grayscale unit configured to convert the text picture to grayscale and output it; a binarization unit configured to binarize the text picture output by the grayscale unit and output the result; and a noise removal unit configured to remove noise from the text picture output by the binarization unit.
CN2013100587586A 2013-02-25 2013-02-25 Picture auditing method and system Pending CN103116752A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013100587586A CN103116752A (en) 2013-02-25 2013-02-25 Picture auditing method and system

Publications (1)

Publication Number Publication Date
CN103116752A true CN103116752A (en) 2013-05-22

Family

ID=48415124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013100587586A Pending CN103116752A (en) 2013-02-25 2013-02-25 Picture auditing method and system

Country Status (1)

Country Link
CN (1) CN103116752A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1719454A (en) * 2005-07-15 2006-01-11 清华大学 Off-line hand writing Chinese character segmentation method with compromised geomotric cast and sematic discrimination cost
CN101510258A (en) * 2009-01-16 2009-08-19 北京中星微电子有限公司 Certificate verification method, system and certificate verification terminal
CN101887523A (en) * 2010-06-21 2010-11-17 南京邮电大学 Method for detecting image spam email by picture character and local invariant feature
CN101996180A (en) * 2009-08-12 2011-03-30 升东网络科技发展(上海)有限公司 Picture examination and filter system and method
CN102254157A (en) * 2011-07-07 2011-11-23 北京文通图像识别技术研究中心有限公司 Evaluating method for searching character segmenting position between two adjacent characters
CN102902675A (en) * 2011-07-26 2013-01-30 腾讯科技(深圳)有限公司 Picture content approval method and device

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105518712A (en) * 2015-05-28 2016-04-20 北京旷视科技有限公司 Keyword notification method, equipment and computer program product based on character recognition
CN105518712B (en) * 2015-05-28 2021-05-11 北京旷视科技有限公司 Keyword notification method and device based on character recognition
CN108734159A (en) * 2017-04-18 2018-11-02 苏宁云商集团股份有限公司 The detection method and system of sensitive information in a kind of image
CN108734159B (en) * 2017-04-18 2022-06-03 苏宁易购集团股份有限公司 Method and system for detecting sensitive information in image
CN110414450A (en) * 2019-07-31 2019-11-05 北京字节跳动网络技术有限公司 Keyword detection method, apparatus, storage medium and electronic equipment
CN110992446A (en) * 2019-12-04 2020-04-10 杭州三体视讯科技有限公司 Picture auditing method
CN111767422A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Data auditing method, device, terminal and storage medium
CN113891120A (en) * 2021-09-29 2022-01-04 广东省高峰科技有限公司 IPTV service terminal access method and system thereof

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20130522