CN104750663A - Identification method and device for text messy codes in page - Google Patents
Identification method and device for text messy codes in page
- Publication number: CN104750663A (application CN201310737443.4A)
- Authority: CN (China)
- Prior art keywords: text, coded format, characteristic information, page, mess code
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention provides a method and a device for identifying garbled text (messy codes) in a page. The method includes: obtaining a first encoding format of a first text to be identified in the page; converting the first text into a second text in a second encoding format according to the correspondence between the characters of the second encoding format and the characters of other encoding formats; converting the second text into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format; and determining, from the third text and the first text, whether garbled text exists in the first text. The method and device require no operator involvement in the identification process, are simple to operate and highly accurate, and thereby improve the efficiency and reliability of identifying garbled text.
Description
[Technical Field]
The present application relates to World Wide Web (Web) page processing technology, and in particular to a method and device for identifying garbled text (messy codes) in a page.
[Background]
A World Wide Web (Web) page may contain display blocks, each composed of one or more HyperText Markup Language (HTML) tags. These blocks are called page elements, for example text, labels, hyperlinks, buttons, input boxes, drop-down boxes, and so on. For reasons such as the way a Web page is parsed, the text in the page may appear garbled. In the prior art, operators have to check Web pages one by one to find out whether the text in each page is garbled.
However, this existing way of identifying garbled text takes a long time and is error-prone, which lowers both the efficiency and the reliability of identification.
[Summary of the Invention]
Various aspects of the present application provide a method and device for identifying garbled text in a page, so as to improve the efficiency and reliability of identifying garbled text.
One aspect of the present application provides a method for identifying garbled text in a page, comprising:
obtaining a first encoding format of a first text to be identified in the page;
converting the first text into a second text according to the correspondence between the characters of a second encoding format and the characters of other encoding formats, the encoding format of the second text being the second encoding format;
converting the second text into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format;
determining, according to the third text and the first text, whether garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the second encoding format comprises the Unicode encoding format.
In the aspect above and any possible implementation thereof, an implementation is further provided in which determining, according to the third text and the first text, whether garbled text exists in the first text comprises:
comparing the third text with the first text;
if the third text is inconsistent with the first text, determining that garbled text exists in the first text; or
if the third text is consistent with the first text, determining that no garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which comparing the third text with the first text comprises:
extracting characteristic information of the third text and characteristic information of the first text;
comparing the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from that of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to that of the first text, the third text is consistent with the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the characteristic information comprises an MD5 value.
Another aspect of the present application provides a device for identifying garbled text in a page, comprising:
an acquiring unit, configured to obtain a first encoding format of a first text to be identified in the page;
a converting unit, configured to convert the first text into a second text according to the correspondence between the characters of a second encoding format and the characters of other encoding formats, the encoding format of the second text being the second encoding format;
the converting unit being further configured to convert the second text into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format;
a determining unit, configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the second encoding format comprises the Unicode encoding format.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the determining unit is specifically configured to:
compare the third text with the first text;
if the third text is inconsistent with the first text, determine that garbled text exists in the first text; or
if the third text is consistent with the first text, determine that no garbled text exists in the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the determining unit is specifically configured to:
extract characteristic information of the third text and characteristic information of the first text;
compare the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from that of the first text, treat the third text as inconsistent with the first text; or
if the characteristic information of the third text is identical to that of the first text, treat the third text as consistent with the first text.
In the aspect above and any possible implementation thereof, an implementation is further provided in which the characteristic information comprises an MD5 value.
As can be seen from the above technical solutions, the embodiments of the present application obtain the first encoding format of the first text to be identified in the page; convert the first text into a second text in the second encoding format according to the correspondence between the characters of the second encoding format and the characters of other encoding formats; and then convert the second text into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format. Whether garbled text exists in the first text can then be determined from the third text and the first text. No operator needs to take part in the identification process, the operation is simple, and the accuracy is high, which improves the efficiency and reliability of identifying garbled text.
[Brief Description of the Drawings]
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for identifying garbled text in a page according to an embodiment of the present application;
Fig. 2 is a schematic structural diagram of a device for identifying garbled text in a page according to another embodiment of the present application.
[Detailed Description]
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.
It should be understood that a page as referred to in the present application may be a web page written in HyperText Markup Language (HTML), also called a Web page.
It should be noted that the terminals referred to in the embodiments of the present application may include, but are not limited to, a mobile phone, a personal digital assistant (PDA), a wireless handheld device, a wireless netbook, a portable computer, a personal computer (PC), an MP3 player, an MP4 player, and so on.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a schematic flowchart of a method for identifying garbled text in a page according to an embodiment of the present application, as shown in Fig. 1.
101. Obtain the first encoding format of the first text to be identified in the page.
The first encoding format may be any text encoding available in the prior art, for example GBK, UTF-8, or GB2312; this embodiment places no particular limitation on it.
GBK is one of the computer encoding standards for Chinese characters. Its full name is the "Chinese Internal Code Specification". "GBK" is formed from the initial letters of the pinyin for "national standard" (guobiao) and "extension" (kuozhan), and it is also known as the Chinese character extension code.
UTF is the abbreviation of "UCS Transformation Format", which can be rendered as the Unicode character set transformation format.
Optionally, in one possible implementation of this embodiment, in 101 the first encoding format of the first text to be identified may be obtained from information related to the page.
For example, from the META tag of the page, i.e. <meta http-equiv="Content-Type" content="text/html; charset=gb2312">, the first encoding format of the first text to be identified in the page is found to be GB2312.
As another example, from the declaration in the page's Cascading Style Sheet (CSS) file, i.e. @charset "UTF-8", the first encoding format of the first text to be identified in the page is found to be UTF-8.
As a further example, the first encoding format of the first text to be identified may be obtained from the website to which the page belongs. For instance, Baidu uses GB2312 encoding and Google uses UTF-8 encoding.
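The three sources of encoding information described for step 101 could be sketched in Python as follows. This is only an illustrative sketch, not the patent's implementation: the function name detect_first_encoding, the regular expressions, and the SITE_DEFAULTS table (with its example host) are assumptions introduced here.

```python
import re
from typing import Optional

# Matches charset=... inside an HTML <meta> tag (http-equiv form or <meta charset=...>).
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_\-]+)", re.IGNORECASE
)
# Matches a CSS rule of the form @charset "UTF-8".
CSS_CHARSET = re.compile(rb"@charset\s+[\"']([A-Za-z0-9_\-]+)[\"']", re.IGNORECASE)

# Hypothetical table of per-site default encodings (illustrative host name only).
SITE_DEFAULTS = {"www.example.cn": "gb2312"}


def detect_first_encoding(html: bytes, css: bytes = b"", host: str = "") -> Optional[str]:
    """Return the declared encoding name in lower case, or None if nothing is found."""
    # Try the META tag first, then the CSS declaration.
    for pattern, data in ((META_CHARSET, html), (CSS_CHARSET, css)):
        match = pattern.search(data)
        if match:
            return match.group(1).decode("ascii").lower()
    # Fall back to what is known about the site the page belongs to.
    return SITE_DEFAULTS.get(host)
```

The order of the lookups (META tag, then CSS, then site default) is a design choice of this sketch; the source lists the three options as alternatives without ranking them.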
102. Convert the first text into a second text according to the correspondence between the characters of the second encoding format and the characters of other encoding formats, the encoding format of the second text being the second encoding format.
Optionally, in one possible implementation of this embodiment, the second encoding format may include, but is not limited to, the Unicode encoding format. Unicode is known in Chinese by several names (variously translated as universal code, international code, unified code, or single code). It defines a unique code (that is, an integer) for each character rather than for each glyph, for example a unique binary code.
During the conversion, if a character in the first text has a corresponding character in the second encoding format, it is converted into that character. If a character in the first text has no corresponding character in the second encoding format, a pre-configured operation is performed instead, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
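As a rough illustration of step 102, Python's built-in codecs can stand in for the character correspondence tables, with the codec error handler playing the role of the pre-configured operation. The function name to_second_text is an assumption made for this sketch; errors="ignore" discards an unmappable byte, while errors="replace" substitutes U+FFFD.

```python
def to_second_text(first_text: bytes, first_encoding: str, fallback: str = "replace") -> str:
    """Decode the first text into Unicode (the second text).

    fallback="ignore" drops bytes that have no Unicode counterpart;
    fallback="replace" substitutes the replacement character U+FFFD.
    """
    return first_text.decode(first_encoding, errors=fallback)


# b"\xd6\xd0" is a valid GB2312 character; the trailing 0xFF byte is not.
damaged = b"\xd6\xd0\xff"
dropped = to_second_text(damaged, "gb2312", fallback="ignore")
substituted = to_second_text(damaged, "gb2312", fallback="replace")
```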
103. Convert the second text into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format.
During the conversion, if a character in the second text has a corresponding character in the first encoding format, it is converted into that character. If a character in the second text has no corresponding character in the first encoding format, a pre-configured operation is performed instead, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
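Steps 102 and 103 together form a round trip through Unicode. A minimal sketch, again using Python's codecs as a stand-in for the correspondence tables and errors="replace" as the assumed pre-configured operation (the name round_trip is hypothetical):

```python
def round_trip(first_text: bytes, first_encoding: str) -> bytes:
    """Convert first text -> second text (Unicode) -> third text (first encoding)."""
    # Step 102: bytes in the first encoding become a Unicode string.
    second_text = first_text.decode(first_encoding, errors="replace")
    # Step 103: the Unicode string is converted back to the first encoding.
    third_text = second_text.encode(first_encoding, errors="replace")
    return third_text


clean = b"\xd6\xd0\xce\xc4"   # two valid GB2312 characters
damaged = b"\xd6\xd0\xff"     # one valid character followed by an invalid byte
```

Well-formed input survives the round trip unchanged, whereas a byte that could not be mapped is lost or substituted along the way, so the third text no longer equals the first text.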
104. Determine, according to the third text and the first text, whether garbled text exists in the first text.
Optionally, in one possible implementation of this embodiment, in 104 the third text may be compared with the first text. If the third text is inconsistent with the first text, it can be determined that garbled text exists in the first text; if the third text is consistent with the first text, it can be determined that no garbled text exists in the first text.
Specifically, the two texts, namely the third text and the first text, can be compared in many ways.
For example, the two texts may be matched character by character to judge whether their characters are all identical.
As another example, characteristic information of the third text and of the first text may be extracted, for example a Message Digest Algorithm 5 (MD5) value, and the characteristic information of the two texts compared. If the characteristic information of the third text differs from that of the first text, the third text is inconsistent with the first text; if the characteristic information of the third text is identical to that of the first text, the third text is consistent with the first text.
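Both comparison options for step 104 can be sketched with Python's standard hashlib module. The function names md5_of and has_garbled_text and the use_md5 flag are assumptions made for illustration.

```python
import hashlib


def md5_of(text: bytes) -> str:
    """Return the MD5 digest of the text as a hexadecimal string."""
    return hashlib.md5(text).hexdigest()


def has_garbled_text(first_text: bytes, third_text: bytes, use_md5: bool = True) -> bool:
    """Report garbled text when the third text is inconsistent with the first text."""
    if use_md5:
        # Compare characteristic information (MD5 values) of the two texts.
        return md5_of(first_text) != md5_of(third_text)
    # Direct comparison of the two texts.
    return first_text != third_text
```

Comparing digests instead of full texts is mainly useful when the characteristic information is stored or transmitted separately from the texts; for two texts held in memory, the direct comparison gives the same answer.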
It should be noted that steps 101 to 104 may be executed by an identification device, for example a Web page editor. The device may be located in a local client to perform offline identification, or in a network-side server to perform online identification; this embodiment places no limitation on this.
It should be understood that the client may be an application installed on a terminal, or a web page in a browser, or any other objectively existing form capable of processing the page; this embodiment places no limitation on this.
Existing identification methods require operators to check Web pages one by one to find out whether the text in each page is garbled. Checking pages for garbled text manually, however, tends to cause two problems.
First, efficiency is very low. Especially on a moderately large website, the sub-pages alone number in the hundreds of thousands, and operators cannot check them one by one.
Second, manual inspection easily misses garbled text in a page. For example, when a page contains a great deal of text but only a little of it is garbled, operators can hardly spot it with the naked eye.
The technical solution provided by this embodiment requires no operator involvement, is simple to operate, and is highly accurate.
In this embodiment, the first encoding format of the first text to be identified in the page is obtained; the first text is converted into a second text in the second encoding format according to the correspondence between the characters of the second encoding format and the characters of other encoding formats; and the second text is then converted into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format. Whether garbled text exists in the first text can then be determined from the third text and the first text. No operator needs to take part in the identification process, the operation is simple, and the accuracy is high, which improves the efficiency and reliability of identifying garbled text.
In addition, the technical solution provided by the present application can identify garbled text in a page automatically, with good real-time performance.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. A person skilled in the art should understand, however, that the present application is not limited by the described order of actions, because according to the present application some steps may be performed in other orders or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions in other embodiments.
Fig. 2 is a schematic structural diagram of a device for identifying garbled text in a page according to another embodiment of the present application, as shown in Fig. 2. The device of this embodiment may comprise an acquiring unit 21, a converting unit 22, and a determining unit 23. The acquiring unit 21 is configured to obtain the first encoding format of the first text to be identified in the page. The converting unit 22 is configured to convert the first text into a second text according to the correspondence between the characters of the second encoding format and the characters of other encoding formats, the encoding format of the second text being the second encoding format. The converting unit 22 is further configured to convert the second text into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format. The determining unit 23 is configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
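The division into three units could be mirrored in code roughly as follows. This is a sketch under several assumptions: the class and method names are invented here, the whole page is treated as the first text, the encoding is read only from a charset declaration (with UTF-8 as an assumed default), and Python's codecs stand in for the character correspondence tables.

```python
import re
from dataclasses import dataclass, field


class AcquiringUnit:
    """Counterpart of unit 21: obtains the first encoding format."""

    _charset = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE)

    def acquire(self, page: bytes, default: str = "utf-8") -> str:
        match = self._charset.search(page)
        return match.group(1).decode("ascii").lower() if match else default


class ConvertingUnit:
    """Counterpart of unit 22: performs both conversions."""

    def to_second(self, first_text: bytes, first_encoding: str) -> str:
        return first_text.decode(first_encoding, errors="replace")

    def to_third(self, second_text: str, first_encoding: str) -> bytes:
        return second_text.encode(first_encoding, errors="replace")


class DeterminingUnit:
    """Counterpart of unit 23: compares the third text with the first text."""

    def has_garbled_text(self, first_text: bytes, third_text: bytes) -> bool:
        return first_text != third_text


@dataclass
class RecognitionDevice:
    acquiring: AcquiringUnit = field(default_factory=AcquiringUnit)
    converting: ConvertingUnit = field(default_factory=ConvertingUnit)
    determining: DeterminingUnit = field(default_factory=DeterminingUnit)

    def check(self, page: bytes) -> bool:
        """Return True if the page appears to contain garbled text."""
        encoding = self.acquiring.acquire(page)
        second = self.converting.to_second(page, encoding)
        third = self.converting.to_third(second, encoding)
        return self.determining.has_garbled_text(page, third)
```

Note that an encoding name unknown to Python would raise LookupError in this sketch; a fuller version would need to handle that case.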
The first encoding format may be any text encoding available in the prior art, for example GBK, UTF-8, or GB2312; this embodiment places no particular limitation on it.
GBK is one of the computer encoding standards for Chinese characters. Its full name is the "Chinese Internal Code Specification". "GBK" is formed from the initial letters of the pinyin for "national standard" (guobiao) and "extension" (kuozhan), and it is also known as the Chinese character extension code.
UTF is the abbreviation of "UCS Transformation Format", which can be rendered as the Unicode character set transformation format.
Optionally, in one possible implementation of this embodiment, the acquiring unit 21 may obtain the first encoding format of the first text to be identified from information related to the page.
For example, from the META tag of the page, i.e. <meta http-equiv="Content-Type" content="text/html; charset=gb2312">, the acquiring unit 21 may find that the first encoding format of the first text to be identified in the page is GB2312.
As another example, from the declaration in the page's Cascading Style Sheet (CSS) file, i.e. @charset "UTF-8", the acquiring unit 21 may find that the first encoding format of the first text to be identified in the page is UTF-8.
As a further example, the acquiring unit 21 may obtain the first encoding format of the first text to be identified from the website to which the page belongs. For instance, Baidu uses GB2312 encoding and Google uses UTF-8 encoding.
Optionally, in one possible implementation of this embodiment, the second encoding format may include, but is not limited to, the Unicode encoding format. Unicode is known in Chinese by several names (variously translated as universal code, international code, unified code, or single code). It defines a unique code (that is, an integer) for each character rather than for each glyph, for example a unique binary code.
Specifically, when the converting unit 22 performs the first conversion, if a character in the first text has a corresponding character in the second encoding format, it is converted into that character. If a character in the first text has no corresponding character in the second encoding format, a pre-configured operation is performed instead, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
Specifically, when the converting unit 22 performs the second conversion, if a character in the second text has a corresponding character in the first encoding format, it is converted into that character. If a character in the second text has no corresponding character in the first encoding format, a pre-configured operation is performed instead, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
Optionally, in one possible implementation of this embodiment, the determining unit 23 may specifically be configured to compare the third text with the first text. If the third text is inconsistent with the first text, it can be determined that garbled text exists in the first text; if the third text is consistent with the first text, it can be determined that no garbled text exists in the first text.
Specifically, the determining unit 23 can compare the two texts, namely the third text and the first text, in many ways.
For example, the determining unit 23 may match the two texts character by character to judge whether their characters are all identical.
As another example, the determining unit 23 may extract characteristic information of the third text and of the first text, for example a Message Digest Algorithm 5 (MD5) value, and compare the characteristic information of the two texts. If the characteristic information of the third text differs from that of the first text, the third text is inconsistent with the first text; if the characteristic information of the third text is identical to that of the first text, the third text is consistent with the first text.
It should be noted that the device for identifying garbled text in a page provided by this embodiment, for example a Web page editor, may be located in a local client to perform offline identification, or in a network-side server to perform online identification; this embodiment places no limitation on this.
It should be understood that the client may be an application installed on a terminal, or a web page in a browser, or any other objectively existing form capable of processing the page; this embodiment places no limitation on this.
Existing identification devices require operators to check Web pages one by one to find out whether the text in each page is garbled. Checking pages for garbled text manually, however, tends to cause two problems.
First, efficiency is very low. Especially on a moderately large website, the sub-pages alone number in the hundreds of thousands, and operators cannot check them one by one.
Second, manual inspection easily misses garbled text in a page. For example, when a page contains a great deal of text but only a little of it is garbled, operators can hardly spot it with the naked eye.
The technical solution provided by this embodiment requires no operator involvement, is simple to operate, and is highly accurate.
In this embodiment, the acquiring unit obtains the first encoding format of the first text to be identified in the page; the converting unit converts the first text into a second text in the second encoding format according to the correspondence between the characters of the second encoding format and the characters of other encoding formats, and then converts the second text into a third text according to the correspondence between the characters of the second encoding format and the characters of the first encoding format. The determining unit can then determine from the third text and the first text whether garbled text exists in the first text. No operator needs to take part in the identification process, the operation is simple, and the accuracy is high, which improves the efficiency and reliability of identifying garbled text.
In addition, the technical solution provided by the present application can identify garbled text in a page automatically, with good real-time performance.
Those skilled in the art can be well understood to, and for convenience and simplicity of description, the system of foregoing description, the specific works process of device and unit, with reference to the corresponding process in preceding method embodiment, can not repeat them here.
In several embodiments that the application provides, should be understood that, disclosed system, apparatus and method, can realize by another way.Such as, device embodiment described above is only schematic, such as, the division of described unit, be only a kind of logic function to divide, actual can have other dividing mode when realizing, such as multiple unit or assembly can in conjunction with or another system can be integrated into, or some features can be ignored, or do not perform.Another point, shown or discussed coupling each other or direct-coupling or communication connection can be by some interfaces, and the indirect coupling of device or unit or communication connection can be electrical, machinery or other form.
The described unit illustrated as separating component or can may not be and physically separates, and the parts as unit display can be or may not be physical location, namely can be positioned at a place, or also can be distributed in multiple network element.Some or all of unit wherein can be selected according to the actual needs to realize the object of the present embodiment scheme.
In addition, each functional unit in each embodiment of the application can be integrated in a processing unit, also can be that the independent physics of unit exists, also can two or more unit in a unit integrated.Above-mentioned integrated unit both can adopt the form of hardware to realize, and the form that hardware also can be adopted to add SFU software functional unit realizes.
The above-mentioned integrated unit realized with the form of SFU software functional unit, can be stored in a computer read/write memory medium.Above-mentioned SFU software functional unit is stored in a storage medium, comprising some instructions in order to make a computer equipment (can be personal computer, server, or the network equipment etc.) or processor (processor) perform the part steps of method described in each embodiment of the application.And aforesaid storage medium comprises: USB flash disk, portable hard drive, ROM (read-only memory) (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disc or CD etc. various can be program code stored medium.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (10)
1. A method for identifying garbled text in a page, characterized by comprising:
obtaining a first encoding format of a first text to be identified in the page;
converting the first text into a second text according to a correspondence between the characters of a second encoding format and the characters of other encoding formats, the encoding format of the second text being the second encoding format;
converting the second text into a third text according to a correspondence between the characters of the second encoding format and the characters of the first encoding format; and
determining, according to the third text and the first text, whether garbled text exists in the first text.
2. The method according to claim 1, characterized in that the second encoding format comprises the Unicode encoding format.
3. The method according to claim 1, characterized in that determining, according to the third text and the first text, whether garbled text exists in the first text comprises:
comparing the third text with the first text;
if the third text is inconsistent with the first text, determining that garbled text exists in the first text; or
if the third text is consistent with the first text, determining that no garbled text exists in the first text.
4. The method according to claim 3, characterized in that comparing the third text with the first text comprises:
extracting characteristic information of the third text and characteristic information of the first text;
comparing the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from the characteristic information of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to the characteristic information of the first text, the third text is consistent with the first text.
5. The method according to any one of claims 1 to 4, characterized in that the characteristic information comprises an MD5 value.
6. A device for identifying garbled text in a page, characterized by comprising:
an acquiring unit, configured to obtain a first encoding format of a first text to be identified in the page;
a conversion unit, configured to convert the first text into a second text according to a correspondence between the characters of a second encoding format and the characters of other encoding formats, the encoding format of the second text being the second encoding format;
the conversion unit being further configured to convert the second text into a third text according to a correspondence between the characters of the second encoding format and the characters of the first encoding format; and
a determining unit, configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
7. The device according to claim 6, characterized in that the second encoding format comprises the Unicode encoding format.
8. The device according to claim 6, characterized in that the determining unit is specifically configured to:
compare the third text with the first text;
if the third text is inconsistent with the first text, determine that garbled text exists in the first text; or
if the third text is consistent with the first text, determine that no garbled text exists in the first text.
9. The device according to claim 8, characterized in that the determining unit is specifically configured to:
extract characteristic information of the third text and characteristic information of the first text;
compare the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from the characteristic information of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to the characteristic information of the first text, the third text is consistent with the first text.
10. The device according to any one of claims 6 to 9, characterized in that the characteristic information comprises an MD5 value.
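The round trip recited in the claims can be illustrated with a short Python sketch. It is not the patented implementation, only one possible reading under stated assumptions: the first text is taken to be the raw bytes of the page text, the second encoding format is Unicode (a Python `str`), and undecodable or unencodable data is substituted through `errors="replace"`, a choice the claims do not specify. The names `md5_hex` and `has_garbled_text` are hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of the given bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def has_garbled_text(first_text: bytes, first_encoding: str) -> bool:
    """Round-trip check: first encoding -> Unicode -> first encoding.

    Hypothetical helper; returns True when the round trip changes the bytes.
    """
    # Second text: the first text converted to Unicode (a Python str).
    # Assumption: bytes that cannot be decoded are replaced with U+FFFD.
    second_text = first_text.decode(first_encoding, errors="replace")
    # Third text: the second text converted back to the first encoding.
    # Assumption: characters that cannot be encoded are replaced as well.
    third_text = second_text.encode(first_encoding, errors="replace")
    # Compare characteristic information (MD5 values) of the two texts.
    return md5_hex(third_text) != md5_hex(first_text)


gbk_page = "中文".encode("gbk")
print(has_garbled_text(gbk_page, "gbk"))    # declared encoding matches the bytes
print(has_garbled_text(gbk_page, "utf-8"))  # declared encoding does not match
```

In this sketch the check flags only bytes that are invalid in the declared encoding. Bytes that are valid in that encoding but stand for the wrong characters survive the round trip unchanged and are not reported.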
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310737443.4A CN104750663B (en) | 2013-12-27 | 2013-12-27 | The recognition methods of text messy code and device in the page |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310737443.4A CN104750663B (en) | 2013-12-27 | 2013-12-27 | The recognition methods of text messy code and device in the page |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104750663A (en) | 2015-07-01 |
CN104750663B (en) | 2019-05-28 |
Family
ID=53590375
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310737443.4A Active CN104750663B (en) | 2013-12-27 | 2013-12-27 | The recognition methods of text messy code and device in the page |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104750663B (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279247A (en) * | 2015-09-30 | 2016-01-27 | 北京奇虎科技有限公司 | Expression library generation method and device |
CN106598689A (en) * | 2016-12-20 | 2017-04-26 | 绿金在线电子商务有限公司 | Universal Chinese coding method |
CN108271041A (en) * | 2016-12-30 | 2018-07-10 | 北京国双科技有限公司 | Mess code treating method and apparatus |
CN110728115A (en) * | 2018-07-17 | 2020-01-24 | 珠海金山办公软件有限公司 | Disordered code identification method and device for document content and electronic equipment |
CN111259628A (en) * | 2020-02-18 | 2020-06-09 | 北京金堤科技有限公司 | Webpage information extraction method and device, electronic equipment and storage medium |
CN113595683A (en) * | 2021-07-07 | 2021-11-02 | 西安震有信通科技有限公司 | Conversion processing method, device, terminal and medium based on various encoding files |
CN115348232A (en) * | 2022-08-10 | 2022-11-15 | 中国建设银行股份有限公司 | Decoding method, apparatus, electronic device, medium, and product |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101110072A (en) * | 2007-08-21 | 2008-01-23 | 无敌科技(西安)有限公司 | Device and method for automatic identifying literal code |
CN101551792A (en) * | 2008-04-03 | 2009-10-07 | 鸿富锦精密工业(深圳)有限公司 | Messy code recovery system and method |
JP2010128672A (en) * | 2008-11-26 | 2010-06-10 | Kyocera Corp | Electronic apparatus and character conversion method |
CN103150293A (en) * | 2011-12-06 | 2013-06-12 | 富泰华工业(深圳)有限公司 | Electronic device with messy code recovery function and messy code recovery method |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105279247A (en) * | 2015-09-30 | 2016-01-27 | 北京奇虎科技有限公司 | Expression library generation method and device |
CN106598689A (en) * | 2016-12-20 | 2017-04-26 | 绿金在线电子商务有限公司 | Universal Chinese coding method |
CN108271041A (en) * | 2016-12-30 | 2018-07-10 | 北京国双科技有限公司 | Mess code treating method and apparatus |
CN108271041B (en) * | 2016-12-30 | 2021-01-22 | 北京国双科技有限公司 | Method and device for processing messy codes |
CN110728115A (en) * | 2018-07-17 | 2020-01-24 | 珠海金山办公软件有限公司 | Disordered code identification method and device for document content and electronic equipment |
CN110728115B (en) * | 2018-07-17 | 2024-01-26 | 珠海金山办公软件有限公司 | Document content messy code identification method and device and electronic equipment |
CN111259628A (en) * | 2020-02-18 | 2020-06-09 | 北京金堤科技有限公司 | Webpage information extraction method and device, electronic equipment and storage medium |
CN113595683A (en) * | 2021-07-07 | 2021-11-02 | 西安震有信通科技有限公司 | Conversion processing method, device, terminal and medium based on various encoding files |
CN115348232A (en) * | 2022-08-10 | 2022-11-15 | 中国建设银行股份有限公司 | Decoding method, apparatus, electronic device, medium, and product |
CN115348232B (en) * | 2022-08-10 | 2024-04-19 | 中国建设银行股份有限公司 | Decoding method, decoding device, electronic equipment, medium and product |
Also Published As
Publication number | Publication date |
---|---|
CN104750663B (en) | 2019-05-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2024-04-02. Address after: Singapore. Patentee after: Alibaba Singapore Holdings Ltd. Country or region after: Singapore. Address before: Fourth floor, Capital Building, P.O. Box 847, Grand Cayman, Cayman Islands. Patentee before: ALIBABA GROUP HOLDING Ltd. Country or region before: Cayman Islands |