CN115545009A - Data processing system for acquiring target text - Google Patents
- Publication number
- CN115545009A
- Authority
- CN
- China
- Prior art keywords
- initial
- text
- obtaining
- feature vector
- size
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/1444—Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides a data processing system for acquiring a target text. The system comprises a processor and a memory storing a computer program that, when executed by the processor, performs the following steps: obtaining, for any initial text in an initial text set, the initial text character string corresponding to that initial text; obtaining an initial word vector set corresponding to the initial text character string; obtaining key feature vectors corresponding to the initial text character string from the initial image corresponding to that character string; obtaining a target word vector set from the initial word vector set and the key feature vectors; and obtaining the target text corresponding to the initial text character string from the target word vector set. The system enriches the vector features and avoids omitting character features, which improves the accuracy of natural language processing and of text classification and, in turn, the accuracy of the obtained target text.
Description
Technical Field
The invention relates to the technical field of text processing, in particular to a data processing system for acquiring a target text.
Background
With the popularization and development of the internet, text data has grown explosively. How to extract meaningful information from massive text data is a research hotspot in natural language processing, and text classification is a major topic in both natural language processing and text recognition. In recent years, text classification has been applied in many fields, such as information retrieval, information push and information filtering; classifying texts accurately shortens the time needed to obtain the important information in them.
In the prior art, one method for acquiring a target text is as follows: obtain the word vectors of a text; obtain corresponding feature vectors according to the way the characters in the text are written, their radicals and their pinyin; combine the word vectors and the feature vectors to generate text vectors; and classify the text vectors to obtain abnormal texts.
In summary, the above text classification method has the following problems. On the one hand, the characters in the text are limited to Chinese characters, which restricts the texts that can be selected for classification. On the other hand, the image features and character feature information of the characters in the text are not considered, so features of the text characters are omitted; this lowers the accuracy of natural language processing and of text classification, and therefore the accuracy of the obtained target text.
Disclosure of Invention
The invention provides a data processing system for acquiring a target text, which comprises: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program that, when executed by the processor, performs the steps of:
S100, according to any initial text in the initial text set, acquiring the initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the initial character string corresponding to the initial text, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in that character string.
S200, according to $A$, obtaining the initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$.
S300, according to the initial image corresponding to $A$, obtaining the key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$.
S400, according to $B$ and $D$, obtaining the target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$.
S500, according to $U$, acquiring the target text corresponding to $A$.
Compared with the prior art, the data processing system for acquiring the target text provided by the invention offers clear benefits, represents technical progress with practical and industrial value, and has at least the following beneficial effects:
The invention provides a data processing system for acquiring a target text. The system comprises a processor and a memory storing a computer program which, when executed by the processor, performs the following steps: obtaining, for any initial text in the initial text set, the initial text character string corresponding to that initial text, where the initial text characters include at least Chinese characters, English characters and punctuation characters; obtaining an initial word vector set corresponding to the initial text character string; obtaining key feature vectors corresponding to the initial text character string from the initial image corresponding to that character string, where the key feature vectors include image features and character feature information of the initial text, the image features include the positions, font sizes and colors of the text characters, and the character feature information includes underlines, italics and the like; obtaining a target word vector set from the initial word vector set and the key feature vectors; and obtaining the target text corresponding to the initial text character string from the target word vector set. On the one hand, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be selected for classification. On the other hand, the image features and character feature information of the characters in the text are considered, so character features of the text are not omitted; this improves the accuracy of natural language processing and of text classification, and therefore the accuracy of the obtained target text.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a computer program executed by a data processing system for acquiring a target text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The present embodiment provides a data processing system for acquiring a target text. The system includes: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program which, when executed by the processor, performs the following steps, as shown in Fig. 1.
Specifically, the initial text set includes several initial texts, where an initial text is a text that includes abnormal text characters; for example, abnormal text characters are text characters of an advertising nature.
Specifically, the initial image is an image obtained by processing an initial text. Those skilled in the art know that any prior-art method for generating an image from a text falls within the protection scope of the present invention, which is not described in detail herein.
S100, according to any initial text in the initial text set, acquiring the initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the initial character string corresponding to the initial text, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in that character string.
Specifically, the initial text characters include at least Chinese characters, English characters and punctuation characters.
As described above, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be selected for classification.
S200, according to $A$, obtaining the initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$.
Specifically, each initial word vector is obtained by inputting the initial text into a preset language model. Those skilled in the art know that any prior-art method for obtaining a word vector through a language model falls within the protection scope of the present invention, which is not described in detail herein.
Preferably, the preset language model is a BERT model.
S300, according to the initial image corresponding to $A$, obtaining the key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$.
Specifically, the key feature vector includes a first key feature vector or a second key feature vector.
In a specific embodiment, when the key feature vector is the first key feature vector, step S300 further includes obtaining $D_i$ through the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and obtaining the first candidate feature vector set $G = \{G_1, G_2, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, with $G_i = \{G_{i1}, G_{i2}, G_{i3}, G_{i4}, G_{i5}\}$, where $G_{i1}$ is the height of the character detection box corresponding to $A_i$, $G_{i2}$ is the width of that character detection box, $G_{i3}$ is the first vertex coordinate value of that character detection box, $G_{i4}$ is the second vertex coordinate value of that character detection box, and $G_{i5}$ is the color of that character detection box.
Specifically, a first vertex corresponding to the first vertex coordinate value and a second vertex corresponding to the second vertex coordinate value are diagonal vertices.
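As an illustration of the five components of $G_i$ listed in S301, the record below is a minimal sketch and not the patent's data layout. The class name `CandidateFeature`, its field names, the RGB color representation and the assumption of an axis-aligned detection box are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CandidateFeature:
    """Hypothetical container for one entry G_i of the candidate feature set."""
    height: float                   # G_i1: character detection box height
    width: float                    # G_i2: character detection box width
    vertex_a: Tuple[float, float]   # G_i3: first vertex (x, y)
    vertex_b: Tuple[float, float]   # G_i4: diagonally opposite vertex (x, y)
    color: Tuple[int, int, int]     # G_i5: box color, assumed to be RGB

    def is_consistent(self, tol: float = 1e-6) -> bool:
        """Check that the diagonal vertices span the stated width and height.

        Only meaningful under the assumption of an axis-aligned box.
        """
        dx = abs(self.vertex_a[0] - self.vertex_b[0])
        dy = abs(self.vertex_a[1] - self.vertex_b[1])
        return abs(dx - self.width) <= tol and abs(dy - self.height) <= tol


# Made-up values for a single character box.
g = CandidateFeature(height=20.0, width=18.0,
                     vertex_a=(100.0, 40.0), vertex_b=(118.0, 60.0),
                     color=(255, 0, 0))
```

The consistency check only makes use of the fact that the two stored vertices are diagonal; it is a convenience for the sketch, not a step in the described system.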
S303, according to $G_{i1}$ and $G_{i2}$, obtaining a first feature $D_{i1}$.
Specifically, step S303 further includes the following steps:
S3031, acquiring the font-size priority of the first preset font size and a second preset font size list $H = \{H_1, H_2, \ldots, H_x, \ldots, H_p\}$, where $H_x$ is the font-size priority corresponding to the $x$-th second preset font size together with the font-size information corresponding to that second preset font size, $x = 1, 2, \ldots, p$, and $p$ is the number of preset font sizes.
Specifically, when $H$ is sorted from large to small by the font-size priority corresponding to the second preset font sizes, the font-size information corresponding to the font-size priorities is also sorted from large to small; that is, the larger the font-size priority of a preset font size, the larger the font-size information corresponding to that priority.
Further, the font-size information includes a font-size width and a font-size height.
Further, the first preset font size is a preset abnormal font size.
Further, the second preset font size is a preset normal font size. Those skilled in the art know that any prior-art font size falls within the protection scope of the present invention, which is not described in detail herein.
S3033, when $|(G_{i1}/G_{i2}) - \beta| \le \beta_0$, obtaining the size differences $\Delta G_i = \{\Delta G_{i1}, \Delta G_{i2}, \ldots, \Delta G_{ix}, \ldots, \Delta G_{ip}\}$ corresponding to $A_i$, where $\Delta G_{ix}$ is the font-size difference between $A_i$ and $H_x$, $\beta$ is a preset size ratio, and $\beta_0$ is a preset size ratio threshold.
Specifically, $\Delta G_{ix}$ meets the following condition:
$\Delta G_{ix} = |(G_{i1} + G_{i2}) - (H_{x1} + H_{x2})|$, where $H_{x1}$ is the font-size width in the font-size information corresponding to $H_x$, and $H_{x2}$ is the font-size height in that font-size information.
Further, the size ratio is the ratio of the font-size height to the font-size width.
S3035, traversing $\Delta G_i$ and taking the font-size priority corresponding to the smallest font-size difference in $\Delta G_i$ as $D_{i1}$.
S3037, when $|(G_{i1}/G_{i2}) - \beta| > \beta_0$, taking the font-size priority of the first preset font size as $D_{i1}$.
In this way, text characters are classified by their font size into two types: characters assigned the font-size priority of the first preset (abnormal) font size, and characters assigned the font-size priority of one of the second preset (normal) font sizes. This screens out some of the abnormal text characters and provides a decision condition for text classification, which improves the accuracy of text classification and hence of the target text.
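Steps S3031 to S3037 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `font_size_feature`, the tuple layout of each entry of $H$ as (priority, width $H_{x1}$, height $H_{x2}$), and every numeric value are hypothetical.

```python
from typing import Sequence, Tuple

# Hypothetical layout of one entry H_x: (priority, width H_x1, height H_x2).
PresetSize = Tuple[int, float, float]


def font_size_feature(height: float, width: float,
                      presets: Sequence[PresetSize],
                      beta: float, beta0: float,
                      abnormal_priority: int) -> int:
    """Return D_i1 for a box of the given height (G_i1) and width (G_i2)."""
    if width <= 0 or not presets:
        raise ValueError("width must be positive and presets non-empty")
    # S3037: aspect ratio too far from beta -> priority of the abnormal size.
    if abs(height / width - beta) > beta0:
        return abnormal_priority
    # S3033: size difference to every preset normal font size.
    diffs = [abs((height + width) - (w + h)) for _, w, h in presets]
    # S3035: priority of the preset with the smallest difference.
    best = min(range(len(presets)), key=diffs.__getitem__)
    return presets[best][0]


# Made-up presets, sorted from large to small priority and size.
H = [(3, 24.0, 24.0), (2, 16.0, 16.0), (1, 12.0, 12.0)]
normal = font_size_feature(17.0, 16.0, H, beta=1.0, beta0=0.2, abnormal_priority=0)
stretched = font_size_feature(40.0, 16.0, H, beta=1.0, beta0=0.2, abnormal_priority=0)
```

With these made-up values, the 17 by 16 box is closest to the 16 by 16 preset, while the 40 by 16 box fails the ratio test and receives the abnormal priority.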
S305, according to $G_{i3}$ and $G_{i4}$, obtaining a second feature $D_{i2}$.
S3051, obtaining $G_{i3} = (G_{i3}^{1}, G_{i3}^{2})$ and $G_{i4} = (G_{i4}^{1}, G_{i4}^{2})$, where $G_{i3}^{1}$ is the X-axis coordinate value of the pixel point corresponding to $G_{i3}$, $G_{i3}^{2}$ is the Y-axis coordinate value of the pixel point corresponding to $G_{i3}$, $G_{i4}^{1}$ is the X-axis coordinate value of the pixel point corresponding to $G_{i4}$, and $G_{i4}^{2}$ is the Y-axis coordinate value of the pixel point corresponding to $G_{i4}$.
S3053, according to $G_{i3}$ and $G_{i4}$, determining $D_{i2} = ((G_{i3}^{1} + G_{i4}^{1})/2, (G_{i3}^{2} + G_{i4}^{2})/2)$.
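Because the two vertices are diagonal, the formula in S3053 is the midpoint of the diagonal, i.e. the center of the character detection box. A minimal sketch, with the hypothetical function name `box_center`:

```python
from typing import Tuple

Point = Tuple[float, float]


def box_center(vertex_a: Point, vertex_b: Point) -> Point:
    """Midpoint of two diagonal vertices G_i3 and G_i4, used as D_i2."""
    return ((vertex_a[0] + vertex_b[0]) / 2, (vertex_a[1] + vertex_b[1]) / 2)


# Made-up pixel coordinates for one character box.
center = box_center((100.0, 40.0), (118.0, 60.0))
```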
S307, processing $G_{i5}$ to generate a third feature $D_{i3}$. This can be understood as follows: after the background color is removed from $G_{i5}$, the resulting foreground color is used as the third feature. Those skilled in the art know that any prior-art method for removing a background color falls within the protection scope of the present invention, which is not described in detail herein.
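The text leaves the background-removal method open. Purely as an illustration, the sketch below swaps in a simple frequency heuristic: it assumes the box is available as a list of RGB pixels, treats the most frequent color as background, and returns the most frequent remaining color as the foreground. The function name `foreground_color` and this heuristic are assumptions, not something the source prescribes.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

RGB = Tuple[int, int, int]


def foreground_color(pixels: Iterable[RGB]) -> Optional[RGB]:
    """Return a foreground color for D_i3, or None if none can be found.

    Heuristic (an assumption): the most common color is the background,
    and the most common color among the rest is the foreground.
    """
    counts = Counter(pixels)
    if not counts:
        return None
    background, _ = counts.most_common(1)[0]
    del counts[background]          # drop the assumed background color
    if not counts:
        return None                 # box contains a single color only
    return counts.most_common(1)[0][0]


# Made-up pixels: mostly white background with red strokes.
box_pixels = [(255, 255, 255)] * 6 + [(255, 0, 0)] * 3 + [(250, 0, 0)]
d_i3 = foreground_color(box_pixels)
```

A real image would need color quantization before counting, since anti-aliased glyph edges produce many near-identical colors.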
S309, according to $D_{i1}$, $D_{i2}$ and $D_{i3}$, determining $D_i = \{D_{i1}, D_{i2}, D_{i3}\}$.
In this way, feature information such as the position and size of the initial text characters can be obtained from the initial image corresponding to the initial text. Texts can then be screened by their image features, and text characters whose image features differ from the rest of the initial text can be found more quickly, which improves the accuracy of natural language processing and of text classification.
In another specific embodiment, when the key feature vector is the second key feature vector, step S300 further includes obtaining $D_i$ through the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring the second candidate feature vector set $G^{0} = \{G^{0}_{1}, G^{0}_{2}, \ldots, G^{0}_{i}, \ldots, G^{0}_{m}\}$ corresponding to $A$, with $G^{0}_{i} = \{G^{0}_{i1}, G^{0}_{i2}\}$, where $G^{0}_{i1}$ is the first sub-feature vector and $G^{0}_{i2}$ is the second sub-feature vector.
S303, according to $G^{0}_{i1}$, obtaining the first intermediate feature vector $Q^{0}_{i1}$ corresponding to $G^{0}_{i1}$.
Specifically, the feature dimensions of $G^{0}_{i1}$ are the same as those of $G_i$ in the preceding embodiment and are not described in detail herein.
Further, for the method of obtaining $Q^{0}_{i1}$ from $G^{0}_{i1}$, refer to the method of obtaining the first key feature vector, which is not described again herein.
S305, obtaining $G^{0}_{i2} = \{G^{01}_{i2}, G^{02}_{i2}, \ldots, G^{0y}_{i2}, \ldots, G^{0q}_{i2}\}$, where $G^{0y}_{i2}$ is the $y$-th piece of character information corresponding to the character detection box, $y = 1, 2, \ldots, q$, and $q$ is the number of pieces of character information corresponding to the character detection box. Those skilled in the art know that any prior-art character information corresponding to a character detection box falls within the protection scope of the present invention, which is not described in detail herein; for example, the character information includes italics, underlining, bolding and the like.
S307, inputting $G^{0y}_{i2}$ into the classifier corresponding to $G^{0y}_{i2}$ and obtaining the second intermediate feature value $Q^{0y}_{i2}$ corresponding to $G^{0y}_{i2}$, so that a second intermediate feature vector $Q^{0}_{i2} = \{Q^{01}_{i2}, Q^{02}_{i2}, \ldots, Q^{0y}_{i2}, \ldots, Q^{0q}_{i2}\}$ is constructed from all $Q^{0y}_{i2}$. Those skilled in the art know that any prior-art method for obtaining a feature value from a classifier falls within the protection scope of the present invention, which is not described in detail herein.
Further, step S307 further includes the following steps:
When $Q^{0y}_{i2} = 0$, it is determined that the character information exists in the character detection box corresponding to $G^{0y}_{i2}$.
When $Q^{0y}_{i2} = 1$, it is determined that the character information does not exist in the character detection box corresponding to $G^{0y}_{i2}$.
S309, according to $Q^{0}_{i1}$ and $Q^{0}_{i2}$, determining $D_i = \{Q^{0}_{i1}, Q^{0}_{i2}\}$.
In this embodiment, the image features corresponding to the initial text and the character information corresponding to the initial text are combined as the key feature vector corresponding to the initial text. This enriches the dimensions of the word vectors corresponding to the text and avoids omitting character features of the text, which improves the accuracy of natural language processing and of text classification, and therefore the accuracy of the obtained target text.
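The per-attribute classification in S305 to S309 can be sketched as below. Everything concrete here is an assumption made for illustration: the three threshold "classifiers", the raw measurements they take (slant angle, underline coverage, stroke width) and the placeholder value used for $Q^{0}_{i1}$. Only the 0/1 convention is taken from the text, where 0 means the character information is present and 1 means it is absent.

```python
from typing import Callable, List, Sequence, Tuple


def second_intermediate_vector(
    char_info: Sequence[float],
    classifiers: Sequence[Callable[[float], int]],
) -> List[int]:
    """Apply the y-th classifier to the y-th sub-feature to build Q^0_i2."""
    if len(char_info) != len(classifiers):
        raise ValueError("need exactly one classifier per sub-feature")
    return [clf(value) for value, clf in zip(char_info, classifiers)]


# Hypothetical threshold classifiers; 0 = attribute present, 1 = absent.
def italic_classifier(slant_degrees: float) -> int:
    return 0 if slant_degrees > 8.0 else 1


def underline_classifier(underline_coverage: float) -> int:
    return 0 if underline_coverage > 0.5 else 1


def bold_classifier(stroke_width: float) -> int:
    return 0 if stroke_width > 3.0 else 1


q1 = [2, 109.0, 50.0]  # placeholder for Q^0_i1 (made-up values)
q2 = second_intermediate_vector(
    [12.0, 0.1, 3.5],
    [italic_classifier, underline_classifier, bold_classifier],
)
d_i: Tuple[list, list] = (q1, q2)  # S309: D_i = {Q^0_i1, Q^0_i2}
```

In this made-up example the character is classified as italic and bold but not underlined, so `q2` holds 0 for the first and third attribute and 1 for the second.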
S400, according to $B$ and $D$, obtaining the target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$.
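One plausible reading of $U_i = \{B_i, D_i\}$ is per-character concatenation of the word vector and the key feature vector. The sketch below assumes that reading and assumes $D_i$ has already been flattened to a list of numbers; the function name `target_word_vectors` and all values are hypothetical.

```python
from typing import List, Sequence


def target_word_vectors(B: Sequence[Sequence[float]],
                        D: Sequence[Sequence[float]]) -> List[List[float]]:
    """Build U by appending each D_i to the matching B_i."""
    if len(B) != len(D):
        raise ValueError("B and D must contain one vector per character")
    return [list(b) + list(d) for b, d in zip(B, D)]


# Made-up two-character example: 2-dim word vectors, 3-dim key features.
B = [[0.1, 0.2], [0.3, 0.4]]
D = [[2, 109.0, 50.0], [0, 130.0, 50.0]]
U = target_word_vectors(B, D)
```

Concatenation keeps the language-model dimensions untouched and simply appends the image-derived dimensions, so a downstream labeling model sees both kinds of information for every character.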
S500, according to $U$, acquiring the target text corresponding to $A$.
Specifically, step S500 further includes the following steps:
S501, inputting $U$ into a preset labeling model and acquiring the target label list $F = \{F_1, F_2, \ldots, F_i, \ldots, F_m\}$ corresponding to $A$, where $F_i$ is the target label corresponding to $A_i$. Those skilled in the art know that any prior-art method for obtaining a label through a labeling model falls within the protection scope of the present invention, which is not described in detail herein.
S503, when $F_i = 1$, determining that $A_i$ is an abnormal character, and deleting the abnormal characters from the initial text corresponding to $A$ to generate the target text corresponding to $A$.
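The deletion in S503 can be sketched as a simple filter. The labeling model itself is not reproduced here; the label list passed in is made up, and the function name `target_text` is hypothetical.

```python
from typing import Sequence


def target_text(chars: Sequence[str], labels: Sequence[int]) -> str:
    """Drop every character A_i whose target label F_i equals 1."""
    if len(chars) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(c for c, f in zip(chars, labels) if f != 1)


# Made-up mixed-script example: the two Latin characters are labeled abnormal.
cleaned = target_text("好VX的!", [0, 1, 1, 0, 0])
```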
By combining the initial word vector and the key feature vector corresponding to the initial text, the system is not limited to the word vector that a text encoding model produces for each character; it also takes the image features and character features of the characters into account. This enriches the dimensions of the target word vectors so that the obtained vectors carry rich text feature information, which improves the accuracy of text classification and of the obtained target text.
Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.
Claims (10)
1. A data processing system for obtaining a target text, the system comprising: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program that, when executed by the processor, performs the steps of:
S100, according to any initial text in the initial text set, acquiring the initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the initial character string corresponding to the initial text, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in that character string;
S200, according to $A$, obtaining the initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$;
S300, according to the initial image corresponding to $A$, obtaining the key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$;
S400, according to $B$ and $D$, obtaining the target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$;
S500, according to $U$, acquiring the target text corresponding to $A$.
2. The data processing system of claim 1, wherein the initial text characters include at least Chinese characters, English characters, and punctuation characters.
3. The data processing system for obtaining target text according to claim 1, wherein the key feature vector comprises a first key feature vector or a second key feature vector.
4. The data processing system for obtaining a target text according to claim 3, wherein when the key feature vector is the first key feature vector, step S300 further obtains $D_i$ through the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring the first candidate feature vector set $G = \{G_1, G_2, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, with $G_i = \{G_{i1}, G_{i2}, G_{i3}, G_{i4}, G_{i5}\}$, where $G_{i1}$ is the height of the character detection box corresponding to $A_i$, $G_{i2}$ is the width of that character detection box, $G_{i3}$ is the first vertex coordinate value of that character detection box, $G_{i4}$ is the second vertex coordinate value of that character detection box, and $G_{i5}$ is the color of that character detection box;
S303, according to $G_{i1}$ and $G_{i2}$, obtaining a first feature $D_{i1}$;
S305, according to $G_{i3}$ and $G_{i4}$, obtaining a second feature $D_{i2}$;
S307, processing $G_{i5}$ to generate a third feature $D_{i3}$;
S309, according to $D_{i1}$, $D_{i2}$ and $D_{i3}$, determining $D_i = \{D_{i1}, D_{i2}, D_{i3}\}$.
5. The data processing system for obtaining a target text according to claim 4, wherein step S303 further comprises the following steps:
S3031, acquiring the font-size priority of the first preset font size and a second preset font size list $H = \{H_1, H_2, \ldots, H_x, \ldots, H_p\}$, where $H_x$ is the font-size priority corresponding to the $x$-th second preset font size together with the font-size information corresponding to that second preset font size, $x = 1, 2, \ldots, p$, and $p$ is the number of preset font sizes;
S3033, when $|(G_{i1}/G_{i2}) - \beta| \le \beta_0$, obtaining the size differences $\Delta G_i = \{\Delta G_{i1}, \Delta G_{i2}, \ldots, \Delta G_{ix}, \ldots, \Delta G_{ip}\}$ corresponding to $A_i$, where $\Delta G_{ix}$ is the font-size difference between $A_i$ and $H_x$, $\beta$ is a preset size ratio, and $\beta_0$ is a preset size ratio threshold;
S3035, traversing $\Delta G_i$ and taking the font-size priority corresponding to the smallest font-size difference in $\Delta G_i$ as $D_{i1}$;
S3037, when $|(G_{i1}/G_{i2}) - \beta| > \beta_0$, taking the font-size priority of the first preset font size as $D_{i1}$.
6. The data processing system for obtaining target text according to claim 5, wherein the first predetermined font size is a predetermined abnormal font size.
7. The data processing system for obtaining target text as claimed in claim 5, wherein the second predetermined font size is a predetermined normal font size.
8. The data processing system for obtaining target text according to claim 5, wherein the size ratio is the ratio of a font-size height to a font-size width.
9. The data processing system for obtaining a target text according to claim 3, wherein when the key feature vector is the second key feature vector, step S300 further obtains $D_i$ through the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring the second candidate feature vector set $G^{0} = \{G^{0}_{1}, G^{0}_{2}, \ldots, G^{0}_{i}, \ldots, G^{0}_{m}\}$ corresponding to $A$, with $G^{0}_{i} = \{G^{0}_{i1}, G^{0}_{i2}\}$, where $G^{0}_{i1}$ is the first sub-feature vector and $G^{0}_{i2}$ is the second sub-feature vector;
S303, according to $G^{0}_{i1}$, obtaining the first intermediate feature vector $Q^{0}_{i1}$ corresponding to $G^{0}_{i1}$;
S305, obtaining $G^{0}_{i2} = \{G^{01}_{i2}, G^{02}_{i2}, \ldots, G^{0y}_{i2}, \ldots, G^{0q}_{i2}\}$, where $G^{0y}_{i2}$ is the $y$-th piece of character information corresponding to the character detection box, $y = 1, 2, \ldots, q$, and $q$ is the number of pieces of character information corresponding to the character detection box;
S307, inputting $G^{0y}_{i2}$ into the classifier corresponding to $G^{0y}_{i2}$ and obtaining the second intermediate feature value $Q^{0y}_{i2}$ corresponding to $G^{0y}_{i2}$, so that a second intermediate feature vector $Q^{0}_{i2} = \{Q^{01}_{i2}, Q^{02}_{i2}, \ldots, Q^{0y}_{i2}, \ldots, Q^{0q}_{i2}\}$ is constructed from all $Q^{0y}_{i2}$;
S309, according to $Q^{0}_{i1}$ and $Q^{0}_{i2}$, determining $D_i = \{Q^{0}_{i1}, Q^{0}_{i2}\}$.
10. The data processing system for obtaining a target text according to claim 1, wherein the step S500 further comprises the steps of:
S501, inputting $U$ into a preset labeling model and acquiring the target label list $F = \{F_1, F_2, \ldots, F_i, \ldots, F_m\}$ corresponding to $A$, where $F_i$ is the target label corresponding to $A_i$;
S503, when $F_i = 1$, determining that $A_i$ is an abnormal character, and deleting the abnormal characters from the initial text corresponding to $A$ to generate the target text corresponding to $A$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211527410.2A CN115545009B (en) | 2022-12-01 | 2022-12-01 | Data processing system for acquiring target text |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211527410.2A CN115545009B (en) | 2022-12-01 | 2022-12-01 | Data processing system for acquiring target text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115545009A (en) | 2022-12-30 |
CN115545009B (en) | 2023-07-07 |
Family
ID=84721969
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211527410.2A Active CN115545009B (en) | 2022-12-01 | 2022-12-01 | Data processing system for acquiring target text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115545009B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115797849A (en) * | 2023-02-03 | 2023-03-14 | 以萨技术股份有限公司 | Data processing system for determining abnormal behaviors based on images |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110472002A (en) * | 2019-08-14 | 2019-11-19 | 腾讯科技(深圳)有限公司 | A kind of text similarity acquisition methods and device |
WO2020103721A1 (en) * | 2018-11-19 | 2020-05-28 | 腾讯科技(深圳)有限公司 | Information processing method and apparatus, and storage medium |
CN111507350A (en) * | 2020-04-16 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Text recognition method and device |
US20200320348A1 (en) * | 2019-04-04 | 2020-10-08 | Beijing Jingdong Shangke Information Technology Co., Ltd. | System and method for fashion attributes extraction |
CN112446259A (en) * | 2019-09-02 | 2021-03-05 | 深圳中兴网信科技有限公司 | Image processing method, device, terminal and computer readable storage medium |
CN114022882A (en) * | 2022-01-04 | 2022-02-08 | 北京世纪好未来教育科技有限公司 | Text recognition model training method, text recognition device, text recognition equipment and medium |
CN114581918A (en) * | 2021-07-08 | 2022-06-03 | 北京金山数字娱乐科技有限公司 | Text recognition model training method and device |
US20220237376A1 (en) * | 2021-08-25 | 2022-07-28 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method, apparatus, electronic device and storage medium for text classification |
WO2022156066A1 (en) * | 2021-01-19 | 2022-07-28 | 平安科技(深圳)有限公司 | Character recognition method and apparatus, electronic device and storage medium |
WO2022227207A1 (en) * | 2021-04-30 | 2022-11-03 | 平安科技(深圳)有限公司 | Text classification method, apparatus, computer device, and storage medium |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020103721A1 (en) * | 2018-11-19 | 2020-05-28 | 腾讯科技(深圳)有限公司 | Information processing method and apparatus, and storage medium |
US20200320348A1 (en) * | 2019-04-04 | 2020-10-08 | Beijing Jingdong Shangke Information Technology Co., Ltd. | System and method for fashion attributes extraction |
CN110472002A (en) * | 2019-08-14 | 2019-11-19 | 腾讯科技(深圳)有限公司 | A kind of text similarity acquisition methods and device |
CN112446259A (en) * | 2019-09-02 | 2021-03-05 | 深圳中兴网信科技有限公司 | Image processing method, device, terminal and computer readable storage medium |
CN111507350A (en) * | 2020-04-16 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Text recognition method and device |
WO2022156066A1 (en) * | 2021-01-19 | 2022-07-28 | 平安科技(深圳)有限公司 | Character recognition method and apparatus, electronic device and storage medium |
WO2022227207A1 (en) * | 2021-04-30 | 2022-11-03 | 平安科技(深圳)有限公司 | Text classification method, apparatus, computer device, and storage medium |
CN114581918A (en) * | 2021-07-08 | 2022-06-03 | 北京金山数字娱乐科技有限公司 | Text recognition model training method and device |
US20220237376A1 (en) * | 2021-08-25 | 2022-07-28 | Beijing Baidu Netcom Science Technology Co., Ltd. | Method, apparatus, electronic device and storage medium for text classification |
CN114022882A (en) * | 2022-01-04 | 2022-02-08 | 北京世纪好未来教育科技有限公司 | Text recognition model training method, text recognition device, text recognition equipment and medium |
Non-Patent Citations (2)
Title |
---|
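Hu Minghan (胡明涵): "Research on Key Technologies of Domain-Oriented Text Classification and Mining" (面向领域的文本分类与挖掘关键技术研究), China Doctoral Dissertations Full-text Database (Information Science and Technology) * |
Hu Minghan (胡明涵): "Research on Key Technologies of Domain-Oriented Text Classification and Mining" (面向领域的文本分类与挖掘关键技术研究), China Doctoral Dissertations Full-text Database (Information Science and Technology), 15 October 2011 (2011-10-15), pages 1 - 19 * |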
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115797849A (en) * | 2023-02-03 | 2023-03-14 | 以萨技术股份有限公司 | Data processing system for determining abnormal behaviors based on images |
CN115797849B (en) * | 2023-02-03 | 2023-04-28 | 以萨技术股份有限公司 | Data processing system for determining abnormal behavior based on image |
Also Published As
Publication number | Publication date |
---|---|
CN115545009B (en) | 2023-07-07 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US5373566A (en) | Neural network-based diacritical marker recognition system and method | |
JP4504702B2 (en) | Document processing apparatus, document processing method, and document processing program | |
EP2166488B1 (en) | Handwritten word spotter using synthesized typed queries | |
Bataineh et al. | A novel statistical feature extraction method for textual images: Optical font recognition | |
JPH08305803A (en) | Operating method of learning machine of character template set | |
KR100220213B1 (en) | Apparatus and method of character recognition based on 0-1 pattern histogram | |
CN115545009A (en) | Data processing system for acquiring target text | |
CN112966629B (en) | Remote sensing image scene classification method based on image transformation and BoF model | |
CN113282717B (en) | Method and device for extracting entity relationship in text, electronic equipment and storage medium | |
Sharma et al. | Primitive feature-based optical character recognition of the Devanagari script | |
JP2001184509A (en) | Device and method for recognizing pattern and recording medium | |
CN113468979A (en) | Text line language identification method and device and electronic equipment | |
JP4470913B2 (en) | Character string search device and program | |
Soryani et al. | Application of genetic algorithms to feature subset selection in a Farsi OCR | |
CN111814801A (en) | Method for extracting labeled strings in mechanical diagram | |
JP2001337993A (en) | Retrieval device and method for retrieving information by use of character recognition result | |
Abirami et al. | Handwritten mathematical recognition tool | |
Memon et al. | Glyph identification and character recognition for Sindhi OCR | |
CN111488400A (en) | Data classification method, device and computer readable storage medium | |
Xu et al. | Graph-based layout analysis for pdf documents | |
Goswami et al. | High level shape representation in printed Gujarati character | |
Pornpanomchai et al. | Printed Thai character recognition by genetic algorithm | |
Wang et al. | Improvement of zone content classification by using background analysis | |
JP5683287B2 (en) | Pattern recognition apparatus and pattern recognition method | |
Kaur et al. | Adverse conditions and techniques for cross-lingual text recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |