CN115545009A - Data processing system for acquiring target text - Google Patents

Publication number
CN115545009A
Authority
CN
China
Prior art keywords
initial
text
obtaining
feature vector
size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211527410.2A
Other languages
Chinese (zh)
Other versions
CN115545009B (en)
Inventor
刘羽
常鸿宇
刘宸
傅晓航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Yuchen Technology Co Ltd
Original Assignee
Zhongke Yuchen Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2022-12-01
Filing date
2022-12-01
Publication date
2022-12-30
Application filed by Zhongke Yuchen Technology Co Ltd
Priority to CN202211527410.2A
Publication of CN115545009A
Application granted
Publication of CN115545009B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/1444 Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a data processing system for acquiring a target text, comprising a processor and a memory storing a computer program that, when executed by the processor, performs the following steps: obtaining, for any initial text in an initial text set, the initial text character string corresponding to the initial text; obtaining an initial word vector set corresponding to the initial text character string; obtaining key feature vectors corresponding to the initial text character string from the initial image corresponding to the initial text character string; obtaining a target word vector set corresponding to the initial text character string from the initial word vector set and the key feature vectors; and obtaining the target text corresponding to the initial text character string from the target word vector set. The system enriches the vector features and avoids omitting character features, which improves the accuracy of natural language processing and of text classification and, in turn, the accuracy of the obtained target text.

Description

Data processing system for acquiring target text
Technical Field
The invention relates to the technical field of text processing, in particular to a data processing system for acquiring a target text.
Background
With the popularization and development of the Internet, text data has grown explosively, and how to extract meaningful information from massive text data has become a research hotspot in natural language processing. Text classification is a major topic in both natural language processing and text recognition. In recent years it has been applied in fields such as information retrieval, information push, and information filtering, and classifying texts accurately shortens the time needed to obtain the important information in them.
In the prior art, one method for acquiring a target text is as follows: obtain the word vectors of a text; obtain corresponding feature vectors from the written form, radicals, and pinyin of the characters in the text; combine the word vectors and the feature vectors to generate text vectors; and classify the text vectors to identify abnormal text.
This method of classifying texts has the following problems. On one hand, the characters in the text are limited to Chinese characters, which restricts the texts that can be classified. On the other hand, the image features and character feature information of the characters are not considered, so features of the text characters are omitted, which lowers the accuracy of natural language processing, of text classification, and of the obtained target text.
Disclosure of Invention
The invention provides a data processing system for acquiring a target text, which comprises: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program that, when executed by the processor, performs the steps of:
S100, according to any initial text in the initial text set, acquiring an initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the initial text character string, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in the initial text character string corresponding to the initial text.
S200, according to $A$, obtaining an initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$.
S300, according to the initial image corresponding to $A$, obtaining a key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$.
S400, according to $B$ and $D$, obtaining a target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$.
S500, according to $U$, acquiring the target text corresponding to $A$.
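Steps S100 to S500 can be read as a five-stage pipeline over the characters of one initial text. The following Python skeleton is only a sketch under stated assumptions: the function name `acquire_target_text` and the callables `embed`, `key_features`, and `decode` are hypothetical placeholders for the language model, the image-based feature extractor, and the final text-recovery step, none of which this text fixes. Flattening the pair $\{B_i, D_i\}$ into a single list is likewise an assumption.

```python
from typing import Any, Callable, List, Sequence

Vector = List[float]


def acquire_target_text(
    initial_text: str,
    initial_image: Any,
    embed: Callable[[Sequence[str]], List[Vector]],
    key_features: Callable[[Any, Sequence[str]], List[Vector]],
    decode: Callable[[Sequence[str], List[Vector]], str],
) -> str:
    """Hypothetical skeleton of steps S100 to S500 for one initial text."""
    chars = list(initial_text)              # S100: A = {A_1, ..., A_m}
    B = embed(chars)                        # S200: one word vector per character
    D = key_features(initial_image, chars)  # S300: one key feature vector per character
    if not (len(chars) == len(B) == len(D)):
        raise ValueError("A, B and D must all have m entries")
    # S400: U_i = {B_i, D_i}; flattening the pair into one list is an assumption.
    U = [list(b) + list(d) for b, d in zip(B, D)]
    return decode(chars, U)                 # S500: target text obtained from U


# Toy stand-ins, invented only so that the skeleton can be executed.
def toy_embed(chars: Sequence[str]) -> List[Vector]:
    return [[float(ord(c))] for c in chars]


def toy_features(image: Any, chars: Sequence[str]) -> List[Vector]:
    return [[0.0] for _ in chars]


def toy_decode(chars: Sequence[str], U: List[Vector]) -> str:
    # Keeps only characters whose first vector component is below 128.
    return "".join(c for c, u in zip(chars, U) if u[0] < 128)


result = acquire_target_text("a\u4e70b", None, toy_embed, toy_features, toy_decode)
```

Passing the three stages as callables keeps the skeleton independent of any particular model, which matches the text's position that any prior-art method may be substituted at each stage.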
Compared with the prior art, the data processing system for acquiring a target text provided by the invention offers clear technical progress, practicality, and industrial applicability, and has at least the following beneficial effects:
The invention provides a data processing system for acquiring a target text, comprising a processor and a memory storing a computer program which, when executed by the processor, performs the following steps. According to any initial text in the initial text set, an initial text character string corresponding to the initial text is acquired, where the initial text characters include at least Chinese characters, English characters, and punctuation characters. An initial word vector set corresponding to the initial text character string is acquired from the initial text character string. Key feature vectors corresponding to the initial text character string are acquired from the initial image corresponding to the initial text character string; the key feature vectors comprise image features and character feature information of the initial text, where the image features include the positions, font sizes, and colors of the text characters and the character feature information includes underlining, italics, and the like. A target word vector set corresponding to the initial text character string is acquired from the initial word vector set and the key feature vectors, and the target text corresponding to the initial text character string is acquired from the target word vector set. On one hand, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be classified. On the other hand, the image features and character feature information of the characters are taken into account, so no text character features are omitted, which improves the accuracy of natural language processing, of text classification, and of the obtained target text.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the description below are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a computer program executed by a data processing system for acquiring a target text according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The present embodiment provides a data processing system for acquiring a target text. The system includes an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program which, when executed by the processor, performs the steps shown in Fig. 1 and described below.
Specifically, the initial text set includes several initial texts, each of which is a text containing abnormal text characters; for example, the abnormal text characters are text characters of an advertising nature.
Specifically, the initial image is an image obtained by processing the initial text. Any prior-art method for generating an image from a text may be used and falls within the protection scope of the present invention; the details are not described here.
S100, according to any initial text in the initial text set, acquiring an initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the initial text character string, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in the initial text character string corresponding to the initial text.
Specifically, the initial text characters include at least Chinese characters, English characters, and punctuation characters.
As a result, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be classified.
S200, according to $A$, obtaining an initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$.
Specifically, each initial word vector is obtained by inputting the initial text into a preset language model. Any prior-art method for obtaining word vectors through a language model may be used and falls within the protection scope of the present invention; the details are not described here.
Preferably, the preset language model is a BERT model.
S300, according to the initial image corresponding to $A$, obtaining a key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$.
Specifically, the key feature vector includes a first key feature vector or a second key feature vector.
In a specific embodiment, when the key feature vector is the first key feature vector, step S300 further includes obtaining $D_i$ through the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and obtaining a first candidate feature vector set $G = \{G_1, G_2, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, where $G_i = \{G_{i1}, G_{i2}, G_{i3}, G_{i4}, G_{i5}\}$, $G_{i1}$ is the height of the character detection box corresponding to $A_i$, $G_{i2}$ is the width of the character detection box corresponding to $A_i$, $G_{i3}$ is the first vertex coordinate value of the character detection box corresponding to $A_i$, $G_{i4}$ is the second vertex coordinate value of the character detection box corresponding to $A_i$, and $G_{i5}$ is the color of the character detection box corresponding to $A_i$.
Specifically, the first vertex corresponding to the first vertex coordinate value and the second vertex corresponding to the second vertex coordinate value are diagonal vertices.
S303, according to $G_{i1}$ and $G_{i2}$, obtaining a first feature $D_{i1}$.
Specifically, the step S303 further includes the following steps:
S3031, acquiring the font-size priority of the first preset font size and a second preset font size list $H = \{H_1, H_2, \ldots, H_x, \ldots, H_p\}$, where $H_x$ comprises the font-size priority corresponding to the $x$-th second preset font size and the font-size information corresponding to that second preset font size, $x = 1, 2, \ldots, p$, and $p$ is the number of preset font sizes.
Specifically, when $H$ is sorted in descending order of the font-size priority corresponding to the second preset font sizes, the corresponding font-size information is also in descending order; that is, the higher the font-size priority of a preset font size, the larger its font-size information.
Further, the font-size information includes a font-size width and a font-size height.
Further, the first preset font size is a preset abnormal font size.
Further, the second preset font size is a preset normal font size. Any prior-art font size falls within the protection scope of the present invention and is not described here.
S3033, when $|(G_{i1}/G_{i2}) - \beta| \le \beta_0$, obtaining the size differences $\Delta G_i = \{\Delta G_{i1}, \Delta G_{i2}, \ldots, \Delta G_{ix}, \ldots, \Delta G_{ip}\}$ corresponding to $A_i$, where $\Delta G_{ix}$ is the font-size difference between $A_i$ and $H_x$, $\beta$ is a preset size ratio, and $\beta_0$ is a preset size ratio threshold.
Specifically, $\Delta G_{ix}$ satisfies the following condition:
$\Delta G_{ix} = |(G_{i1} + G_{i2}) - (H_{x1} + H_{x2})|$, where $H_{x1}$ is the font-size width in the font-size information corresponding to $H_x$ and $H_{x2}$ is the font-size height in that font-size information.
Further, the size ratio is the ratio of the font-size height to the font-size width.
S3035, traversing $\Delta G_i$ and taking the font-size priority corresponding to the smallest font-size difference in $\Delta G_i$ as $D_{i1}$.
S3037, when $|(G_{i1}/G_{i2}) - \beta| > \beta_0$, taking the font-size priority of the first preset font size as $D_{i1}$.
In this way, text characters are classified by font size into two classes: characters assigned the font-size priority of the first preset (abnormal) font size, and characters assigned the font-size priority of one of the second preset (normal) font sizes. This screens out some of the abnormal text characters and provides a judgment condition for text classification, which improves the accuracy of text classification and, in turn, of the target text.
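The decision made in S3031 to S3037 can be sketched in a few lines of Python. This is an illustration under assumptions, not the patented implementation: the names `PresetSize` and `font_size_feature` are invented here, and the preset list and the values of $\beta$ and $\beta_0$ in the usage lines are made-up numbers.

```python
from typing import NamedTuple, Sequence


class PresetSize(NamedTuple):
    """Hypothetical entry H_x of the second preset font size list."""
    priority: int   # font-size priority
    width: float    # H_x1, font-size width
    height: float   # H_x2, font-size height


def font_size_feature(
    box_height: float,               # G_i1
    box_width: float,                # G_i2
    normal_sizes: Sequence[PresetSize],
    abnormal_priority: int,          # priority of the first preset font size
    beta: float,                     # preset size ratio
    beta0: float,                    # preset size ratio threshold
) -> int:
    """Return D_i1 for one character detection box."""
    if box_width <= 0:
        raise ValueError("box width must be positive")
    if not normal_sizes:
        raise ValueError("at least one preset normal font size is required")
    if abs(box_height / box_width - beta) <= beta0:
        # S3033: delta G_ix = |(G_i1 + G_i2) - (H_x1 + H_x2)| for every H_x.
        # S3035: take the priority of the preset size with the smallest difference.
        closest = min(
            normal_sizes,
            key=lambda h: abs((box_height + box_width) - (h.width + h.height)),
        )
        return closest.priority
    # S3037: aspect ratio too far from beta, so use the abnormal priority.
    return abnormal_priority


# Made-up presets, sorted from high to low priority as the text describes.
H = [PresetSize(3, 24.0, 24.0), PresetSize(2, 16.0, 16.0), PresetSize(1, 12.0, 12.0)]
normal = font_size_feature(17.0, 16.0, H, abnormal_priority=0, beta=1.0, beta0=0.2)
stretched = font_size_feature(40.0, 16.0, H, abnormal_priority=0, beta=1.0, beta0=0.2)
```

In the usage lines, the 17 by 16 box passes the ratio test and is matched to the 16 by 16 preset, while the 40 by 16 box fails the ratio test and receives the abnormal priority.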
S305, according to $G_{i3}$ and $G_{i4}$, obtaining a second feature $D_{i2}$.
S3051, obtaining $G_{i3} = (G^1_{i3}, G^2_{i3})$ and $G_{i4} = (G^1_{i4}, G^2_{i4})$, where $G^1_{i3}$ is the X-axis coordinate value of the pixel point corresponding to $G_{i3}$, $G^2_{i3}$ is the Y-axis coordinate value of the pixel point corresponding to $G_{i3}$, $G^1_{i4}$ is the X-axis coordinate value of the pixel point corresponding to $G_{i4}$, and $G^2_{i4}$ is the Y-axis coordinate value of the pixel point corresponding to $G_{i4}$.
S3053, according to $G_{i3}$ and $G_{i4}$, determining $D_{i2} = ((G^1_{i3} + G^1_{i4})/2, (G^2_{i3} + G^2_{i4})/2)$.
S307, processing $G_{i5}$ to generate a third feature $D_{i3}$; that is, after the background color is removed from $G_{i5}$, the resulting foreground color is used as the third feature. Any prior-art method for removing a background color may be used and falls within the protection scope of the present invention; the details are not described here.
S309, according to $D_{i1}$, $D_{i2}$, and $D_{i3}$, determining $D_i = \{D_{i1}, D_{i2}, D_{i3}\}$.
In this way, feature information such as the position and size of each initial text character can be obtained from the initial image corresponding to the initial text, and the text can be screened by its image features. Text characters whose image features have changed can thus be found more quickly, which improves the accuracy of natural language processing and of text classification.
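Steps S305 to S309 can be illustrated with the short Python sketch below. The midpoint in `box_center` follows the formula of S3053 directly. The background-removal rule in `foreground_color` is an assumption made only for illustration (the most frequent pixel color is treated as background and the next most frequent as foreground), since the text leaves that method open; all function names and the RGB-tuple pixel representation are likewise hypothetical.

```python
from collections import Counter
from typing import Sequence, Tuple

Point = Tuple[float, float]
Color = Tuple[int, int, int]


def box_center(first_vertex: Point, second_vertex: Point) -> Point:
    """S3053: midpoint of the two diagonal vertices G_i3 and G_i4."""
    (x1, y1), (x2, y2) = first_vertex, second_vertex
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def foreground_color(pixels: Sequence[Color]) -> Color:
    """S307 under an assumed rule: drop the most frequent color as background
    and return the most frequent remaining color as the foreground."""
    counts = Counter(pixels)
    if len(counts) < 2:
        raise ValueError("need at least two distinct colors in the box")
    (_background, _), (foreground, _) = counts.most_common(2)
    return foreground


def first_key_feature(
    d_i1: int,
    first_vertex: Point,
    second_vertex: Point,
    pixels: Sequence[Color],
) -> Tuple[int, Point, Color]:
    """S309: D_i = {D_i1, D_i2, D_i3}, returned here as a tuple."""
    return (d_i1, box_center(first_vertex, second_vertex), foreground_color(pixels))


# Made-up box: white background, red glyph, one stray black pixel.
pixels = [(255, 255, 255)] * 6 + [(255, 0, 0)] * 3 + [(0, 0, 0)]
D_i = first_key_feature(2, (10, 20), (26, 52), pixels)
```

The frequency heuristic is fragile for dense glyphs that cover most of the box; any other background-removal method could replace `foreground_color` without changing the rest of the sketch.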
In another specific embodiment, when the key feature vector is the second key feature vector, step S300 further includes obtaining $D_i$ through the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring a second candidate feature vector set $G^0 = \{G^0_1, G^0_2, \ldots, G^0_i, \ldots, G^0_m\}$ corresponding to $A$, where $G^0_i = \{G^0_{i1}, G^0_{i2}\}$, $G^0_{i1}$ is the first sub-feature vector, and $G^0_{i2}$ is the second sub-feature vector.
S303, according to $G^0_{i1}$, obtaining the first intermediate feature vector $Q^0_{i1}$ corresponding to $G^0_{i1}$.
Specifically, the feature dimensions of $G^0_{i1}$ are the same as those of $G_i$ in the previous embodiment and are not described again here.
Further, $Q^0_{i1}$ is obtained from $G^0_{i1}$ in the same way as the first key feature vector is obtained in the previous embodiment; the details are not repeated here.
S305, obtaining $G^0_{i2} = \{G^{01}_{i2}, G^{02}_{i2}, \ldots, G^{0y}_{i2}, \ldots, G^{0q}_{i2}\}$, where $G^{0y}_{i2}$ is the $y$-th item of character information corresponding to the character detection box, $y = 1, 2, \ldots, q$, and $q$ is the number of items of character information corresponding to the character detection box. Any prior-art character information corresponding to a character detection box falls within the protection scope of the present invention and is not described here; for example, the character information includes italics, underlining, bolding, and the like.
S307, inputting $G^{0y}_{i2}$ into the classifier corresponding to $G^{0y}_{i2}$ and acquiring the second intermediate feature value $Q^{0y}_{i2}$ corresponding to $G^{0y}_{i2}$, so that a second intermediate feature vector $Q^0_{i2} = \{Q^{01}_{i2}, Q^{02}_{i2}, \ldots, Q^{0y}_{i2}, \ldots, Q^{0q}_{i2}\}$ is constructed from all $Q^{0y}_{i2}$. Any prior-art method for obtaining feature values from a classifier may be used and falls within the protection scope of the present invention; the details are not described here.
Further, step S307 further includes the following steps:
When $Q^{0y}_{i2} = 0$, it is determined that character information exists in the character detection box corresponding to $G^{0y}_{i2}$.
When $Q^{0y}_{i2} = 1$, it is determined that no character information exists in the character detection box corresponding to $G^{0y}_{i2}$.
S309, according to $Q^0_{i1}$ and $Q^0_{i2}$, determining $D_i = \{Q^0_{i1}, Q^0_{i2}\}$.
In this embodiment, the image features and the character information corresponding to the initial text are combined to form the key feature vectors corresponding to the initial text. This enriches the dimensionality of the word vectors corresponding to the text and avoids omitting text character features, which improves the accuracy of natural language processing, of text classification, and of the obtained target text.
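The per-attribute classification of S305 to S309 can be sketched as follows. Everything concrete here is an assumption for illustration: each item of character information $G^{0y}_{i2}$ is represented as a single float score, `threshold_classifier` is a hypothetical stand-in for whatever trained classifier is used, and the attribute order (italic, underline, bold) is invented. Only the output convention, 0 for "character information exists" and 1 for "does not exist", is taken from the text.

```python
from typing import Callable, List, Sequence, Tuple

Classifier = Callable[[float], int]


def threshold_classifier(threshold: float) -> Classifier:
    """Hypothetical stand-in for a trained per-attribute classifier."""
    def classify(score: float) -> int:
        # Convention from the text: 0 = character information exists, 1 = it does not.
        return 0 if score >= threshold else 1
    return classify


def second_key_feature(
    q1: Sequence[float],               # Q^0_i1, first intermediate feature vector
    g2: Sequence[float],               # G^0_i2, one assumed score per attribute
    classifiers: Sequence[Classifier], # one classifier per attribute
) -> Tuple[List[float], List[int]]:
    """S307 and S309: build Q^0_i2 and return D_i = {Q^0_i1, Q^0_i2}."""
    if len(g2) != len(classifiers):
        raise ValueError("need exactly one classifier per character-information item")
    q2 = [classify(score) for classify, score in zip(classifiers, g2)]
    return (list(q1), q2)


# Assumed attribute order: italic, underline, bold.
classifiers = [threshold_classifier(0.5)] * 3
D_i = second_key_feature([2.0, 18.0, 36.0], [0.9, 0.1, 0.7], classifiers)
```

With these made-up scores the character is reported as italic and bold but not underlined, so the second component of `D_i` is `[0, 1, 0]` under the 0-means-present convention.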
S400, according to $B$ and $D$, obtaining a target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$.
S500, according to $U$, acquiring the target text corresponding to $A$.
Specifically, step S500 further includes the following steps:
S501, inputting $U$ into a preset labeling model, and acquiring a target label list $F = \{F_1, F_2, \ldots, F_i, \ldots, F_m\}$ corresponding to $A$, where $F_i$ is the target label corresponding to $A_i$. Any prior-art method for obtaining labels through a labeling model may be used and falls within the protection scope of the present invention; the details are not described here.
S503, when $F_i = 1$, determining that $A_i$ is an abnormal character, and deleting the abnormal characters from the initial text corresponding to $A$ to generate the target text corresponding to $A$.
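The deletion in S503 amounts to filtering the character string by its label list. A minimal Python sketch follows; the function name `remove_abnormal` and the example labels are hypothetical, and the labels are assumed to come from whatever labeling model is used in S501.

```python
from typing import Sequence


def remove_abnormal(chars: Sequence[str], labels: Sequence[int]) -> str:
    """S503: drop every character A_i whose target label F_i equals 1."""
    if len(chars) != len(labels):
        raise ValueError("need exactly one label per character")
    return "".join(c for c, f in zip(chars, labels) if f != 1)


# Made-up example: the first five characters are labeled abnormal.
target_text = remove_abnormal("Sale!Hi", [1, 1, 1, 1, 1, 0, 0])
```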
By combining the initial word vectors and the key feature vectors corresponding to the initial text, the method is not limited to the per-character word vectors produced by the text encoding model; it also takes the image features and character features of the characters into account. This enriches the dimensionality of the target word vectors so that they carry rich text feature information, which improves the accuracy of text classification and of the obtained target text.
The invention provides a data processing system for acquiring a target text, comprising a processor and a memory storing a computer program that, when executed by the processor, performs the following steps. According to any initial text in the initial text set, an initial text character string corresponding to the initial text is acquired, where the initial text characters include at least Chinese characters, English characters, and punctuation characters. An initial word vector set corresponding to the initial text character string is acquired from the initial text character string. Key feature vectors corresponding to the initial text character string are acquired from the initial image corresponding to the initial text character string; the key feature vectors comprise image features and character feature information of the initial text, where the image features include the positions, font sizes, and colors of the text characters and the character feature information includes underlining, italics, and the like. A target word vector set corresponding to the initial text character string is acquired from the initial word vector set and the key feature vectors, and the target text corresponding to the initial text character string is acquired from the target word vector set. On one hand, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be classified. On the other hand, the image features and character feature information of the characters are taken into account, so no text character features are omitted, which improves the accuracy of natural language processing, of text classification, and of the obtained target text.
Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A data processing system for obtaining a target text, the system comprising: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program that, when executed by the processor, performs the steps of:
S100, according to any initial text in the initial text set, acquiring an initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the initial text character string corresponding to the initial text, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in the initial text character string corresponding to the initial text;
S200, according to $A$, obtaining an initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$;
S300, according to the initial image corresponding to $A$, obtaining a key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$;
S400, according to $B$ and $D$, obtaining a target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$;
S500, according to $U$, acquiring the target text corresponding to $A$.
2. The data processing system for obtaining a target text according to claim 1, wherein the initial text characters include at least Chinese characters, English characters, and punctuation characters.
3. The data processing system for obtaining target text according to claim 1, wherein the key feature vector comprises a first key feature vector or a second key feature vector.
4. The data processing system for obtaining a target text according to claim 3, wherein when the key feature vector is the first key feature vector, step S300 further obtains $D_i$ by the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring a first candidate feature vector set $G = \{G_1, G_2, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, where $G_i = \{G_{i1}, G_{i2}, G_{i3}, G_{i4}, G_{i5}\}$, $G_{i1}$ is the height of the character detection box corresponding to $A_i$, $G_{i2}$ is the width of the character detection box corresponding to $A_i$, $G_{i3}$ is the first vertex coordinate value of the character detection box corresponding to $A_i$, $G_{i4}$ is the second vertex coordinate value of the character detection box corresponding to $A_i$, and $G_{i5}$ is the color of the character detection box corresponding to $A_i$;
S303, according to $G_{i1}$ and $G_{i2}$, obtaining a first feature $D_{i1}$;
S305, according to $G_{i3}$ and $G_{i4}$, obtaining a second feature $D_{i2}$;
S307, processing $G_{i5}$ to generate a third feature $D_{i3}$;
S309, according to $D_{i1}$, $D_{i2}$, and $D_{i3}$, determining $D_i = \{D_{i1}, D_{i2}, D_{i3}\}$.
5. The data processing system for obtaining a target text according to claim 4, wherein the step of S303 further comprises the steps of:
S3031, acquiring the font-size priority of the first preset font size and a second preset font size list $H = \{H_1, H_2, \ldots, H_x, \ldots, H_p\}$, where $H_x$ comprises the font-size priority corresponding to the $x$-th second preset font size and the font-size information corresponding to that second preset font size, $x = 1, 2, \ldots, p$, and $p$ is the number of preset font sizes;
S3033, when $|(G_{i1}/G_{i2}) - \beta| \le \beta_0$, obtaining the size differences $\Delta G_i = \{\Delta G_{i1}, \Delta G_{i2}, \ldots, \Delta G_{ix}, \ldots, \Delta G_{ip}\}$ corresponding to $A_i$, where $\Delta G_{ix}$ is the font-size difference between $A_i$ and $H_x$, $\beta$ is a preset size ratio, and $\beta_0$ is a preset size ratio threshold;
S3035, traversing $\Delta G_i$ and taking the font-size priority corresponding to the smallest font-size difference in $\Delta G_i$ as $D_{i1}$;
S3037, when $|(G_{i1}/G_{i2}) - \beta| > \beta_0$, taking the font-size priority of the first preset font size as $D_{i1}$.
6. The data processing system for obtaining target text according to claim 5, wherein the first predetermined font size is a predetermined abnormal font size.
7. The data processing system for obtaining target text as claimed in claim 5, wherein the second predetermined font size is a predetermined normal font size.
8. The data processing system for obtaining target text according to claim 5, wherein the size ratio is the ratio of a font-size height to a font-size width.
9. The data processing system for obtaining a target text according to claim 3, wherein when the key feature vector is the second key feature vector, step S300 further obtains $D_i$ by the following steps:
S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring a second candidate feature vector set $G^0 = \{G^0_1, G^0_2, \ldots, G^0_i, \ldots, G^0_m\}$ corresponding to $A$, where $G^0_i = \{G^0_{i1}, G^0_{i2}\}$, $G^0_{i1}$ is the first sub-feature vector, and $G^0_{i2}$ is the second sub-feature vector;
S303, according to $G^0_{i1}$, obtaining the first intermediate feature vector $Q^0_{i1}$ corresponding to $G^0_{i1}$;
S305, obtaining $G^0_{i2} = \{G^{01}_{i2}, G^{02}_{i2}, \ldots, G^{0y}_{i2}, \ldots, G^{0q}_{i2}\}$, where $G^{0y}_{i2}$ is the $y$-th item of character information corresponding to the character detection box, $y = 1, 2, \ldots, q$, and $q$ is the number of items of character information corresponding to the character detection box;
S307, inputting $G^{0y}_{i2}$ into the classifier corresponding to $G^{0y}_{i2}$ and acquiring the second intermediate feature value $Q^{0y}_{i2}$ corresponding to $G^{0y}_{i2}$, so that a second intermediate feature vector $Q^0_{i2} = \{Q^{01}_{i2}, Q^{02}_{i2}, \ldots, Q^{0y}_{i2}, \ldots, Q^{0q}_{i2}\}$ is constructed from all $Q^{0y}_{i2}$;
S309, according to $Q^0_{i1}$ and $Q^0_{i2}$, determining $D_i = \{Q^0_{i1}, Q^0_{i2}\}$.
10. The data processing system for obtaining a target text according to claim 1, wherein the step S500 further comprises the steps of:
S501, inputting $U$ into a preset labeling model, and acquiring a target label list $F = \{F_1, F_2, \ldots, F_i, \ldots, F_m\}$ corresponding to $A$, where $F_i$ is the target label corresponding to $A_i$;
S503, when $F_i = 1$, determining that $A_i$ is an abnormal character, and deleting the abnormal characters from the initial text corresponding to $A$ to generate the target text corresponding to $A$.
CN202211527410.2A 2022-12-01 2022-12-01 Data processing system for acquiring target text Active CN115545009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211527410.2A CN115545009B (en) 2022-12-01 2022-12-01 Data processing system for acquiring target text

Publications (2)

Publication Number Publication Date
CN115545009A (en) 2022-12-30
CN115545009B (en) 2023-07-07

Family

ID=84721969

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211527410.2A Active CN115545009B (en) 2022-12-01 2022-12-01 Data processing system for acquiring target text

Country Status (1)

Country Link
CN (1) CN115545009B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472002A (en) * 2019-08-14 2019-11-19 腾讯科技(深圳)有限公司 A kind of text similarity acquisition methods and device
WO2020103721A1 (en) * 2018-11-19 2020-05-28 腾讯科技(深圳)有限公司 Information processing method and apparatus, and storage medium
CN111507350A (en) * 2020-04-16 2020-08-07 腾讯科技(深圳)有限公司 Text recognition method and device
US20200320348A1 (en) * 2019-04-04 2020-10-08 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for fashion attributes extraction
CN112446259A (en) * 2019-09-02 2021-03-05 深圳中兴网信科技有限公司 Image processing method, device, terminal and computer readable storage medium
CN114022882A (en) * 2022-01-04 2022-02-08 北京世纪好未来教育科技有限公司 Text recognition model training method, text recognition device, text recognition equipment and medium
CN114581918A (en) * 2021-07-08 2022-06-03 北京金山数字娱乐科技有限公司 Text recognition model training method and device
US20220237376A1 (en) * 2021-08-25 2022-07-28 Beijing Baidu Netcom Science Technology Co., Ltd. Method, apparatus, electronic device and storage medium for text classification
WO2022156066A1 (en) * 2021-01-19 2022-07-28 平安科技(深圳)有限公司 Character recognition method and apparatus, electronic device and storage medium
WO2022227207A1 (en) * 2021-04-30 2022-11-03 平安科技(深圳)有限公司 Text classification method, apparatus, computer device, and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Hu Minghan: "Research on Key Technologies of Domain-Oriented Text Classification and Mining", China Doctoral Dissertations Full-text Database (Information Science and Technology), 15 October 2011 (2011-10-15), pages 1 - 19 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797849A (en) * 2023-02-03 2023-03-14 以萨技术股份有限公司 Data processing system for determining abnormal behaviors based on images
CN115797849B (en) * 2023-02-03 2023-04-28 以萨技术股份有限公司 Data processing system for determining abnormal behavior based on image

Also Published As

Publication number Publication date
CN115545009B (en) 2023-07-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant