CN115545009A - Data processing system for acquiring target text

Info

Publication number: CN115545009A
Application number: CN202211527410.2A
Authority: CN
Inventors: 刘羽; 常鸿宇; 刘宸; 傅晓航
Original assignee: Zhongke Yuchen Technology Co Ltd
Current assignee: Zhongke Yuchen Technology Co Ltd
Priority date: 2022-12-01
Filing date: 2022-12-01
Publication date: 2022-12-30
Anticipated expiration: 2042-12-01
Also published as: CN115545009B

Abstract

The invention provides a data processing system for acquiring a target text, comprising a processor and a memory storing a computer program that, when executed by the processor, performs the following steps: obtaining, for any initial text in an initial text set, the initial text character string corresponding to that initial text; obtaining an initial word vector set from the initial text character string; obtaining key feature vectors for the initial text character string from the initial image corresponding to it; obtaining a target word vector set from the initial word vector set and the key feature vectors; and obtaining the target text corresponding to the initial text character string from the target word vector set. The system enriches the vector features and avoids omitting character features, which improves the accuracy of natural language processing, of text classification, and of the resulting target text.

Description

Data processing system for acquiring target text

Technical Field

The invention relates to the technical field of text processing, in particular to a data processing system for acquiring a target text.

Background

With the popularization and development of the internet, text data has grown explosively, and extracting meaningful information from massive text data has become a research hotspot in natural language processing. Text classification is a major topic in both natural language processing and text recognition. In recent years it has been applied in fields such as information retrieval, information push, and information filtering, and classifying texts accurately shortens the time needed to obtain the important information in them.

Currently, one prior-art method for acquiring a target text is as follows: obtaining word vectors of a text; obtaining corresponding feature vectors from the written form, radicals, and pinyin of the characters in the text; combining the word vectors and the feature vectors to generate text vectors; and classifying the text vectors to obtain abnormal texts.

In summary, this method of classifying texts has the following problems. On one hand, the characters in the text are limited to Chinese characters, which restricts the texts that can be classified. On the other hand, the image features and character feature information of the characters in the text are not considered, so features of the text characters are omitted; the accuracy of natural language processing is low, the accuracy of text classification is reduced, and the accuracy of the obtained target text is low.

Disclosure of Invention

The invention provides a data processing system for acquiring a target text, which comprises: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program that, when executed by the processor, performs the steps of:

S100, according to any initial text in the initial text set, acquiring the initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the string, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in the initial text character string.

S200, according to $A$, obtaining the initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$.

S300, according to the initial image corresponding to $A$, obtaining the key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$.

S400, according to $B$ and $D$, obtaining the target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$.

S500, acquiring the target text corresponding to $A$ according to $U$.
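
The flow of steps S100 to S500 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names and the `embed`, `extract_features`, and `label` callables are hypothetical stand-ins for the language model, the image-based feature extractor, and the labeling model, and representing $U_i = \{B_i, D_i\}$ as a concatenation is an assumption.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def build_target_vectors(word_vectors: Sequence[Vector],
                         key_features: Sequence[Vector]) -> List[Vector]:
    """S400: pair each B_i with its D_i (concatenation is an assumed encoding)."""
    if len(word_vectors) != len(key_features):
        raise ValueError("B and D must each contain one vector per character")
    return [list(b) + list(d) for b, d in zip(word_vectors, key_features)]


def acquire_target_text(text: str,
                        image: object,
                        embed: Callable[[List[str]], List[Vector]],
                        extract_features: Callable[[object, List[str]], List[Vector]],
                        label: Callable[[List[Vector]], List[int]]) -> str:
    """Run S100-S500 with caller-supplied (hypothetical) models."""
    chars = list(text)                                    # S100: A
    word_vectors = embed(chars)                           # S200: B
    key_features = extract_features(image, chars)         # S300: D
    target_vectors = build_target_vectors(word_vectors,   # S400: U
                                          key_features)
    labels = label(target_vectors)                        # S500: labels F
    # Characters labeled 1 are treated as abnormal and dropped.
    return "".join(c for c, f in zip(chars, labels) if f != 1)
```

Passing the three models in as callables keeps the sketch independent of any particular language model, OCR engine, or labeling model.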

Compared with the prior art, the data processing system for acquiring a target text provided by the invention represents clear technical progress, is practical and of industrial value, and has at least the following beneficial effects:

The invention provides a data processing system for acquiring a target text, comprising a processor and a memory storing a computer program which, when executed by the processor, performs the following steps: according to any initial text in the initial text set, acquiring the initial text character string corresponding to the initial text, where the initial text characters include at least Chinese characters, English characters, and punctuation characters; acquiring an initial word vector set from the initial text character string; acquiring key feature vectors for the initial text character string from the initial image corresponding to it, where the key feature vectors include image features and character feature information of the initial text, the image features include the position, font size, and color of the text characters, and the character feature information includes underline, italics, and the like; acquiring a target word vector set from the initial word vector set and the key feature vectors; and acquiring the target text corresponding to the initial text character string from the target word vector set. On one hand, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be classified. On the other hand, the image features and character feature information of the characters in the text are considered, so character features of the text are not omitted; the accuracy of natural language processing is high, the accuracy of text classification is improved, and the accuracy of the obtained target text is high.

Drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a computer program executed by a data processing system for acquiring a target text according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in other sequences than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

The present embodiment provides a data processing system for acquiring a target text. The system includes: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program which, when executed by the processor, performs the following steps, as shown in Fig. 1:

Specifically, the initial text set includes several initial texts, each of which is a text containing abnormal text characters; for example, the abnormal text characters are text characters of an advertising nature.

Specifically, the initial image is an image obtained by processing the initial text. Those skilled in the art will appreciate that any prior-art method for generating an image from a text falls within the protection scope of the present invention, and details are not described herein.

Specifically, the initial text characters include at least Chinese characters, English characters, and punctuation characters.

As a result, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be classified.

Specifically, each initial word vector is obtained by inputting the initial text into a preset language model. Those skilled in the art will appreciate that any prior-art method for obtaining a word vector through a language model falls within the protection scope of the present invention, and details are not described herein.

Preferably, the preset language model is a BERT model.

Specifically, the key feature vector includes a first key feature vector or a second key feature vector.

In a specific embodiment, when the key feature vector is the first key feature vector, step S300 further includes obtaining $D_i$ through the following steps:

S301, inputting the initial image corresponding to $A$ into a preset OCR model, and obtaining the first candidate feature vector set $G = \{G_1, G_2, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, where $G_i = \{G_{i1}, G_{i2}, G_{i3}, G_{i4}, G_{i5}\}$, $G_{i1}$ is the height of the character detection box corresponding to $A_i$, $G_{i2}$ is the width of that character detection box, $G_{i3}$ is the coordinate value of its first vertex, $G_{i4}$ is the coordinate value of its second vertex, and $G_{i5}$ is the color of the character detection box of $A_i$.

Specifically, the first vertex corresponding to the first vertex coordinate value and the second vertex corresponding to the second vertex coordinate value are diagonal vertices.
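
One way to hold the five components of $G_i$ is sketched below. The class and field names are hypothetical, the RGB tuple for $G_{i5}$ is an assumed color encoding, and deriving height and width from the two diagonal vertices assumes an axis-aligned detection box, which the text does not state.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class CandidateFeature:
    """One entry G_i of the first candidate feature vector set."""
    height: float                      # G_i1: detection box height
    width: float                       # G_i2: detection box width
    first_vertex: Point                # G_i3: (x, y) of the first vertex
    second_vertex: Point               # G_i4: (x, y) of the diagonal vertex
    box_color: Tuple[int, int, int]    # G_i5: assumed to be an RGB triple


def from_diagonal(first_vertex: Point, second_vertex: Point,
                  box_color: Tuple[int, int, int]) -> CandidateFeature:
    """Build G_i from two diagonal vertices of an axis-aligned box (assumption)."""
    (x1, y1), (x2, y2) = first_vertex, second_vertex
    return CandidateFeature(height=abs(y2 - y1), width=abs(x2 - x1),
                            first_vertex=first_vertex,
                            second_vertex=second_vertex,
                            box_color=box_color)
```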

S303, obtaining a first feature $D_{i1}$ according to $G_{i1}$ and $G_{i2}$.

Specifically, the step S303 further includes the following steps:

S3031, acquiring the font size priority of the first preset font size and a second preset font size list $H = \{H_1, H_2, \ldots, H_x, \ldots, H_p\}$, where $H_x$ comprises the font size priority of the $x$-th second preset font size and the size information of that font size, $x = 1, 2, \ldots, p$, and $p$ is the number of preset font sizes in $H$.

Specifically, when $H$ is sorted in descending order of font size priority, the corresponding size information is also in descending order; that is, a higher font size priority corresponds to a larger font size.

Further, the font size information includes a font size width and a font size height.

Further, the first preset font size is a preset abnormal font size.

Further, the second preset font size is a preset normal font size, and as known in the art, any font size in the prior art belongs to the protection scope of the present invention, and is not described herein again.

S3033, when $|G_{i1}/G_{i2} - \beta| \le \beta^{0}$, obtaining the size difference list $\Delta G_i = \{\Delta G_{i1}, \Delta G_{i2}, \ldots, \Delta G_{ix}, \ldots, \Delta G_{ip}\}$ corresponding to $A_i$, where $\Delta G_{ix}$ is the font size difference between $A_i$ and $H_x$, $\beta$ is a preset size ratio, and $\beta^{0}$ is a preset size ratio threshold.

Specifically, $\Delta G_{ix}$ satisfies the following condition:

$\Delta G_{ix} = |(G_{i1} + G_{i2}) - (H_{x1} + H_{x2})|$, where $H_{x1}$ is the font width in the size information of $H_x$ and $H_{x2}$ is the font height in that size information.

Further, the size ratio is the ratio of font height to font width, consistent with $G_{i1}/G_{i2}$ being the ratio of detection box height to detection box width.

S3035, traversing $\Delta G_i$ and taking the font size priority corresponding to the smallest font size difference in $\Delta G_i$ as $D_{i1}$.

S3037, when $|G_{i1}/G_{i2} - \beta| > \beta^{0}$, taking the font size priority of the first preset font size as $D_{i1}$.
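
Steps S3031 to S3037 can be sketched as below. The names `PresetSize` and `first_feature` and all numeric values in the usage note are hypothetical; only the ratio test and the size difference formula come from the text above.

```python
from typing import NamedTuple, Sequence


class PresetSize(NamedTuple):
    """One entry H_x: priority plus size information (width H_x1, height H_x2)."""
    priority: int
    width: float
    height: float


def first_feature(box_height: float, box_width: float,
                  normal_sizes: Sequence[PresetSize],
                  abnormal_priority: int,
                  beta: float, beta0: float) -> int:
    """Return D_i1 for a detection box of the given height (G_i1) and width (G_i2)."""
    if box_width <= 0:
        raise ValueError("box_width must be positive")
    if not normal_sizes:
        raise ValueError("normal_sizes (list H) must not be empty")
    # S3037: aspect ratio too far from beta -> priority of the first preset size.
    if abs(box_height / box_width - beta) > beta0:
        return abnormal_priority
    # S3033: size difference against every second preset font size.
    diffs = [abs((box_height + box_width) - (h.width + h.height))
             for h in normal_sizes]
    # S3035: priority of the preset size with the smallest difference.
    best = min(range(len(diffs)), key=diffs.__getitem__)
    return normal_sizes[best].priority
```

For example, with preset sizes of 24, 16, and 12 units (priorities 3, 2, 1), $\beta = 1.0$, and $\beta^{0} = 0.2$, a 17 by 16 box has size differences 15, 1, and 9 and so receives priority 2, while a 40 by 10 box fails the ratio test and receives the abnormal priority.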

In this way, text characters are classified by font size into two types: characters assigned the font size priority of the first preset font size, and characters matched to a second preset font size. This screens out some abnormal text characters and provides a criterion for text classification, which improves the accuracy of text classification and therefore of the target text.

S305, obtaining a second feature $D_{i2}$ according to $G_{i3}$ and $G_{i4}$.

S3051, obtaining $G_{i3} = (G^{1}_{i3}, G^{2}_{i3})$ and $G_{i4} = (G^{1}_{i4}, G^{2}_{i4})$, where $G^{1}_{i3}$ is the X-axis coordinate value of the pixel point corresponding to $G_{i3}$, $G^{2}_{i3}$ is the Y-axis coordinate value of the pixel point corresponding to $G_{i3}$, $G^{1}_{i4}$ is the X-axis coordinate value of the pixel point corresponding to $G_{i4}$, and $G^{2}_{i4}$ is the Y-axis coordinate value of the pixel point corresponding to $G_{i4}$.

S3053, according to $G_{i3}$ and $G_{i4}$, determining $D_{i2} = ((G^{1}_{i3} + G^{1}_{i4})/2, (G^{2}_{i3} + G^{2}_{i4})/2)$.
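
Because the two vertices are diagonal, the formula in S3053 is the midpoint of the diagonal, that is, the center of the character detection box. A minimal sketch, with a hypothetical function name:

```python
from typing import Tuple

Point = Tuple[float, float]


def second_feature(first_vertex: Point, second_vertex: Point) -> Point:
    """Return D_i2, the midpoint of the diagonal from G_i3 to G_i4."""
    (x3, y3), (x4, y4) = first_vertex, second_vertex
    return ((x3 + x4) / 2, (y3 + y4) / 2)
```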

S307, processing $G_{i5}$ to generate a third feature $D_{i3}$. That is, after the background color is removed from $G_{i5}$, the resulting foreground color is taken as the third feature. Those skilled in the art will appreciate that any prior-art method for removing a background color falls within the protection scope of the present invention, and details are not described herein.
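
The text leaves the background removal method open. One simple possibility, shown purely as an assumption, treats the most frequent pixel color inside the detection box as background and takes the most frequent remaining color as the foreground. The function name and the pixel-list input are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

Color = Tuple[int, int, int]


def third_feature(pixels: Sequence[Color]) -> Optional[Color]:
    """Return D_i3 as the dominant non-background color, or None if the box is uniform."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    counts = Counter(pixels)
    # Assumption: the most frequent color in the box is the background.
    background, _ = counts.most_common(1)[0]
    remaining = [(color, n) for color, n in counts.items() if color != background]
    if not remaining:
        return None
    # The most frequent remaining color is taken as the foreground.
    return max(remaining, key=lambda item: item[1])[0]
```

This heuristic breaks down for bold or very large glyphs that fill most of the box, so a real system would likely need a more robust segmentation step.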

S309, according to $D_{i1}$, $D_{i2}$, and $D_{i3}$, determining $D_i = \{D_{i1}, D_{i2}, D_{i3}\}$.

In this way, feature information such as the position and size of each initial text character can be obtained from the initial image corresponding to the initial text, and the text can be screened through its image features. Text characters whose image features have changed can be found more quickly, which makes natural language processing more accurate and improves the accuracy of text classification.

In another specific embodiment, when the key feature vector is the second key feature vector, step S300 further includes obtaining $D_i$ through the following steps:

S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring the second candidate feature vector set $G^{0} = \{G^{0}_{1}, G^{0}_{2}, \ldots, G^{0}_{i}, \ldots, G^{0}_{m}\}$ corresponding to $A$, where $G^{0}_{i} = \{G^{0}_{i1}, G^{0}_{i2}\}$, $G^{0}_{i1}$ is the first sub-feature vector, and $G^{0}_{i2}$ is the second sub-feature vector.

S303, according to $G^{0}_{i1}$, obtaining the first intermediate feature vector $Q^{0}_{i1}$ corresponding to $G^{0}_{i1}$.

Specifically, the feature dimensions of $G^{0}_{i1}$ are the same as those of $G_i$ in the previous embodiment, and are not described in detail herein.

Further, for the method of obtaining $Q^{0}_{i1}$ from $G^{0}_{i1}$, reference may be made to the method of obtaining the first key feature vector, which is not described herein again.

S305, obtaining $G^{0}_{i2} = \{G^{01}_{i2}, G^{02}_{i2}, \ldots, G^{0y}_{i2}, \ldots, G^{0q}_{i2}\}$, where $G^{0y}_{i2}$ is the $y$-th item of character information corresponding to the character detection box, $y = 1, 2, \ldots, q$, and $q$ is the number of items of character information corresponding to the character detection box. Those skilled in the art will appreciate that all prior-art character information corresponding to a character detection box falls within the protection scope of the present invention, and details are not described herein; for example, the character information includes italics, underline, bold, and the like.

S307, inputting $G^{0y}_{i2}$ into the classifier corresponding to $G^{0y}_{i2}$ to obtain the second intermediate feature value $Q^{0y}_{i2}$ corresponding to $G^{0y}_{i2}$, and constructing from all $Q^{0y}_{i2}$ the second intermediate feature vector $Q^{0}_{i2} = \{Q^{01}_{i2}, Q^{02}_{i2}, \ldots, Q^{0y}_{i2}, \ldots, Q^{0q}_{i2}\}$. Those skilled in the art will appreciate that prior-art methods for obtaining feature values from classifiers all fall within the protection scope of the present invention, and details are not described herein.

Further, the step S307 further includes the steps of:

When $Q^{0y}_{i2} = 0$, determining that the character information exists in the character detection box corresponding to $G^{0y}_{i2}$.

When $Q^{0y}_{i2} = 1$, determining that the character information does not exist in the character detection box corresponding to $G^{0y}_{i2}$.

S309, according to $Q^{0}_{i1}$ and $Q^{0}_{i2}$, determining $D_i = \{Q^{0}_{i1}, Q^{0}_{i2}\}$.
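
Steps S307 and S309 of this embodiment can be sketched as follows. The function names are hypothetical, each classifier is modeled as a plain callable returning 0 or 1, and concatenation is an assumed encoding of $\{Q^{0}_{i1}, Q^{0}_{i2}\}$. The sketch follows the convention stated above, in which 0 means the character information is present and 1 means it is absent.

```python
from typing import Callable, List, Sequence


def second_intermediate_vector(char_info: Sequence[object],
                               classifiers: Sequence[Callable[[object], int]]
                               ) -> List[int]:
    """S307: run each G^{0y}_{i2} through its own classifier to get Q^{0}_{i2}."""
    if len(char_info) != len(classifiers):
        raise ValueError("one classifier is required per item of character information")
    values = []
    for info, classify in zip(char_info, classifiers):
        q = classify(info)
        if q not in (0, 1):
            raise ValueError("classifier output must be 0 (present) or 1 (absent)")
        values.append(q)
    return values


def second_key_feature(q_i1: Sequence[float], q_i2: Sequence[int]) -> List[float]:
    """S309: combine Q^{0}_{i1} and Q^{0}_{i2} into D_i (concatenation assumed)."""
    return list(q_i1) + list(q_i2)
```

As a usage illustration, three hypothetical classifiers for italics, underline, and bold could each threshold a score at 0.5 and return 0 when the style is detected.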

In this embodiment, the image features and the character information corresponding to the initial text are combined as the key feature vector corresponding to the initial text. This enriches the dimensions of the word vectors corresponding to the text and avoids omitting character features of the text, so the accuracy of natural language processing is high, the accuracy of text classification is improved, and the accuracy of the obtained target text is high.

S500, acquiring the target text corresponding to $A$ according to $U$.

Specifically, the step S500 further includes the following steps:

S501, inputting $U$ into a preset labeling model, and acquiring the target label list $F = \{F_1, F_2, \ldots, F_i, \ldots, F_m\}$ corresponding to $A$, where $F_i$ is the target label corresponding to $A_i$. Those skilled in the art will appreciate that any prior-art method for obtaining a label through a labeling model falls within the protection scope of the present invention, and details are not described herein.

S503, when $F_i = 1$, determining that $A_i$ is an abnormal character, and deleting the abnormal characters from the initial text corresponding to $A$ to generate the target text corresponding to $A$.
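
Step S503 amounts to filtering the character string by label. The sketch below uses a hypothetical function name and additionally returns the positions of the deleted characters, which the text does not require but which is convenient for inspection.

```python
from typing import List, Sequence, Tuple


def delete_abnormal(chars: Sequence[str],
                    labels: Sequence[int]) -> Tuple[str, List[int]]:
    """Drop every A_i whose label F_i equals 1; return the target text and removed indices."""
    if len(chars) != len(labels):
        raise ValueError("A and F must have the same length m")
    removed = [i for i, f in enumerate(labels) if f == 1]
    target = "".join(c for c, f in zip(chars, labels) if f != 1)
    return target, removed
```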

By combining the initial word vector and the key feature vector corresponding to the initial text, the method is not limited to the per-character word vectors produced by the text encoding model. It also takes the image features and character features of the characters into account, which enriches the dimensions of the target word vectors so that they carry rich text feature information. This improves the accuracy of text classification and of the obtained target text.

The invention provides a data processing system for acquiring a target text, comprising a processor and a memory storing a computer program that, when executed by the processor, performs the following steps: according to any initial text in the initial text set, acquiring the initial text character string corresponding to the initial text, where the initial text characters include at least Chinese characters, English characters, and punctuation characters; acquiring an initial word vector set from the initial text character string; acquiring key feature vectors for the initial text character string from the initial image corresponding to it, where the key feature vectors include image features and character feature information of the initial text, the image features include the position, font size, and color of the text characters, and the character feature information includes underline, italics, and the like; acquiring a target word vector set from the initial word vector set and the key feature vectors; and acquiring the target text corresponding to the initial text character string from the target word vector set. On one hand, the characters in the text are not limited to Chinese characters, which reduces the restriction on the texts that can be classified. On the other hand, the image features and character feature information of the characters in the text are considered, so text character features are not omitted; the accuracy of natural language processing is high, the accuracy of text classification is improved, and the accuracy of the obtained target text is high.

Although some specific embodiments of the present invention have been described in detail by way of illustration, it should be understood by those skilled in the art that the above illustration is only for the purpose of illustration and is not intended to limit the scope of the invention. It will also be appreciated by those skilled in the art that various modifications may be made to the embodiments without departing from the scope and spirit of the invention. The scope of the invention is defined by the appended claims.

Claims

1. A data processing system for obtaining a target text, the system comprising: an initial text set, an initial image corresponding to each initial text in the initial text set, a processor, and a memory storing a computer program that, when executed by the processor, performs the steps of:

S100, according to any initial text in the initial text set, acquiring the initial text character string $A = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ corresponding to the initial text, where $A_i$ is the $i$-th initial text character in the initial text character string corresponding to the initial text, $i = 1, 2, \ldots, m$, and $m$ is the number of initial text characters in the initial text character string corresponding to the initial text;

S200, according to $A$, obtaining the initial word vector set $B = \{B_1, B_2, \ldots, B_i, \ldots, B_m\}$ corresponding to $A$, where $B_i$ is the initial word vector corresponding to $A_i$;

S300, according to the initial image corresponding to $A$, obtaining the key feature vector set $D = \{D_1, D_2, \ldots, D_i, \ldots, D_m\}$ corresponding to $A$, where $D_i$ is the key feature vector corresponding to $A_i$;

S400, according to $B$ and $D$, obtaining the target word vector set $U = \{U_1, U_2, \ldots, U_i, \ldots, U_m\}$ corresponding to $A$, where $U_i = \{B_i, D_i\}$;

S500, acquiring the target text corresponding to $A$ according to $U$.

2. The data processing system for obtaining a target text according to claim 1, wherein the initial text characters include at least Chinese characters, English characters, and punctuation characters.

3. The data processing system for obtaining target text according to claim 1, wherein the key feature vector comprises a first key feature vector or a second key feature vector.

4. The data processing system for obtaining a target text according to claim 3, wherein when the key feature vector is the first key feature vector, step S300 further comprises obtaining $D_i$ through the following steps:

S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring the first candidate feature vector set $G = \{G_1, G_2, \ldots, G_i, \ldots, G_m\}$ corresponding to $A$, where $G_i = \{G_{i1}, G_{i2}, G_{i3}, G_{i4}, G_{i5}\}$, $G_{i1}$ is the height of the character detection box corresponding to $A_i$, $G_{i2}$ is the width of that character detection box, $G_{i3}$ is the coordinate value of its first vertex, $G_{i4}$ is the coordinate value of its second vertex, and $G_{i5}$ is the color of the character detection box of $A_i$;

S303, obtaining a first feature $D_{i1}$ according to $G_{i1}$ and $G_{i2}$;

S305, obtaining a second feature $D_{i2}$ according to $G_{i3}$ and $G_{i4}$;

S307, processing $G_{i5}$ to generate a third feature $D_{i3}$.

5. The data processing system for obtaining a target text according to claim 4, wherein the step of S303 further comprises the steps of:

S3031, acquiring the font size priority of the first preset font size and a second preset font size list $H = \{H_1, H_2, \ldots, H_x, \ldots, H_p\}$, where $H_x$ comprises the font size priority of the $x$-th second preset font size and the size information of that font size, $x = 1, 2, \ldots, p$, and $p$ is the number of preset font sizes;

S3033, when $|G_{i1}/G_{i2} - \beta| \le \beta^{0}$, obtaining the size difference list $\Delta G_i = \{\Delta G_{i1}, \Delta G_{i2}, \ldots, \Delta G_{ix}, \ldots, \Delta G_{ip}\}$ corresponding to $A_i$, where $\Delta G_{ix}$ is the font size difference between $A_i$ and $H_x$, $\beta$ is a preset size ratio, and $\beta^{0}$ is a preset size ratio threshold;

S3035, traversing $\Delta G_i$ and taking the font size priority corresponding to the smallest font size difference in $\Delta G_i$ as $D_{i1}$.

6. The data processing system for obtaining target text according to claim 5, wherein the first predetermined font size is a predetermined abnormal font size.

7. The data processing system for obtaining target text as claimed in claim 5, wherein the second predetermined font size is a predetermined normal font size.

8. The data processing system for obtaining target text according to claim 5, wherein the size ratio is a ratio between a font size height and a font size width.

9. The data processing system for obtaining a target text according to claim 3, wherein when the key feature vector is the second key feature vector, step S300 further comprises obtaining $D_i$ through the following steps:

S301, inputting the initial image corresponding to $A$ into a preset OCR model, and acquiring the second candidate feature vector set $G^{0} = \{G^{0}_{1}, G^{0}_{2}, \ldots, G^{0}_{i}, \ldots, G^{0}_{m}\}$ corresponding to $A$, where $G^{0}_{i} = \{G^{0}_{i1}, G^{0}_{i2}\}$, $G^{0}_{i1}$ is the first sub-feature vector, and $G^{0}_{i2}$ is the second sub-feature vector;

S303, according to $G^{0}_{i1}$, obtaining the first intermediate feature vector $Q^{0}_{i1}$ corresponding to $G^{0}_{i1}$;

S305, obtaining $G^{0}_{i2} = \{G^{01}_{i2}, G^{02}_{i2}, \ldots, G^{0y}_{i2}, \ldots, G^{0q}_{i2}\}$, where $G^{0y}_{i2}$ is the $y$-th item of character information corresponding to the character detection box, $y = 1, 2, \ldots, q$, and $q$ is the number of items of character information corresponding to the character detection box;

S307, inputting $G^{0y}_{i2}$ into the classifier corresponding to $G^{0y}_{i2}$ to obtain the second intermediate feature value $Q^{0y}_{i2}$ corresponding to $G^{0y}_{i2}$, and constructing from all $Q^{0y}_{i2}$ the second intermediate feature vector $Q^{0}_{i2} = \{Q^{01}_{i2}, Q^{02}_{i2}, \ldots, Q^{0y}_{i2}, \ldots, Q^{0q}_{i2}\}$;

S309, according to $Q^{0}_{i1}$ and $Q^{0}_{i2}$, determining $D_i = \{Q^{0}_{i1}, Q^{0}_{i2}\}$.

10. The data processing system for obtaining a target text according to claim 1, wherein the step S500 further comprises the steps of:

S501, inputting $U$ into a preset labeling model, and acquiring the target label list $F = \{F_1, F_2, \ldots, F_i, \ldots, F_m\}$ corresponding to $A$, where $F_i$ is the target label corresponding to $A_i$;

S503, when $F_i = 1$, determining that $A_i$ is an abnormal character, and deleting the abnormal characters from the initial text corresponding to $A$ to generate the target text corresponding to $A$.