CN111738326B - Sentence granularity annotation training sample generation method and device - Google Patents

Publication number
CN111738326B
Authority
CN
China
Prior art keywords
frame
initial
granularity
training sample
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010551112.1A
Other languages
Chinese (zh)
Other versions
CN111738326A (en
Inventor
卢健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202010551112.1A
Publication of CN111738326A
Application granted
Publication of CN111738326B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a device for generating sentence-granularity annotation training samples. The method comprises: acquiring a word-granularity annotation training sample; determining the start boxes and non-start boxes in the word-granularity annotation training sample; starting a loop with a start box as the current box; in each iteration, searching for non-start boxes that satisfy a preset position condition with respect to the current box, determining which of the found non-start boxes is closest to the current box, and making that box the current box for the next iteration; and generating a sentence-granularity annotation sentence box from the start box and the non-start boxes determined for it by the loop, thereby generating a sentence-granularity annotation training sample corresponding to the word-granularity annotation training sample. The invention thus provides a way to generate sentence-granularity annotation training samples from word-granularity annotation training samples, which improves the efficiency of generating sentence-granularity annotation training samples.

Description

Sentence granularity annotation training sample generation method and device
Technical Field
The invention relates to the field of artificial intelligence, in particular to a sentence granularity annotation training sample generation method and device.
Background
An OCR (Optical Character Recognition) task generally includes at least two stages: text localization and text recognition. Text localization is the first stage: the positions of the characters in the picture are detected, the characters are cropped out of the original picture, and the crops are sent to a downstream recognition model. The downstream recognition model may be a classification model or a short-text sequence recognition model. Briefly, a classification model recognizes short text word by word and then assembles the words into sentences, whereas a sequence recognition model recognizes a whole sentence at once and is more accurate than a classification model. The choice of recognition model determines the granularity of the crops that the upstream text localization model must output: word-granularity crops for a classification model, and sentence-granularity crops for a sequence recognition model.
Training text localization models of different granularities requires training samples of different granularities. Typically, annotators label training samples at word granularity. To use the more accurate sequence recognition model, however, training samples at sentence granularity are needed, and having annotators re-annotate the samples at sentence granularity is impractical because it incurs additional labor cost. How to generate sentence-granularity annotation training samples from word-granularity annotation training samples is therefore a technical problem to be solved in the art.
Disclosure of Invention
The invention provides a sentence granularity annotation training sample generation method and device to solve the technical problem described in the Background.
In order to achieve the above object, according to one aspect of the present invention, there is provided a sentence granularity annotation training sample generation method, the method comprising:
acquiring the coordinate information of each text box in the word-granularity annotation training sample and the number of characters in each text box;
determining start boxes and non-start boxes among all the text boxes according to the numbers of characters and the coordinate information;
starting a loop with a start box as the current box; in each iteration, searching for non-start boxes that satisfy a preset position condition with respect to the current box, determining which of the found non-start boxes is closest to the current box, and making that non-start box the current box for the next iteration; stopping the loop if no non-start box satisfying the preset position condition with respect to the current box can be found;
generating a sentence-granularity annotation sentence box from the start box and the non-start boxes determined for it by the loop, so as to generate a sentence-granularity annotation training sample corresponding to the word-granularity annotation training sample.
Optionally, determining the start boxes and non-start boxes among all the text boxes according to the numbers of characters and the coordinate information includes:
determining a text box whose number of characters equals a preset value to be a start box; and
determining a text box that has no other text box within a preset range to its left to be a start box.
Optionally, the preset position condition includes a first condition and a second condition;
the first condition includes: the left abscissa of the non-start box is greater than the average of the left and right abscissas of the current box;
the second condition includes: the lower-side height of the non-start box is less than or equal to the upper-side height of the current box, and/or the upper-side height of the non-start box is greater than or equal to the lower-side height of the current box.
Optionally, the sentence granularity annotation training sample generation method further includes:
determining, according to the coordinate information of each start box, which start boxes lie in the same row;
before the non-start box closest to the current box is made the current box, the method further comprises:
judging whether another start box, in the same row as the start box with which the loop was started, lies between that start box and the closest non-start box;
if not, making the closest non-start box the current box.
Optionally, generating the sentence-granularity annotation sentence box from the start box and the non-start boxes determined for it by the loop includes:
generating a bounding rectangle that contains the start box and the non-start boxes determined for it by the loop, determining the coordinate information of the bounding rectangle, and generating the sentence-granularity annotation sentence box.
To achieve the above object, according to another aspect of the present invention, there is provided a sentence granularity annotation training sample generation apparatus, comprising:
a word-granularity annotation training sample acquisition unit, configured to acquire the coordinate information of each text box in the word-granularity annotation training sample and the number of characters in each text box;
a start box determining unit, configured to determine start boxes and non-start boxes among all the text boxes according to the numbers of characters and the coordinate information;
a non-start box query unit, configured to start a loop with a start box as the current box; in each iteration, to search for non-start boxes that satisfy a preset position condition with respect to the current box, determine which of the found non-start boxes is closest to the current box, and make that non-start box the current box for the next iteration; and to stop the loop if no non-start box satisfying the preset position condition with respect to the current box can be found;
and a sentence-granularity annotation training sample generation unit, configured to generate a sentence-granularity annotation sentence box from the start box and the non-start boxes determined for it by the loop, so as to generate a sentence-granularity annotation training sample corresponding to the word-granularity annotation training sample.
To achieve the above object, according to another aspect of the present invention, there is also provided a computer device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the sentence-granularity annotation training sample generation method described above when the computer program is executed.
To achieve the above object, according to another aspect of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed in a computer processor, implements the steps in the sentence-granularity annotation training sample generation method described above.
The beneficial effects of the invention are as follows: starting from a word-granularity annotation training sample, the method determines the start boxes and non-start boxes among all the text boxes in the sample, searches for the non-start boxes corresponding to each start box, and finally generates a sentence-granularity annotation sentence box from each start box and the non-start boxes found for it. A sentence-granularity annotation training sample is thereby generated from the word-granularity annotation training sample, which improves the efficiency of generating sentence-granularity annotation training samples and, in turn, the training efficiency and accuracy of the sequence recognition model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. In the drawings:
FIG. 1 is a flowchart of a method for generating a sentence granularity annotation training sample according to an embodiment of the present invention;
FIG. 2 is a flowchart of determining a start box according to an embodiment of the present invention;
FIG. 3 is a flowchart of determining the non-start box adjacent to the right of a start box according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a word-granularity annotation training sample according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of text box coordinate information according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of sentence granularity annotation training samples generated by an embodiment of the present invention;
FIG. 7 is a block diagram of a device for generating a training sample for sentence granularity annotation according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a computer device according to an embodiment of the invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
It is noted that the terms "comprises" and "comprising," and any variations thereof, in the description and claims of the present invention and in the foregoing figures, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
Fig. 1 is a flowchart of a sentence granularity annotation training sample generation method according to an embodiment of the present invention, and as shown in fig. 1, the sentence granularity annotation training sample generation method of the present embodiment includes steps S101 to S104.
Step S101, acquiring the coordinate information of each text box in the word-granularity annotation training sample and the number of characters in each text box.
In the embodiment of the invention, a word-granularity annotation training sample is a picture in which the text to be recognized has been annotated. Word-granularity annotation and sentence-granularity annotation differ in that the former annotates each word to be recognized in the picture, whereas the latter annotates each sentence to be recognized in the picture. Existing text recognition models include classification models, trained on word-granularity annotation training samples, and sequence recognition models, trained on sentence-granularity annotation training samples. A classification model recognizes the words in a picture one by one and then assembles them into sentences; a sequence recognition model recognizes a whole sentence at once and is more accurate than a classification model.
In an alternative embodiment of the present invention, the word-granularity and sentence-granularity annotation training samples may relate to a variety of text recognition scenarios, for example recognizing date data or recognizing structured data filled in by a user. The following embodiments take the recognition of date data as an example, but the present invention is not limited thereto.
Fig. 4 is a schematic diagram of a word-granularity annotation training sample for a date recognition scenario. As shown in fig. 4, the word-granularity annotation training sample annotates the year, month and day of each date with text boxes. In the embodiment of the invention, the word-granularity annotation training sample is also annotated with the coordinate information of each text box and the number of characters in each text box.
In an alternative embodiment of the present invention, the coordinate information of a text box may be represented by the coordinates of its four corners. In an alternative embodiment of the invention, as shown in fig. 5, the text box is a rectangle whose sides are parallel to the coordinate axes; its lower-left corner is denoted (x1, y2), its upper-left corner (x1, y1), its lower-right corner (x2, y2), and its upper-right corner (x2, y1).
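As an illustrative sketch only, the four-corner representation above could be held in a small record type. The name `TextBox`, the field `n_chars`, and the sample values are hypothetical and do not come from the disclosure; the sketch also assumes that y grows upward, so that the upper-side height y1 is not smaller than the lower-side height y2.

```python
from typing import Dict, NamedTuple, Tuple


class TextBox(NamedTuple):
    """Axis-aligned text box (hypothetical helper type, y assumed to grow upward)."""
    x1: float  # left abscissa
    y1: float  # upper-side height
    x2: float  # right abscissa
    y2: float  # lower-side height
    n_chars: int = 0  # number of characters annotated in the box

    def corners(self) -> Dict[str, Tuple[float, float]]:
        # The four corners in the notation of fig. 5.
        return {
            "lower_left": (self.x1, self.y2),
            "upper_left": (self.x1, self.y1),
            "lower_right": (self.x2, self.y2),
            "upper_right": (self.x2, self.y1),
        }


# Made-up example: a four-character year box.
year = TextBox(x1=10, y1=50, x2=50, y2=30, n_chars=4)
year_corners = year.corners()
```

Storing only x1, y1, x2 and y2 is sufficient here because the sides are parallel to the coordinate axes.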
In the embodiment of the invention, text recognition often has to recognize digits or special symbols, and such characters are not annotated one by one; instead, they are annotated together. For example, for the year in date data, as shown in fig. 5, the year 2020 is annotated in a single text box, and the same applies to the month and the day, so the number of characters differs from one text box to another. The word-granularity annotation training sample of the invention therefore also records the number of characters in each text box.
In an alternative embodiment of the present invention, the coordinate information of each text box in the word-granularity annotation training sample and the number of characters in each text box may be presented in tabular form, for example as in Table 1 below:
TABLE 1 (coordinate information and number of characters of each text box; table image not reproduced)
The embodiment shown in Table 1 contains two dates, each corresponding to three text boxes, for six text boxes in total. Table 1 records the coordinate information of each text box and the number of characters in it: a year box contains four characters, and a month or day box contains one or two characters.
Step S102, determining start boxes and non-start boxes among all the text boxes according to the numbers of characters and the coordinate information.
In the embodiment of the invention, generating a sentence-granularity annotation training sample from a word-granularity annotation training sample essentially means concatenating individual words into sentences, so determining where each sentence starts is particularly important. The invention determines the start boxes according to the coordinate information and the number of characters of each text box, and thereby divides all text boxes in the word-granularity annotation training sample into start boxes and non-start boxes.
In the embodiment of the invention, there is usually no other text within a certain range to the left of the first word of a sentence. On this basis, the invention can use the coordinate information of each text box to determine whether any other text box lies within a certain range to its left.
In the embodiment of the invention, text recognition is usually performed on structured sentences. For example, date data starts with the year; in other recognition scenarios, a sentence may start with information that has a fixed number of characters, such as an enterprise number or an identity card number. The invention can therefore also determine the start boxes according to the number of characters in each text box.
In the embodiment of the invention, once the start boxes among all the text boxes in the word-granularity annotation training sample have been determined, the remaining text boxes are the non-start boxes.
For the text boxes in Table 1, the resulting start-box information and non-start-box information may be as shown in Table 2 and Table 3 below:
TABLE 2 (table image not reproduced)
TABLE 3 (table image not reproduced)
Table 2 gives the coordinate information of the start boxes, and Table 3 gives the coordinate information of the non-start boxes.
Step S103, starting a loop with a start box as the current box; in each iteration, searching for non-start boxes that satisfy a preset position condition with respect to the current box, determining which of the found non-start boxes is closest to the current box, and making that non-start box the current box for the next iteration; stopping the loop if no non-start box satisfying the preset position condition with respect to the current box can be found.
In the embodiment of the invention, this step begins at a start box and searches for the non-start box adjacent to its right; it then searches again, from the non-start box just found, for the next non-start box adjacent on the right, and so on until all non-start boxes corresponding to the start box have been found. Taking date data as an example, the start box is the year. The non-start box adjacent to the right of the start box, namely the month, is found first; the search is then repeated from the month box to find the non-start box adjacent to its right, namely the day; finally, all non-start boxes corresponding to the year start box (the month and the day) are obtained.
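The loop of step S103 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: `TextBox`, `collect_chain` and `right_of_midpoint` are hypothetical names, the position condition is passed in as a callable so that any form of it can be plugged in, and "closest" is assumed to mean the smallest horizontal gap between the candidate's left edge and the current box's right edge, which the text does not specify.

```python
from typing import Callable, List, NamedTuple


class TextBox(NamedTuple):
    x1: float  # left abscissa
    y1: float  # upper-side height
    x2: float  # right abscissa
    y2: float  # lower-side height


def collect_chain(
    start: TextBox,
    non_start: List[TextBox],
    meets_condition: Callable[[TextBox, TextBox], bool],
) -> List[TextBox]:
    """Follow right-hand neighbours from `start` until no candidate remains."""
    chain: List[TextBox] = []
    remaining = list(non_start)  # copy so the caller's list is left untouched
    current = start
    while True:
        candidates = [b for b in remaining if meets_condition(current, b)]
        if not candidates:
            break  # no non-start box satisfies the condition: stop the loop
        # Assumed distance measure: horizontal gap to the current box.
        nearest = min(candidates, key=lambda b: abs(b.x1 - current.x2))
        chain.append(nearest)
        remaining.remove(nearest)
        current = nearest  # the nearest box becomes the current box
    return chain


def right_of_midpoint(a: TextBox, b: TextBox) -> bool:
    # Stand-in predicate for this example: b starts right of a's horizontal midpoint.
    return b.x1 > (a.x1 + a.x2) / 2


# Made-up date boxes: year, month, day on one row.
year = TextBox(10, 50, 50, 30)
month = TextBox(55, 50, 75, 30)
day = TextBox(80, 50, 100, 30)
chain = collect_chain(year, [day, month], right_of_midpoint)
```

In this example the loop visits the month box first and then the day box, mirroring the year, month, day walk described above.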
In an embodiment of the present invention, the preset position condition in this step includes a first condition and a second condition. The first condition includes: the left abscissa of the non-start box is greater than the average of the left and right abscissas of the current box. The second condition includes: the lower-side height of the non-start box is less than or equal to the upper-side height of the current box, and/or the upper-side height of the non-start box is greater than or equal to the lower-side height of the current box.
Based on the embodiment shown in fig. 5, the first condition described above can be expressed by the following formula:
b:x1 > (a:x1 + a:x2) / 2
where a denotes the current box (initially the start box), b denotes a non-start box, and b:x1 denotes x1 in the coordinate information of non-start box b; the other symbols are defined analogously and are not described again.
Based on the embodiment shown in fig. 5, the second condition described above can be expressed by the following formula:
b:y2 ≤ a:y1, b:y1 ≥ a:y2
where a represents the current box (start box) and b represents the non-start box.
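The two conditions can be written as a small predicate. The sketch below is illustrative only: `TextBox` and `meets_position_condition` are hypothetical names, y is assumed to grow upward (so y1 is the larger value), and the "and/or" in the text is read here as requiring both inequalities, which amounts to a vertical-overlap test.

```python
from typing import NamedTuple


class TextBox(NamedTuple):
    x1: float  # left abscissa
    y1: float  # upper-side height
    x2: float  # right abscissa
    y2: float  # lower-side height


def meets_position_condition(a: TextBox, b: TextBox) -> bool:
    """Check whether non-start box b satisfies both conditions relative to current box a."""
    # First condition: b:x1 > (a:x1 + a:x2) / 2
    first = b.x1 > (a.x1 + a.x2) / 2
    # Second condition (assumed conjunction): b:y2 <= a:y1 and b:y1 >= a:y2
    second = b.y2 <= a.y1 and b.y1 >= a.y2
    return first and second


# Made-up boxes: a year box, a month box on the same row, and a box one row lower.
year = TextBox(10, 50, 50, 30)
month = TextBox(55, 50, 75, 30)
lower_row = TextBox(55, 20, 75, 0)

same_row_ok = meets_position_condition(year, month)
lower_row_ok = meets_position_condition(year, lower_row)
left_ok = meets_position_condition(month, year)
```

With these values only the month box qualifies: the lower-row box fails the second condition and the year box lies to the left of the month box.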
In an optional embodiment of the present invention, the upper-right corner coordinates of a plurality of first text boxes and the upper-left corner coordinates of a plurality of second text boxes may further be extracted and used to train a right-adjacent text box recognition model with the KNN algorithm, where each second text box is the right-adjacent text box of a first text box, a first text box may be a start box or a non-start box, and a second text box is a non-start box. The right-adjacent text box recognition model is used to determine several suspected right-adjacent text boxes for a target text box: the upper-right corner coordinates of the target text box and the upper-left corner coordinates of all other text boxes (non-start boxes) are input into the model, and the model outputs several text boxes as the suspected right-adjacent text boxes of the target text box.
In an alternative embodiment of the present invention, when searching in each iteration of step S103 for non-start boxes that satisfy the preset position condition with respect to the current box, the upper-right corner coordinates of the current box and the upper-left corner coordinates of all non-start boxes may first be input into the right-adjacent text box recognition model. The model outputs several non-start boxes corresponding to the current box, and the non-start boxes satisfying the preset position condition are then searched for among those outputs, which makes the search more accurate and more efficient.
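The pre-selection idea can be illustrated with a much simpler stand-in. The sketch below is not the trained KNN recognition model described above; it is a plain k-nearest lookup that ranks non-start boxes by the Euclidean distance between their upper-left corner and the current box's upper-right corner. The names `TextBox` and `nearest_right_candidates` and the default `k=3` are hypothetical.

```python
import heapq
import math
from typing import List, NamedTuple


class TextBox(NamedTuple):
    x1: float  # left abscissa
    y1: float  # upper-side height
    x2: float  # right abscissa
    y2: float  # lower-side height


def nearest_right_candidates(
    current: TextBox, non_start: List[TextBox], k: int = 3
) -> List[TextBox]:
    """Return up to k non-start boxes whose upper-left corner is nearest
    to the upper-right corner of `current` (simplified stand-in, no training)."""
    anchor = (current.x2, current.y1)  # upper-right corner of the current box
    return heapq.nsmallest(
        k, non_start, key=lambda b: math.dist(anchor, (b.x1, b.y1))
    )


# Made-up boxes: the first two are close to the year box, the last is far away.
year = TextBox(10, 50, 50, 30)
month = TextBox(55, 50, 75, 30)
day = TextBox(80, 50, 100, 30)
far = TextBox(300, 200, 340, 180)

suspects = nearest_right_candidates(year, [far, day, month], k=2)
```

The returned suspects would then be filtered with the preset position condition, so the condition only has to be evaluated on a handful of boxes rather than on all non-start boxes.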
Step S104, generating a sentence-granularity annotation sentence box from the start box and the non-start boxes determined for it by the loop, so as to generate a sentence-granularity annotation training sample corresponding to the word-granularity annotation training sample.
In an embodiment of the present invention, this step may specifically be: generating a bounding rectangle that contains the start box and the non-start boxes determined for it by the loop, determining the coordinate information of the bounding rectangle, and generating the sentence-granularity annotation sentence box. A sentence-granularity annotation training sample generated from the word-granularity annotation training sample shown in fig. 4 may be as shown in fig. 6.
In an embodiment of the present invention, the generated bounding rectangle is the minimum-area bounding rectangle containing the start box and the non-start boxes corresponding to it.
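For axis-aligned boxes, the minimum-area bounding rectangle follows directly from the extreme coordinates. The sketch below is illustrative; `TextBox` and `bounding_box` are hypothetical names, and y is assumed to grow upward, so the upper side is the maximum y1 and the lower side is the minimum y2.

```python
from typing import List, NamedTuple


class TextBox(NamedTuple):
    x1: float  # left abscissa
    y1: float  # upper-side height
    x2: float  # right abscissa
    y2: float  # lower-side height


def bounding_box(boxes: List[TextBox]) -> TextBox:
    """Smallest axis-aligned rectangle containing every box in `boxes`."""
    if not boxes:
        raise ValueError("at least one box is required")
    return TextBox(
        x1=min(b.x1 for b in boxes),  # leftmost edge
        y1=max(b.y1 for b in boxes),  # highest upper side (y assumed upward)
        x2=max(b.x2 for b in boxes),  # rightmost edge
        y2=min(b.y2 for b in boxes),  # lowest lower side
    )


# Made-up date boxes; the month box is slightly taller than the others.
year = TextBox(10, 50, 50, 30)
month = TextBox(55, 52, 75, 28)
day = TextBox(80, 50, 100, 30)

sentence_box = bounding_box([year, month, day])
```

The resulting `sentence_box` spans from the left edge of the year box to the right edge of the day box and is tall enough to cover the month box.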
According to this embodiment, starting from a word-granularity annotation training sample, the start boxes and non-start boxes among all the text boxes in the sample are determined, the non-start boxes corresponding to each start box are searched for, and a sentence-granularity annotation sentence box is finally generated from each start box and the non-start boxes found for it. A sentence-granularity annotation training sample is thereby generated from the word-granularity annotation training sample, which improves the efficiency of generating sentence-granularity annotation training samples and, in turn, the training efficiency and accuracy of the sequence recognition model.
Fig. 2 is a flowchart of determining a start box according to an embodiment of the present invention. As shown in fig. 2, in an embodiment of the present invention, determining the start boxes and non-start boxes among all the text boxes according to the numbers of characters and the coordinate information in step S102 specifically includes steps S201 and S202.
Step S201: a text box whose number of characters equals a preset value is determined to be a start box.
In the embodiment of the invention, the preset value is four for the date recognition scenario; for other text recognition scenarios the preset value is set according to the scenario. For example, for text that begins with an identity card number, the preset value is set to 18.
Step S202: a text box that has no other text box within a preset range to its left is determined to be a start box.
In the embodiment of the present invention, the preset range to the left may be a preset range around the upper-left corner of the text box. Specifically, if no upper-right or lower-right corner of another text box lies within a certain distance of the upper-left corner of a text box, that text box is determined to be a start box. In the embodiment of the invention, the preset range is determined according to the actual size of the sample.
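Steps S201 and S202 can be sketched together as follows. This is an illustration under assumptions rather than the disclosed implementation: `TextBox`, `has_left_neighbor` and `split_boxes` are hypothetical names, the preset range is modelled as a Euclidean radius around the upper-left corner, and a box is treated as a start box if either criterion holds.

```python
import math
from typing import List, NamedTuple, Tuple


class TextBox(NamedTuple):
    x1: float  # left abscissa
    y1: float  # upper-side height
    x2: float  # right abscissa
    y2: float  # lower-side height
    n_chars: int  # number of characters annotated in the box


def has_left_neighbor(box: TextBox, boxes: List[TextBox], radius: float) -> bool:
    """True if another box's upper-right or lower-right corner lies within
    `radius` of the upper-left corner of `box`."""
    upper_left = (box.x1, box.y1)
    for other in boxes:
        if other is box:
            continue  # skip the box itself
        for corner in ((other.x2, other.y1), (other.x2, other.y2)):
            if math.dist(upper_left, corner) <= radius:
                return True
    return False


def split_boxes(
    boxes: List[TextBox], preset_chars: int, radius: float
) -> Tuple[List[TextBox], List[TextBox]]:
    """Split boxes into (start boxes, non-start boxes)."""
    start, non_start = [], []
    for box in boxes:
        # S201: character count equals the preset value, or
        # S202: nothing within the preset range to the left.
        if box.n_chars == preset_chars or not has_left_neighbor(box, boxes, radius):
            start.append(box)
        else:
            non_start.append(box)
    return start, non_start


# Made-up date boxes: year (4 characters), month and day (2 characters each).
year = TextBox(10, 50, 50, 30, 4)
month = TextBox(55, 50, 75, 30, 2)
day = TextBox(80, 50, 100, 30, 2)

start_boxes, non_start_boxes = split_boxes([year, month, day], preset_chars=4, radius=10)
```

For the date scenario `preset_chars` is four, as stated above; the radius of 10 is an arbitrary value chosen for these made-up coordinates and would in practice depend on the size of the sample.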
In an embodiment of the present invention, the sentence granularity annotation training sample generation method further includes: determining, according to the coordinate information of each start box, which start boxes lie in the same row.
In the embodiment of the invention, two start boxes may also lie in the same row of a training sample; in the date recognition scenario, for example, two dates may appear in the same row. In this case, when step S103 successively determines the non-start boxes adjacent to the right of a start box, a non-start box that actually belongs to start box B may be wrongly assigned to start box A, which lies in the same row as B. Therefore, in an alternative embodiment of the present invention, step S103 also needs to take the start boxes in the same row into account when determining the non-start box adjacent to the right of a start box; the specific steps are shown in fig. 3.
With reference to the embodiment shown in Fig. 5, the following formula can be used to determine whether a start frame $a_1$ in the same row exists to the right of a start frame $a_0$:

$$a_0{:}x_1 < a_1{:}x_1 \quad \text{and} \quad \left(a_1{:}y_1 < a_0{:}y_2 \ \ \text{or} \ \ a_1{:}y_2 > a_0{:}y_1\right)$$
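A same-row test along the lines of the formula above might look like the following Python sketch. The `Frame` type and the function name are hypothetical. The sketch assumes image coordinates in which x1 is the left edge, y1 the top edge and y2 the bottom edge, with y growing downward. Note one deliberate deviation, which is an interpretation and not something the source states: the formula joins the two vertical comparisons with "or", but under the coordinate assumption just given that would hold for every pair of frames, so the sketch requires both comparisons to hold, which amounts to a vertical overlap test.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    """Hypothetical frame in image coordinates (y grows downward)."""
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge


def same_row_to_right(a0: Frame, a1: Frame) -> bool:
    """Return True if a1 starts to the right of a0 and overlaps it vertically."""
    is_to_the_right = a0.x1 < a1.x1
    # Assumed reading: both vertical comparisons must hold (overlap test).
    overlaps_vertically = a1.y1 < a0.y2 and a1.y2 > a0.y1
    return is_to_the_right and overlaps_vertically
```

If the coordinate convention of Fig. 5 differs from the one assumed here, the comparisons would need to be adapted accordingly.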
Fig. 3 is a flowchart of determining the non-start frame adjacent to the right side of a start frame according to an embodiment of the present invention. As shown in Fig. 3, in an embodiment of the invention, the process of determining the non-start frame adjacent to the right side of the start frame in step S103 further includes steps S301 to S302.
Step S301: determine whether a start frame located in the same row as the start frame that served as the current frame when the loop flow was started lies between that start frame and the closest non-start frame.
In this embodiment of the invention, this step may make the determination according to the coordinate information of the closest non-start frame, the coordinate information of the start frame that served as the current frame when the loop flow was started, and the coordinate information of the start frames located in the same row as that start frame.
Step S302: if no such start frame exists, update the closest non-start frame to be the current frame.
In this embodiment of the invention, if such a start frame exists, the loop flow of step S103 is stopped.
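The loop flow with the same-row check of steps S301 and S302 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the `Frame` type, the function names, measuring "closest" and "between" by left x coordinates, and reading the position condition as "starts right of the current frame's horizontal midpoint and overlaps it vertically" are all assumptions. Coordinates are assumed to be image coordinates with y growing downward.

```python
from typing import List, NamedTuple, Sequence


class Frame(NamedTuple):
    """Hypothetical frame in image coordinates (y grows downward)."""
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge


def follows(current: Frame, candidate: Frame) -> bool:
    # Assumed position condition: candidate starts right of the current
    # frame's horizontal midpoint and overlaps it vertically.
    right_of_mid = candidate.x1 > (current.x1 + current.x2) / 2
    overlaps = candidate.y1 <= current.y2 and candidate.y2 >= current.y1
    return right_of_mid and overlaps


def collect_chain(start: Frame, non_starts: Sequence[Frame],
                  same_row_starts: Sequence[Frame]) -> List[Frame]:
    """Collect the non-start frames that belong to `start`, left to right."""
    chain: List[Frame] = []
    current = start
    while True:
        candidates = [f for f in non_starts if follows(current, f)]
        if not candidates:
            break  # no frame satisfies the position condition: stop the loop
        nearest = min(candidates, key=lambda f: f.x1)
        # S301: does a same-row start frame lie between the original start
        # frame and the nearest candidate?
        if any(start.x1 < s.x1 < nearest.x1 for s in same_row_starts):
            break  # the candidate belongs to the other start frame
        # S302: accept the candidate and make it the current frame.
        chain.append(nearest)
        current = nearest
    return chain
```

With two dates in one row, the chain for the left start frame stops before the right start frame, so the frames that follow the right start frame are left for its own chain.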
As can be seen from the above embodiments, the invention provides a method for generating sentence granularity annotation training samples from word granularity annotation training samples. This effectively improves the efficiency of generating sentence granularity annotation training samples, which can be produced without a second round of manual annotation, and thereby improves the training efficiency and the accuracy of the sequence recognition model.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
Based on the same inventive concept, the embodiments of the invention also provide a sentence granularity annotation training sample generation device, which can be used to implement the sentence granularity annotation training sample generation method described in the above embodiments, as described in the following embodiments. Because the device solves the problem on a principle similar to that of the method, the embodiments of the device may refer to the embodiments of the method, and repeated details are omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 7 is a structural block diagram of a sentence granularity annotation training sample generation device according to an embodiment of the present invention. As shown in Fig. 7, the device includes: a word granularity annotation training sample acquisition unit 1, a start frame determination unit 2, a non-start frame query unit 3 and a sentence granularity annotation training sample generation unit 4.
The word granularity annotation training sample acquisition unit 1 is configured to acquire the coordinate information of each text frame in the word granularity annotation training sample and the number of characters in each text frame.
The start frame determination unit 2 is configured to determine start frames and non-start frames among all the text frames according to the number of characters and the coordinate information.
The non-start frame query unit 3 is configured to start a loop flow with the start frame as the current frame; in each loop, to search for non-start frames that satisfy a preset position condition with respect to the current frame, to determine, among all the non-start frames found, the non-start frame closest to the current frame, and to update that closest non-start frame to be the current frame so as to enter the next loop; and, if no non-start frame satisfying the preset position condition with respect to the current frame can be found, to stop the loop flow.
The sentence granularity annotation training sample generation unit 4 is configured to generate a sentence granularity annotation sentence frame according to the start frame and the non-start frames corresponding to the start frame determined through the loop flow, so as to generate a sentence granularity annotation training sample corresponding to the word granularity annotation training sample.
In an embodiment of the present invention, the start frame determination unit 2 specifically includes:
a first determining module, configured to determine a text frame whose number of characters equals a preset value to be a start frame; and
a second determining module, configured to determine a text frame that has no other text frame within a preset range on its left to be a start frame.
In an embodiment of the present invention, the preset position condition includes a first condition and a second condition;
the first condition includes: the left abscissa of the non-start frame is greater than the average of the left abscissa and the right abscissa of the current frame;
the second condition includes: the lower side height of the non-start frame is less than or equal to the upper side height of the current frame, and/or the upper side height of the non-start frame is greater than or equal to the lower side height of the current frame.
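The two conditions can be expressed as a small predicate. The sketch below is one possible reading, not the definitive one: it assumes image coordinates in which y grows downward, so the "heights" of the source are replaced by top and bottom y coordinates, and it treats the second condition as requiring that the vertical extents of the two frames overlap. The `Frame` type and the name `meets_position_condition` are hypothetical.

```python
from typing import NamedTuple


class Frame(NamedTuple):
    """Hypothetical frame in image coordinates (y grows downward)."""
    x1: float  # left edge
    y1: float  # top edge
    x2: float  # right edge
    y2: float  # bottom edge


def meets_position_condition(current: Frame, candidate: Frame) -> bool:
    # First condition: the candidate's left edge lies to the right of the
    # current frame's horizontal midpoint.
    first = candidate.x1 > (current.x1 + current.x2) / 2
    # Second condition (assumed reading): the vertical extents overlap.
    second = candidate.y1 <= current.y2 and candidate.y2 >= current.y1
    return first and second
```

Comparing against the midpoint instead of the right edge tolerates word frames that overlap slightly in the horizontal direction, which is a plausible motivation for the first condition.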
In an embodiment of the present invention, the sentence granularity annotation training sample generation device further includes:
a same-row detection unit, configured to determine, according to the coordinate information of each start frame, the start frames that are located in the same row among all the start frames.
In an embodiment of the present invention, the non-start frame query unit 3 updating the non-start frame closest to the current frame to be the current frame specifically includes:
the non-start frame query unit 3 determines whether a start frame located in the same row as the start frame that served as the current frame when the loop flow was started exists between that start frame and the closest non-start frame, and, if not, updates the closest non-start frame to be the current frame.
In an embodiment of the present invention, the sentence granularity annotation training sample generation unit 4 includes:
a sentence granularity annotation sentence frame generation module, configured to generate a circumscribed rectangle containing the start frame and the non-start frames corresponding to the start frame determined through the loop flow, and to determine the coordinate information of the circumscribed rectangle, so as to generate the sentence granularity annotation sentence frame.
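Computing the rectangle that encloses a start frame and its non-start frames reduces to taking coordinate extremes. The Python sketch below assumes each frame is given as an `(x1, y1, x2, y2)` tuple of upper left and lower right corners in image coordinates; the function name and the tuple layout are illustrative assumptions.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def bounding_rectangle(frames: Sequence[Box]) -> Box:
    """Return the smallest axis-aligned rectangle containing all frames."""
    lefts, tops, rights, bottoms = zip(*frames)
    return (min(lefts), min(tops), max(rights), max(bottoms))
```

The resulting tuple can serve as the coordinate information of the sentence frame, with the start frame first and its non-start frames after it in the input sequence.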
To achieve the above object, according to another aspect of the present application, a computer device is also provided. As shown in Fig. 8, the computer device includes a memory, a processor, a communication interface and a communication bus; a computer program executable on the processor is stored in the memory, and the processor implements the steps of the method of the above embodiments when executing the computer program.
The processor may be a central processing unit (Central Processing Unit, CPU). The processor may also be any other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof.
The memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs, non-transitory computer-executable programs and units, such as the program units corresponding to the above method embodiments of the invention. By running the non-transitory software programs, instructions and modules stored in the memory, the processor performs its various functional applications and data processing, that is, it implements the methods of the above method embodiments.
The memory may include a program storage area and a data storage area. The program storage area may store an operating system and an application program required for at least one function; the data storage area may store data created by the processor, and the like. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory located remotely from the processor, and such remote memory may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks and combinations thereof.
The one or more units are stored in the memory and, when executed by the processor, perform the method of the above embodiments.
The details of the computer device may be correspondingly understood by referring to the corresponding relevant descriptions and effects in the above embodiments, and will not be repeated here.
To achieve the above object, according to another aspect of the present application, a computer-readable storage medium is also provided. The storage medium stores a computer program that, when executed by a computer processor, implements the steps of the sentence granularity annotation training sample generation method described above. Those skilled in the art will appreciate that all or part of the flows of the above method embodiments may be implemented by a computer program instructing the related hardware; the program may be stored in a computer-readable storage medium and, when executed, may carry out the flows of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory, a Hard Disk Drive (HDD), a Solid-State Drive (SSD) or the like; the storage medium may also comprise a combination of the above kinds of memory.
It will be apparent to those skilled in the art that the modules or steps of the invention described above may be implemented with general-purpose computing devices. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented as program code executable by computing devices, so that they can be stored in a storage device and executed by the computing devices; alternatively, they may each be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
The above describes only preferred embodiments of the invention and is not intended to limit the invention; those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the invention shall fall within the scope of protection of the invention.

Claims (12)

1. A sentence granularity annotation training sample generation method, characterized by comprising:
acquiring the coordinate information of each text frame in a word granularity annotation training sample and the number of characters in each text frame;
determining start frames and non-start frames among all the text frames according to the number of characters and the coordinate information;
starting a loop flow with the start frame as the current frame; in each loop, searching for non-start frames that satisfy a preset position condition with respect to the current frame, determining, among all the non-start frames found, the non-start frame closest to the current frame, and updating the non-start frame closest to the current frame to be the current frame so as to enter the next loop; if no non-start frame satisfying the preset position condition with respect to the current frame can be found, stopping the loop flow;
generating a sentence granularity annotation sentence frame according to the start frame and the non-start frames corresponding to the start frame determined through the loop flow, so as to generate a sentence granularity annotation training sample corresponding to the word granularity annotation training sample;
wherein the sentence granularity annotation training sample generation method further comprises:
determining, according to the coordinate information of each start frame, the start frames that are located in the same row among all the start frames;
and wherein, before the updating of the non-start frame closest to the current frame to be the current frame, the method further comprises:
judging whether a start frame located in the same row as the start frame that served as the current frame when the loop flow was started exists between that start frame and the closest non-start frame;
if not, updating the closest non-start frame to be the current frame;
if so, stopping the current loop flow.
2. The sentence granularity annotation training sample generation method of claim 1, wherein the determining of start frames and non-start frames among all the text frames according to the number of characters and the coordinate information comprises:
determining a text frame whose number of characters equals a preset value to be a start frame.
3. The sentence granularity annotation training sample generation method of claim 1, wherein the determining of start frames and non-start frames among all the text frames according to the number of characters and the coordinate information comprises:
determining a text frame that has no other text frame within a preset range on its left to be a start frame.
4. The sentence granularity annotation training sample generation method of claim 1, wherein the preset position condition comprises a first condition and a second condition;
the first condition includes: the left abscissa of the non-start frame is greater than the average of the left abscissa and the right abscissa of the current frame;
the second condition includes: the lower side height of the non-start frame is less than or equal to the upper side height of the current frame, and/or the upper side height of the non-start frame is greater than or equal to the lower side height of the current frame.
5. The sentence granularity annotation training sample generation method of claim 1, wherein the generating of a sentence granularity annotation sentence frame according to the start frame and the non-start frames corresponding to the start frame determined through the loop flow comprises:
generating a circumscribed rectangle containing the start frame and the non-start frames corresponding to the start frame determined through the loop flow, and determining the coordinate information of the circumscribed rectangle, so as to generate the sentence granularity annotation sentence frame.
6. A sentence granularity annotation training sample generation device, characterized by comprising:
a word granularity annotation training sample acquisition unit, configured to acquire the coordinate information of each text frame in a word granularity annotation training sample and the number of characters in each text frame;
a start frame determination unit, configured to determine start frames and non-start frames among all the text frames according to the number of characters and the coordinate information;
a non-start frame query unit, configured to start a loop flow with the start frame as the current frame; in each loop, to search for non-start frames that satisfy a preset position condition with respect to the current frame, to determine, among all the non-start frames found, the non-start frame closest to the current frame, and to update the non-start frame closest to the current frame to be the current frame so as to enter the next loop; and, if no non-start frame satisfying the preset position condition with respect to the current frame can be found, to stop the loop flow;
a sentence granularity annotation training sample generation unit, configured to generate a sentence granularity annotation sentence frame according to the start frame and the non-start frames corresponding to the start frame determined through the loop flow, so as to generate a sentence granularity annotation training sample corresponding to the word granularity annotation training sample;
wherein the sentence granularity annotation training sample generation device further comprises:
a same-row detection unit, configured to determine, according to the coordinate information of each start frame, the start frames that are located in the same row among all the start frames;
and wherein the non-start frame query unit updating the non-start frame closest to the current frame to be the current frame specifically includes:
the non-start frame query unit judges whether a start frame located in the same row as the start frame that served as the current frame when the loop flow was started exists between that start frame and the closest non-start frame; if not, it updates the closest non-start frame to be the current frame; if so, it stops the current loop flow.
7. The sentence granularity annotation training sample generation device of claim 6, wherein the start frame determination unit comprises:
a first determining module, configured to determine a text frame whose number of characters equals a preset value to be a start frame.
8. The sentence granularity annotation training sample generation device of claim 6, wherein the start frame determination unit comprises:
a second determining module, configured to determine a text frame that has no other text frame within a preset range on its left to be a start frame.
9. The sentence granularity annotation training sample generation device of claim 6, wherein the preset position condition comprises a first condition and a second condition;
the first condition includes: the left abscissa of the non-start frame is greater than the average of the left abscissa and the right abscissa of the current frame;
the second condition includes: the lower side height of the non-start frame is less than or equal to the upper side height of the current frame, and/or the upper side height of the non-start frame is greater than or equal to the lower side height of the current frame.
10. The sentence granularity annotation training sample generation device of claim 6, wherein the sentence granularity annotation training sample generation unit comprises:
a sentence granularity annotation sentence frame generation module, configured to generate a circumscribed rectangle containing the start frame and the non-start frames corresponding to the start frame determined through the loop flow, and to determine the coordinate information of the circumscribed rectangle, so as to generate the sentence granularity annotation sentence frame.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 5 when executing the computer program.
12. A computer readable storage medium storing a computer program, characterized in that the computer program when executed in a computer processor implements the method of any one of claims 1 to 5.
CN202010551112.1A 2020-06-16 2020-06-16 Sentence granularity annotation training sample generation method and device Active CN111738326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010551112.1A CN111738326B (en) 2020-06-16 2020-06-16 Sentence granularity annotation training sample generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010551112.1A CN111738326B (en) 2020-06-16 2020-06-16 Sentence granularity annotation training sample generation method and device

Publications (2)

Publication Number Publication Date
CN111738326A CN111738326A (en) 2020-10-02
CN111738326B true CN111738326B (en) 2023-07-11

Family

ID=72649396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010551112.1A Active CN111738326B (en) 2020-06-16 2020-06-16 Sentence granularity annotation training sample generation method and device

Country Status (1)

Country Link
CN (1) CN111738326B (en)




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant