CN117829124A

CN117829124A - Text de-registration method and device, electronic equipment and storage medium

Info

Publication number: CN117829124A
Application number: CN202311872256.7A
Authority: CN
Inventors: 徐春光
Original assignee: Dongguan Bubugao Education Software Co ltd
Current assignee: Dongguan Bubugao Education Software Co ltd
Priority date: 2023-12-29
Filing date: 2023-12-29
Publication date: 2024-04-05

Abstract

The application relates to the technical field of teaching equipment, and provides a text de-coincidence method, a text de-coincidence device, electronic equipment and a storage medium. The method comprises the following steps: obtaining a result of de-overlapping of a currently scanned target text and a previously scanned text; and performing text de-overlapping and processing on the target text and the text of the previous scanning to obtain the text de-overlapping and result of the current scanning. By the arrangement, corresponding text de-coincidence and result can be obtained after each scanning, so that the effect of text de-coincidence is realized. In addition, because only text de-duplication and merging processing is performed, image stitching is not needed, the problems of long time consumption and high power consumption are not generated, and the requirement of acquiring scanned text in real time with low power consumption of electronic equipment can be met.

Description

Text de-registration method and device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the technical field of teaching devices, and in particular, to a text de-registration method, a device, an electronic device, and a storage medium.

Background

Electronic equipment such as dictionary pens with wide pen points can scan the contents of a plurality of lines of texts at a time, so that the user input efficiency is improved. However, the repeated content is likely to be scanned by two scans, and the direct splicing of the scanned text introduces a large amount of repeated content, which ultimately results in a decrease in text retrieval efficiency and accuracy. In view of this problem, the prior art generally splices a plurality of scanned images into one image, and then performs OCR recognition processing, so that text having no repeated content can be obtained. However, the problem of long time consumption and high power consumption caused by image stitching can not meet the requirement of acquiring the scanned text in real time with low power consumption of the electronic equipment.

Disclosure of Invention

In view of this, the embodiments of the present application provide a text de-registration method, apparatus, electronic device, and storage medium, which can de-duplicate text scanned by the electronic device, and do not generate problems of long time consumption and high power consumption.

A first aspect of an embodiment of the present application provides a text de-registration method, including:

obtaining a result of de-overlapping of a currently scanned target text and a previously scanned text;

and performing text de-overlapping and processing on the target text and the text of the previous scanning to obtain the text de-overlapping and result of the current scanning.

In the embodiment of the application, after the target text scanned at the present time is obtained, the target text is de-overlapped with the text scanned at the previous time and the text de-overlapped and combined processing is performed on the result, so that the text de-overlapped and the result scanned at the present time are obtained. By the arrangement, when the text scanned next time is obtained, the text de-coincidence and merging processing can be carried out on the text de-coincidence and result of the text scanned next time, so that the text de-coincidence and result of the next time can be obtained, namely, the corresponding text de-coincidence and result can be obtained after each time of scanning, and therefore the effect of text de-coincidence is achieved. In addition, the text de-duplication and merging processing is only performed in the process, and image splicing is not needed, so that the problems of long time consumption and high power consumption are not generated, and the requirement of acquiring the scanned text in real time with low power consumption of the electronic equipment can be met.

In one implementation manner of the embodiment of the present application, performing text de-registration and processing on the target text and the previously scanned text to obtain the currently scanned text de-registration and result, where the method includes:

sequentially de-registering the texts scanned in the previous time and selecting a current text line from the results;

detecting whether the current text line is repeated with any text line of the target text;

if any text line of the current text line and the target text line is repeated, selecting one text line from the current text line and the any text line, adding the selected text line to a duplicate removal result queue, and deleting the any text line in the target text;

if neither the current text line nor any text line of the target text is repeated, adding the current text line to a duplicate removal result queue;

if all text lines in the text de-overlapping and result of the previous scanning are sequentially selected, adding the rest text lines in the target text to a de-duplication result queue, determining the de-duplication result queue as the text de-overlapping and result of the current scanning, otherwise, returning to the step of sequentially selecting one current text line from the text de-overlapping and result of the previous scanning and the subsequent step.

In one implementation of the embodiment of the present application, detecting whether the current text line is repeated with any text line of the target text includes:

comparing the current text line with each text line of the target text respectively to obtain the text line coincidence of each text line of the target text and the current text line;

and if the text overlap ratio of any text line in the target text is higher than a set threshold, determining that the current text line is repeated with any text line.

In one implementation of the embodiment of the present application, let the target text line represent any text line of the target text; comparing the coincidence degree of the current text line with each text line of the target text to obtain the coincidence degree of each text line of the target text and the current text line, comprising:

dividing the current text line into a first character array and dividing the target text line into a second character array;

constructing a comparison matrix of the first character array and the second character array;

determining a common substring of the current text line and the target text line according to the comparison matrix;

and calculating the text coincidence degree of the target text line and the current text line according to the length of the public substring, the length of the first character array and the length of the second character array.

In one implementation manner of the embodiment of the present application, determining, according to the comparison matrix, a common substring of the current text line and the target text line includes:

and searching each element segment with continuous values from the comparison matrix as a common substring.

In one implementation manner of the embodiment of the present application, selecting a text line from the current text line and any text line, and adding the text line to a duplicate removal result queue includes:

if the lengths of the current text line and any text line are different, selecting a text line with a longer length from the current text line and any text line, and adding the text line to a duplicate removal result queue;

if the length of the current text line is the same as that of any text line, the current text line is added to the duplicate removal result queue.

In another implementation manner of the embodiment of the present application, performing text de-registration and processing on the target text and the previously scanned text to obtain the currently scanned text de-registration and result, where the text de-registration and result includes:

de-registering the text scanned in the previous time and adding the result to a de-registering result queue;

selecting a current text line from the target text in sequence;

detecting whether the current text line is de-overlapped with the text scanned in the previous time and repeating any text line of the result;

If the current text line is de-overlapped with the text scanned in the previous time and any text line of the result is repeated, deleting the current text line in the target text;

if all text lines in the target text are selected in sequence, adding the rest text lines in the target text to a duplicate removal result queue, determining the duplicate removal result queue as a duplicate removal result of the text scanned at the time, otherwise, returning to execute the step of selecting one current text line from the target text in sequence and the subsequent steps.

A second aspect of an embodiment of the present application provides a text de-registering and device, including:

the text acquisition module is used for acquiring the de-coincidence and result of the currently scanned target text and the previously scanned text;

and the text de-overlapping module is used for de-overlapping the target text and the text of the previous scanning and processing the text de-overlapping result to obtain the text de-overlapping result of the current scanning.

A third aspect of the embodiments of the present application provides an electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the text de-registration and method as provided in the first aspect of the embodiments of the present application when the computer program is executed.

A fourth aspect of the embodiments of the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, implements a text de-registration and method as provided by the first aspect of the embodiments of the present application.

A fifth aspect of the embodiments of the present application provides a computer program product, which when run on an electronic device, causes the electronic device to perform the text de-registration and method as provided by the first aspect of the embodiments of the present application.

It will be appreciated that the advantages of the second to fifth aspects may be found in the relevant description of the first aspect, and are not described here again.

Drawings

FIG. 1 is a flow chart of a text de-registering and method provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of an overall operation flow of the text de-registration and method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of an operation flow for calculating text overlap ratio according to an embodiment of the present application;

FIG. 4 is a schematic operation flow diagram of a core text deduplication merging process according to an embodiment of the present application;

FIG. 5 is a structural frame diagram of a text de-registering and device provided in an embodiment of the present application;

Fig. 6 is a schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the embodiments of the present application. It will be apparent, however, to one skilled in the art that the present application may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present application with unnecessary detail. In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.

At present, electronic devices such as partial dictionary pens on the market adopt a design with a wide pen point, and the electronic devices can scan contents of a plurality of lines of texts at one time, so that the input efficiency of a user is improved, but the problem that the text retrieval efficiency and the accuracy are affected due to the fact that repeated contents are scanned also exists. In the prior art, a plurality of scanned images are spliced into one image, and then OCR recognition processing is performed, so that a text without repeated contents is obtained. However, the problem of long time consumption and high power consumption caused by image stitching can not meet the requirement of acquiring the scanned text in real time with low power consumption of the electronic equipment.

Aiming at the problems, the embodiment of the application provides a text de-coincidence method, a device, electronic equipment and a storage medium, which can de-coincidence the text scanned by the electronic equipment, and can not generate the problems of long time consumption and high power consumption. For more specific technical implementation details of embodiments of the present application, please refer to various embodiments described below.

It should be understood that the execution body of the method embodiments of the present application is various types of electronic devices, for example, a dictionary pen, a translation pen, a learning machine, a home teaching machine, a tablet computer, a mobile phone, a wearable device, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, a notebook computer, a netbook, a personal digital assistant (personal digital assistant, PDA), and the like, and the specific type of the electronic device is not limited in the embodiments of the present application.

Referring to fig. 1, a text de-registration method provided in an embodiment of the present application includes:

101. obtaining a result of de-overlapping of a currently scanned target text and a previously scanned text;

when a user reads a text page, electronic equipment such as a dictionary pen can be used for scanning the content of the text page, so that corresponding translation or interpretation can be obtained. Under a scene of multi-line scanning, a user needs to scan for a plurality of times, the electronic equipment splices texts obtained by each scanning, and then searches are completed based on the spliced texts. After the user finishes the current scanning operation, the electronic device can acquire the de-overlapping and result of the current scanned target text and the previously scanned text.

102. And performing text de-overlapping and processing on the target text and the text of the previous scanning to obtain the text de-overlapping and result of the current scanning.

After the electronic equipment acquires the de-overlapping and result of the target text of the current scanning and the text of the previous scanning, the de-overlapping and result of the target text and the text of the previous scanning is subjected to text de-overlapping and combination processing, so that the de-overlapping and result of the text of the current scanning is obtained, namely, the corresponding text de-overlapping and result can be obtained after each scanning, and therefore the effect of text de-overlapping is realized.

Specifically, after the electronic device completes the first scanning, the text scanned for the first time is saved; after finishing the second scanning, the electronic equipment performs text de-overlapping and merging processing on the text scanned for the second time and the text scanned for the first time to obtain a text de-overlapping and merging result of the second scanning; after finishing the third scanning, the electronic equipment de-overlaps the text of the third scanning with the text of the second scanning and processes the text de-overlapping result to obtain the text de-overlapping result of the third scanning; after the fourth scanning is completed, the electronic device performs text de-overlapping and processing on the text of the fourth scanning and the text of the third scanning, so as to obtain the text de-overlapping and result of the fourth scanning, and the like.

As shown in fig. 2, an overall operation flow diagram of the text de-registration method according to the embodiment of the present application is shown. For fig. 2, the scanned OCR text may be stored in a scan result list, and each time a group of OCR texts is selected from the list to be de-overlapped with the previous text and the result is processed to be de-overlapped, until all OCR texts included in the list are processed, so as to obtain a final OCR text de-overlapped and result.

(1) Sequentially de-registering the texts scanned in the previous time and selecting a current text line from the results;

(2) Detecting whether the current text line is repeated with any text line of the target text;

(3) If any text line of the current text line and the target text line is repeated, selecting one text line from the current text line and the any text line, adding the selected text line to a duplicate removal result queue, and deleting the any text line in the target text;

(4) If neither the current text line nor any text line of the target text is repeated, adding the current text line to a duplicate removal result queue;

(5) If all text lines in the text de-overlapping and result of the previous scanning are sequentially selected, adding the rest text lines in the target text to a de-duplication result queue, determining the de-duplication result queue as the text de-overlapping and result of the current scanning, otherwise, returning to the step (1).

When the target text is de-overlapped with the text scanned in the previous time and the text is de-overlapped and combined as a result, a current text line is firstly selected from the text scanned in the previous time and the text is de-overlapped and combined as a result in sequence. Then, whether the current text line is repeated with any text line of the target text is detected, and whether the text line is repeated can be determined in a text comparison mode. Then, the method is divided into two cases, wherein one case is that any text line of the current text line and the target text is repeated, at this time, the repeated processing is needed, specifically, one text line can be selected from the current text line and any text line according to a certain mode, added into a repeated result queue, and the any text line in the target text is deleted. And secondly, the current text line and any text line of the target text are not repeated, and at the moment, the current text line is directly added to a duplicate removal result queue without duplicate removal processing. Next, judging whether the text of the previous scanning is sequentially selected to be de-overlapped and all text lines in the result are all selected, if not, returning to the step (1), namely sequentially selecting the text of the previous scanning to be de-overlapped and the next text line in the result to be used as a new current text line, and then executing the same operation. If so, the remaining text lines in the target text are added to the deduplication result queue, which is then determined as the result of the de-registration of the text being scanned at the time.

To facilitate understanding of the above process, a practical example is listed below. Assuming that the previously scanned text is de-registered and results in OCR result 1, the target text is OCR result 2, where OCR result 1 includes text line 1, text line 2 and text line 3, OCR result 2 includes text line 4 and text line 5, where text line 3 and text line 4 are repeated, and none of the other text lines are repeated. Text line 1 is selected from OCR result 1 in order, and it is detected whether text line 1 is repeated with any text line of OCR result 2 (text line 4 or text line 5), and since neither text line 1 nor any text line of OCR result 2 is repeated, text line 1 is added to the deduplication result queue, which is "text line 1" at this time. Since all text lines of OCR result 1 have not been selected, text line 2 continues to be selected from OCR result 1, and it is detected whether text line 2 is repeated with any text line of OCR result 2, and since text line 2 is not repeated with any text line of OCR result 2, text line 2 is added to the deduplication result queue, and at this time, the deduplication result queue is "text line 1, text line 2". Next, selecting text line 3 from OCR result 1, detecting whether text line 3 is repeated with any text line of OCR result 2, and finding that text line 3 is repeated with text line 4, selecting one text line from text line 3 and text line 4 according to a set manner, adding text line 3 to the duplicate result queue if text line 3 is selected, and deleting text line 4 from OCR result 2. Thereafter, since all text lines of OCR result 1 have been selected, the remaining text lines in OCR result 2, i.e. text line 5, are added to the deduplication result queue, which is "text line 1, text line 2, text line 3, text line 5". That is, the final text de-overlap and the result is "text line 1, text line 2, text line 3, text line 5", it can be seen that duplicate text line 4 has been removed at the time of text merging, so that the effect of text de-overlap merging is achieved.

(1) Comparing the current text line with each text line of the target text respectively to obtain the text line coincidence of each text line of the target text and the current text line;

(2) And if the text overlap ratio of any text line in the target text is higher than a set threshold, determining that the current text line is repeated with any text line.

When detecting whether the current text line is repeated with any text line of the target text, the current text line can be respectively compared with each text line of the target text, so that the text line of the target text is obtained, and if the text line of any text line in the target text is higher than a set threshold (for example, 80%), the current text line is determined to be repeated with any text line. If the text overlap ratio of all text lines of the target text and the current text line is smaller than the set threshold value, it can be determined that neither text line of the current text line nor any text line of the target text is repeated. For example, assuming that the target text includes text line 1, text line 2, and text line 3, the current text line is aligned with text line 1 to obtain text line 1 and current text line, the current text line is aligned with text line 2 to obtain text line 2 and current text line 2, and the current text line is aligned with text line 3 to obtain text line 3 and current text line 3. Assuming that text overlap 2 is greater than 80%, it may be determined that the current text line is repeated with text line 2. If the text overlap ratio 1, the text overlap ratio 2 and the text overlap ratio 3 are all smaller than 80%, it can be determined that none of the current text lines and any of the text lines of the target text are repeated.

(1) Dividing the current text line into a first character array and dividing the target text line into a second character array;

(2) Constructing a comparison matrix of the first character array and the second character array;

(3) Determining a common substring of the current text line and the target text line according to the comparison matrix;

(4) And calculating the text coincidence degree of the target text line and the current text line according to the length of the public substring, the length of the first character array and the length of the second character array.

Assuming that the target text line represents any text line of the target text, when calculating the text overlap ratio of the target text line and the current text line, the current text line may be segmented into a first character array, and the target text line may be segmented into a second character array. For example, assuming that the current text line is S1 and the target text line is S2, S1 may be segmented into character arrays L1 and S2 may be segmented into character arrays L2 according to utf-8 encoding or the like. Then, a comparison matrix of the first character array and the second character array is constructed, wherein the comparison matrix of the two character arrays can be constructed according to a dynamic programming idea, and the comparison matrix is also called dp matrix, so that the problem of Chinese and English format representation of punctuation marks can be ignored in the comparison process, for example, the commas of Chinese and English can be considered as the same character, so that the situation that OCR recognition is easy to make mistakes is compatible with punctuation marks, and the success rate of text duplication removal is improved. After the alignment matrix is constructed, common substrings of the current text line and the target text line may be determined from the elements contained in the alignment matrix. Finally, according to the length of the public substring, the length of the first character array and the length of the second character array, the text coincidence degree of the target text line and the current text line can be calculated.

When the common substring is determined according to the comparison matrix, all elements of the comparison matrix can be traversed, and each element segment with continuous values can be found out to serve as the common substring. Wherein each found element segment with consecutive values can be regarded as a common substring.

For example, assuming s1= "e it to school and play it with my friends. It' S", s2= "school and play it", the constructed alignment matrix is:

[-1,-1,-1,-1,-1,-1,-1,-1,0,1,2,3,4,5,6,7,8,9,10,11,12,13,-1,-1,-1,

-1,-1,-1,-1,-1,-1,-1,-1,14,15,-1,-1,16,-1,-1,-1,-1,-1,-1,-1,17,-1,-1]

in the alignment matrix, the element segment [0,1,2,3,4,5,6,7,8,9,10,11,12,13] and the element segment [14,15] with continuous values can be found, and are 2 common substrings.

After determining the common substring, the sum of the lengths of all the common substrings may be calculated, and then the result divided by the smaller of the length of the first character array and the length of the second character array, thereby obtaining the text overlap ratio of the target text line and the current text line.

Fig. 3 is a schematic diagram of an operation flow for calculating text overlap ratio according to an embodiment of the present application. In fig. 3, the text S1 is first segmented into a character array L1, the text S2 is segmented into a character array L2, then a comparison matrix of L1 and L2 is obtained through a dynamic programming algorithm, and all the common substrings of the text S1 and the text S2 are obtained according to the comparison matrix. Finally, the sum of the lengths of all the public substrings is calculated, and the result is divided by the smaller length of L1 and L2, so that the text overlap ratio of the text S1 and the text S2 can be obtained.

(1) If the lengths of the current text line and any text line are different, selecting a text line with a longer length from the current text line and any text line, and adding the text line to a duplicate removal result queue;

(2) If the length of the current text line is the same as that of any text line, the current text line is added to the duplicate removal result queue.

The method belongs to a line selection retaining logic for finding repeated text lines, which is provided by the embodiment of the application, firstly detects the length of a current text line and the length of any text line, and if the lengths of two text lines are different, selects a text line with a longer length for retaining, namely adds the text line with the longer length to a duplicate removal result queue, so that more text information can be retained, and the accuracy of subsequent text retrieval is improved. If the lengths of the two text lines are the same, the text line in the previous scanning is selected to be de-overlapped and the text line in the result is obtained, namely the current text line is reserved and is added to a de-duplication result queue.

In the above description, the core text de-duplication merging process provided in the embodiment of the present application has an operation flow schematic shown in fig. 4. In fig. 4, OCR results 1 and OCR results 2 are first acquired, where OCR results 1 represents the text of the previous scan being de-registered and results, and OCR results 2 represents the target text of the current scan. And sequentially taking a line L from the OCR result 1, comparing the line L with all text lines of the OCR result 2 in a coincidence degree, deleting any text line of the OCR result 2 if the coincidence degree of the text corresponding to the any text line of the OCR result 2 is larger than a set threshold value, selecting one text line from the line L and the any text line according to a line selection retaining logic, and adding the text line into a duplicate removal result queue. If the text overlap ratio corresponding to all text lines in OCR result 2 is smaller than the set threshold, adding L into the duplicate removal result queue. Next, it is determined whether there are unmatched text lines in OCR result 1, and if so, a step of sequentially taking one line L from OCR result 1 is returned, taking care that the next line of OCR result 1 is selected at this time. If all text lines of OCR result 1 have completed matching, then the remaining text lines in OCR result 2 (i.e., the text lines that did not match successfully) are added to the deduplication result queue. And outputting the final de-duplication result queue as the de-duplication result of the text scanned at the present time.

(1) De-registering the text scanned in the previous time and adding the result to a de-registering result queue;

(2) Selecting a current text line from the target text in sequence;

(3) Detecting whether the current text line is de-overlapped with the text scanned in the previous time and repeating any text line of the result;

(4) If the current text line is de-overlapped with the text scanned in the previous time and any text line of the result is repeated, deleting the current text line in the target text;

(5) If all text lines in the target text are selected in sequence, adding the rest text lines in the target text to a duplicate removal result queue, determining the duplicate removal result queue as a duplicate removal result of the text scanned at the time, and otherwise, returning to the step (2).

This pertains to another text de-registration and flow, first adding the previously scanned text de-registration and results to the de-duplication result queue, then sequentially selecting a current text line from the target text, detecting whether the current text line overlaps with any text line of the previously scanned text de-registration and results, where the text line overlap can also be determined by text comparison. Then, the method is divided into two cases, wherein in the first case, the current text line is de-overlapped with the text scanned in the previous time and any text line of the result is repeated, and at the moment, de-duplication processing is needed, and in particular, the current text line in the target text can be deleted. And in the second case, the current text line is de-overlapped with the text scanned in the previous time and any text line of the result is not repeated, and no de-duplication processing is needed, namely the current text line in the target text is reserved. Next, it is determined whether all text lines in the target text have been selected in order, and if not, step (2) is performed again, that is, the next text line in the target text is selected in order as a new current text line, and then the same operation is performed. If so, the remaining text lines in the target text are added to the deduplication result queue, which is then determined as the result of the de-registration of the text being scanned at the time.

To facilitate understanding of the above process, a practical example is listed below. Assuming that the previously scanned text is de-registered and results in OCR result 1, the target text is OCR result 2, where OCR result 1 includes text line 1, text line 2 and text line 3, OCR result 2 includes text line 4 and text line 5, where text line 3 and text line 4 are repeated, and none of the other text lines are repeated. OCR result 1 is first added to the deduplication result queue, which is "text line 1, text line 2, text line 3". Then, text line 4 is selected from OCR result 2 in order, and it is detected whether text line 4 is repeated with any text line of OCR result 1 (text line 1, text line 2 or text line 3), at which point it is found that text line 4 and text line 3 are repeated, thus deleting text line 4 in OCR result 2. Since all text lines of OCR result 2 have not been selected, text line 5 continues to be selected from OCR result 2, and it is detected whether text line 5 is repeated with any text line of OCR result 1, and text line 5 in OCR result 2 remains since text line 5 is not repeated with any text line of OCR result 1. Thereafter, since all text lines of OCR result 2 have been selected, the remaining text lines in OCR result 2, i.e. text line 5, are added to the deduplication result queue, which is "text line 1, text line 2, text line 3, text line 5". That is, the final text de-overlap and the result is "text line 1, text line 2, text line 3, text line 5", it can be seen that duplicate text line 4 has been removed at the time of text merging, so that the effect of text de-overlap merging is achieved.

In summary, an objective of the embodiments of the present application is to provide a text de-registration and scheme with higher running speed and lower power consumption, which can solve the problem of text repetition of electronic devices such as dictionary pens in a multi-line scanning scene.

It should be understood that the sequence numbers of the steps in the foregoing embodiments do not mean the order of execution, and the execution order of the processes should be determined by the functions and the internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

A text de-registering and method is mainly described above, and a text de-registering and apparatus will be described below.

Referring to fig. 5, an embodiment of a text de-registering and device according to an embodiment of the present application includes:

a text obtaining module 501, configured to obtain a result of de-overlapping a currently scanned target text and a previously scanned text;

and the text de-registration module 502 is configured to de-register the target text with the text previously scanned and process the result to obtain the text de-registration result of the current scan.

In one implementation manner of the embodiment of the present application, the text de-registration and merging module includes:

a first text line selecting unit, configured to sequentially de-overlap the text scanned from the previous time and select a current text line from the results;

the first text repetition detection unit is used for detecting whether the current text line is repeated with any text line of the target text;

a second text line selecting unit, configured to select a text line from the current text line and any text line if the current text line is repeated with any text line of the target text, add the selected text line to the duplicate removal result queue, and delete the any text line in the target text;

The first text line adding unit is used for adding the current text line to the duplicate removal result queue if any text line of the current text line and the target text is not repeated;

and the second text line adding unit is used for adding the rest text lines in the target text to the duplicate removal result queue if all text lines in the duplicate removal result of the text scanned in the previous time have been selected in sequence, determining the duplicate removal result queue as the duplicate removal result of the text scanned in the current time, and otherwise, returning to the step of selecting one current text line from the duplicate removal result of the text scanned in the previous time in sequence and the subsequent step.

In one implementation manner of the embodiment of the present application, the first text repetition detection unit includes:

the text overlap ratio calculating subunit is used for respectively comparing the current text line with each text line of the target text to obtain the text overlap ratio of each text line of the target text and the current text line;

and the text repetition determination subunit is used for determining that the current text line is repeated with any text line if the text overlap ratio of any text line in the target text is higher than a set threshold value.

In one implementation of the embodiment of the present application, let the target text line represent any text line of the target text; the text overlap ratio calculating subunit includes:

The text line segmentation sub-unit is used for segmenting the current text line into a first character array and segmenting the target text line into a second character array;

the comparison matrix construction subunit is used for constructing a comparison matrix of the first character array and the second character array;

a common substring determining subunit, configured to determine a common substring of the current text line and the target text line according to the comparison matrix;

the text overlap ratio determining subunit is configured to calculate, according to the length of the common substring, the length of the first character array, and the length of the second character array, a text overlap ratio between the target text line and the current text line.

In one implementation of the embodiment of the present application, the common substring determination subunit includes:

and the public substring searching subunit is used for searching each element segment with continuous values from the comparison matrix to serve as a public substring.

In one implementation manner of the embodiment of the present application, the second text line selecting unit includes:

the first selecting subunit is configured to select a text line with a longer length from the current text line and any text line if the lengths of the current text line and any text line are different, and add the text line to the duplicate removal result queue;

And the second selecting subunit is used for adding the current text line to the duplicate-removal result queue if the length of the current text line is the same as that of any text line.

In another implementation manner of the embodiment of the present application, the text de-registration and merging module includes:

a third text line adding unit, configured to de-overlap the text scanned previously and add the result to the de-duplication result queue;

a third text line selecting unit, configured to sequentially select a current text line from the target text;

a second text repetition detection unit, configured to detect whether the current text line overlaps with the previously scanned text and any text line that is a result of the previous scanning is repeated;

a text line deleting unit, configured to delete a current text line in the target text if the current text line is de-overlapped from a previously scanned text and any text line that is a result is repeated;

and a fourth text line adding unit, configured to add the remaining text lines in the target text to the duplicate removal result queue if all text lines in the target text have been selected in order, determine the duplicate removal result queue as a duplicate removal result of the text scanned at the time, and otherwise, return to execute the step of sequentially selecting one current text line from the target text and the subsequent steps.

Embodiments of the present application also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements a text de-registration and method as described in any of the embodiments above.

Embodiments also provide a computer program product which, when run on an electronic device, causes the electronic device to perform the text de-registration and method as described in any of the embodiments above.

Fig. 6 is a schematic diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device 6 of this embodiment includes: a processor 60, a memory 61 and a computer program 62 stored in said memory 61 and executable on said processor 60. The processor 60, when executing the computer program 62, performs the steps of the various embodiments of text de-registering and method described above, such as steps 101 through 102 shown in fig. 1. Alternatively, the processor 60, when executing the computer program 62, performs the functions of the modules/units of the apparatus embodiments described above, such as the functions of the modules 501 to 502 shown in fig. 5.

The computer program 62 may be divided into one or more modules/units, which are stored in the memory 61 and executed by the processor 60 to complete the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments are used to describe the execution of the computer program 62 in the electronic device 6.

The processor 60 may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 61 may be an internal storage unit of the electronic device 6, such as a hard disk or a memory of the electronic device 6. The memory 61 may be an external storage device of the electronic device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the electronic device 6. Further, the memory 61 may also include both an internal storage unit and an external storage device of the electronic device 6. The memory 61 is used for storing the computer program and other programs and data required by the electronic device. The memory 61 may also be used for temporarily storing data that has been output or is to be output.

It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the foregoing embodiments, the descriptions of the embodiments are emphasized, and in part, not described or illustrated in any particular embodiment, reference is made to the related descriptions of other embodiments.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the embodiments of the present application.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each method embodiment described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the computer readable medium may include content that is subject to appropriate increases and decreases as required by jurisdictions in which such content is subject to legislation and patent practice, such as in certain jurisdictions in which such content is not included as electrical carrier signals and telecommunication signals.

The above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims

1. A text de-registering and method comprising:

and performing text de-overlapping treatment on the target text and the text de-overlapping result of the previous scanning to obtain the text de-overlapping result of the current scanning.

2. The method of claim 1, wherein said performing text de-registration and processing of said target text de-registration and results from said previously scanned text to obtain text de-registration and results from a current scan comprises:

if the current text line is repeated with any text line of the target text, selecting one text line from the current text line and any text line, adding the selected text line to a duplicate removal result queue, and deleting any text line in the target text;

if neither the current text line nor any text line of the target text is repeated, adding the current text line to the duplicate removal result queue;

if all text lines in the previous scanned text de-registration and result are sequentially selected, adding the rest text lines in the target text to the de-duplication result queue, determining the de-duplication result queue as the current scanned text de-registration and result, otherwise, returning to execute the step of sequentially selecting a current text line from the previous scanned text de-registration and result and the subsequent steps.

3. The method of claim 2, wherein the detecting whether the current text line is repeated with any text line of the target text comprises:

4. The method of claim 3, wherein a target text line is caused to represent any text line of the target text; the step of comparing the current text line with each text line of the target text to obtain the text line overlap ratio of each text line of the target text and the current text line, includes:

and calculating the text overlap ratio of the target text line and the current text line according to the length of the public substring, the length of the first character array and the length of the second character array.

5. The method of claim 4, wherein the determining the common substring of the current text line and the target text line from the alignment matrix comprises:

And searching each element segment with continuous numerical values from the comparison matrix to serve as the common substring.

6. The method according to any one of claims 2 to 5, wherein selecting a text line from the current text line and any one of the text lines, adding to a deduplication result queue, comprises:

if the lengths of the current text line and any text line are different, selecting a text line with a longer length from the current text line and any text line, and adding the text line to the duplicate removal result queue;

and if the lengths of the current text line and any text line are the same, adding the current text line to the duplicate removal result queue.

7. The method of claim 1, wherein said performing text de-registration and processing of said target text de-registration and results from said previously scanned text to obtain text de-registration and results from a current scan comprises:

selecting a current text line from the target text in sequence;

detecting whether the current text line is de-overlapped with the text scanned in the previous time and any text line of the result is repeated;

Deleting the current text line in the target text if the current text line is de-overlapped with the text scanned in the previous time and any text line of the result is repeated;

if all text lines in the target text are selected in sequence, adding the rest text lines in the target text to the duplicate removal result queue, determining the duplicate removal result queue as the duplicate removal result of the text scanned at the present time, and otherwise, returning to the step of selecting one current text line from the target text in sequence and the subsequent steps.

8. A text de-registering and device, comprising:

and the text de-overlapping module is used for performing text de-overlapping and processing on the target text and the text de-overlapping and result of the previous scanning to obtain the text de-overlapping and result of the current scanning.

9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the text de-registering and method of any of claims 1 to 7 when the computer program is executed.

10. A computer readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the text de-registration and method of any one of claims 1 to 7.