CN111310418A - Text extraction method and device

Info

Publication number: CN111310418A
Application number: CN202010115263.2A
Authority: CN
Inventors: 刘均; 周辉濂
Original assignee: Shenzhen Launch Technology Co Ltd
Current assignee: Shenzhen Launch Technology Co Ltd
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2020-06-19

Abstract

The embodiments of the present application provide a text extraction method and a text extraction device. The method comprises the following steps: acquiring a document, wherein the document comprises a plurality of texts to be extracted, distributed in rows and columns of the document, and other texts except the plurality of texts to be extracted; adding M space characters before the plurality of texts to be extracted, wherein M is greater than or equal to K, K represents the maximum column number of the document contents, and K is a fixed value; deleting all text contents in M columns starting from the first column in the document, so that only space characters are left before each text to be extracted; adding N space characters after the plurality of texts to be extracted, wherein N is greater than or equal to K; deleting all text contents in N columns starting from the (K+1)th column in the document, so that only space characters are left after each text to be extracted; and deleting all space characters in the document to obtain a document including only the plurality of texts to be extracted. By implementing the embodiments of the present application, the operation is simple, the extraction efficiency of the text to be extracted can be improved, and the applicability is strong.

Description

Text extraction method and device

Technical Field

The present application relates to the field of computers, and in particular, to a text extraction method and apparatus.

Background

In daily work and life, text extraction is often involved, for example, in ODX and OTX development, some different texts or variables in XML files such as ODX and OTX need to be extracted, and macro definition or judgment is performed on the extracted texts or variables; for another example, in daily office work, it is necessary to extract some text from office files and save the extracted text in a document for editing or viewing, etc.

At present, a commonly used method for extracting texts is to find out all texts to be extracted from an original document, and then store the texts to be extracted in another document in a copying and pasting manner to realize the extraction of the texts, but the method needs to manually find out all the texts to be extracted in the document one by one, and then copy and paste the texts one by one into a new document, and the steps are complicated, time-consuming and labor-consuming; still another method for extracting texts is implemented in a programming manner, and a program developer extracts all texts to be extracted from an original document through programming, and then edits the extracted texts, but the method needs a certain professional skill and is not suitable for general workers.

Disclosure of Invention

The embodiment of the application provides a text extraction method and a text extraction device, which can easily extract a text to be extracted from an original document, are simple to operate, can improve the efficiency of extracting the text to be extracted, and have high applicability.

In a first aspect, an embodiment of the present application provides a text extraction method, where the method includes: acquiring a document, wherein the document comprises a plurality of texts to be extracted distributed in rows and columns of the document and other texts except the plurality of texts to be extracted; the text to be extracted comprises variable tags and variables, columns in the document extend infinitely, and at most one text to be extracted exists in each row in the document; adding M space characters before the plurality of texts to be extracted, wherein M is greater than or equal to K, K represents the maximum column number of the document contents, and K is a fixed value; deleting all text contents in M columns starting from the first column in the document, so that only space characters are left before each text to be extracted; adding N space characters after the plurality of texts to be extracted, wherein N is greater than or equal to K; deleting all text contents in N columns starting from the (K+1)th column in the document, so that only space characters are left after each text to be extracted; and deleting all space characters in the document to obtain a document including only the plurality of texts to be extracted.

It can be seen that a document including a plurality of texts to be extracted and a plurality of texts not to be extracted is first obtained, wherein the texts not to be extracted are the texts other than the texts to be extracted, and each text to be extracted includes a variable tag and a variable. Next, M space characters are added before the texts to be extracted, where M is greater than or equal to the maximum column number K of the document contents, so that the columns where the texts to be extracted are located lie after both the columns occupied by text in rows that contain only text not to be extracted and the columns occupied by text not to be extracted that precedes a text to be extracted in its own row. The text contents of M columns of the document are then deleted starting from the first column, so that only space characters remain before the texts to be extracted. Then N space characters are added after the texts to be extracted, where N is greater than or equal to K, so that the columns where the remaining texts not to be extracted are located are all greater than the maximum column where the texts to be extracted are located. The text contents of N columns are then deleted starting from the (K+1)th column of the document, so that only space characters are left after the texts to be extracted. Finally, all space characters in the document are deleted to obtain a document including only the plurality of texts to be extracted.

According to the method, M space characters are added before the text to be extracted, so that a plurality of space characters exist between the text to be extracted and the preceding text not to be extracted, and the text not to be extracted that is located in the columns before the column where the text to be extracted is located is then deleted, so that only space characters are left in front of the text to be extracted. Similarly, N space characters are added after the text to be extracted, so that a plurality of space characters exist between the text to be extracted and the following text not to be extracted, and the text not to be extracted that is located in the columns behind the column where the text to be extracted is located is then deleted, so that only space characters are left behind the text to be extracted. Therefore, by implementing the embodiments of the present application, neither tedious copying and pasting nor professional programming is needed, the extraction efficiency can be improved, the operation is simple, and the method has high applicability.
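The six steps of the first aspect can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of the application: the function name `extract_texts`, the use of a regular expression `pattern` to identify a text to be extracted, and the choice to drop empty rows at the end are all assumptions introduced here.

```python
import re
from typing import Optional


def extract_texts(document: str, pattern: str, k: int,
                  m: Optional[int] = None, n: Optional[int] = None) -> str:
    """Sketch of the method; `pattern` is a hypothetical regex for one text to be extracted."""
    m = k if m is None else m
    n = k if n is None else n
    if m < k or n < k:
        raise ValueError("M and N must be greater than or equal to K")
    target = re.compile(pattern)
    rows = document.split("\n")
    # Add M spaces before the (at most one) text to be extracted in each row.
    rows = [target.sub(lambda mo: " " * m + mo.group(0), row, count=1) for row in rows]
    # Delete columns 1..M of every row.
    rows = [row[m:] for row in rows]
    # Add N spaces after the text to be extracted in each row.
    rows = [target.sub(lambda mo: mo.group(0) + " " * n, row, count=1) for row in rows]
    # Delete columns K+1..K+N of every row.
    rows = [row[:k] + row[k + n:] for row in rows]
    # Delete all space characters, then drop the rows that became empty.
    rows = [row.replace(" ", "") for row in rows]
    return "\n".join(row for row in rows if row)
```

In this sketch K would be taken as the length of the longest row of the input, and the text to be extracted is assumed to contain no space characters of its own, since the last step removes every space.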

Based on the first aspect, in a possible implementation manner, after the obtaining the document including only the plurality of texts to be extracted, the method further includes: deleting the variable tags in the texts to be extracted to obtain the document only comprising a plurality of variables.

It can be understood that the text to be extracted includes a variable tag and a variable, and after the document including only a plurality of texts to be extracted is obtained, a document including only a plurality of variables can be obtained by deleting the variable tags in the plurality of texts to be extracted. For example, if the text to be extracted is "score-98", the variable tag is "score-" and the variable is "98"; the variable tag can be deleted by searching for "score-" and replacing it with nothing. Of course, the variable tags in the texts to be extracted may also be deleted in other manners.
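Deleting the variable tags amounts to a search-and-replace with an empty replacement. A minimal Python sketch follows; the helper name `strip_variable_tags` and the tag patterns in the usage note are assumptions chosen for illustration.

```python
import re
from typing import List


def strip_variable_tags(texts: List[str], tag_pattern: str) -> List[str]:
    """Remove the first occurrence of the (hypothetical) tag pattern from each extracted text."""
    tag = re.compile(tag_pattern)
    return [tag.sub("", text, count=1) for text in texts]
```

For instance, `strip_variable_tags(["name=Jhon", "name=Mary"], r"^name=")` leaves only the variables `Jhon` and `Mary`.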

Based on the first aspect, in a possible implementation manner, after the obtaining the document including only the plurality of texts to be extracted, the method further includes: and deleting repeated items in the texts to be extracted to obtain a document comprising a plurality of different texts to be extracted.

It can be understood that, in the obtained document including only a plurality of texts to be extracted, there may be repeated items in the plurality of texts to be extracted, and the document including a plurality of different texts to be extracted can be obtained by deleting the repeated items in the plurality of texts to be extracted.
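Removing repeated items while keeping the first occurrence of each text in its original order can be sketched as follows. The helper name `drop_duplicates` is hypothetical, and keeping the first occurrence is an assumption; the application does not say which copy is retained.

```python
from typing import List


def drop_duplicates(texts: List[str]) -> List[str]:
    """Keep the first occurrence of each text; dict keys preserve insertion order."""
    return list(dict.fromkeys(texts))
```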

Based on the first aspect, in a possible implementation, the method further includes: and editing the plurality of variables in the document comprising the plurality of variables to realize the extension operation of each variable.

It can be understood that the document includes only a plurality of variables, and the extension of each variable can be realized by editing each variable, for example, by extracting the variables in the XML document to obtain a document including only a plurality of variables, macro definition can be performed on the variables in the document, and the variables can also be judged. For convenience in operation, when each variable is edited, a plurality of variables can be copied and pasted into an excel tool, then columns are added in front of the variables, and contents are edited in the columns to realize the expansion of the variables.

Based on the first aspect, in a possible implementation manner, before adding M space characters before the plurality of texts to be extracted, the method further includes: receiving a first instruction; the first instruction is to indicate a first number M; before adding N space characters after the plurality of texts to be extracted, the method further comprises: receiving a second instruction; the second instruction is for indicating a second number N.

It can be understood that before a plurality of texts to be extracted are preceded by space characters, a first instruction of a user needs to be received, and the first instruction is used for indicating the number of the space characters to be added, namely a first number M; before adding the space characters after a plurality of texts to be extracted, a second instruction of the user needs to be received, wherein the second instruction is used for indicating the number of the space characters to be added, namely the second number N, so that the flexibility and the pertinence are realized in the text extraction process for different texts.

In a second aspect, an embodiment of the present application provides a text extraction apparatus, where the apparatus includes:

an acquisition unit, used for acquiring a document, wherein the document comprises a plurality of texts to be extracted distributed in rows and columns of the document and other texts except the plurality of texts to be extracted; the text to be extracted comprises variable tags and variables, columns in the document extend infinitely, and at most one text to be extracted exists in each row in the document;

the adding unit is used for adding M space characters before the texts to be extracted, wherein M is larger than or equal to K, K represents the maximum column number of the document contents, and K is a fixed value;

the deleting unit is used for deleting all text contents in M columns starting from the first column in the document so that only space characters are left before each text to be extracted;

the adding unit is further used for adding N space characters after the texts to be extracted, wherein N is larger than or equal to K;

the deleting unit is further configured to delete all text contents in N columns starting from the (K+1)th column in the document, so that only space characters are left after each text to be extracted;

the deleting unit is further configured to delete all space characters in the document to obtain a document including only the plurality of texts to be extracted.

Based on the second aspect, in a possible implementation manner, after the obtaining of the document including only the plurality of texts to be extracted, the deleting unit is further configured to: deleting the variable tags in the texts to be extracted to obtain the document only comprising a plurality of variables.

Based on the second aspect, in a possible implementation manner, after the obtaining of the document including only the plurality of texts to be extracted, the deleting unit is further configured to: and deleting repeated items in the texts to be extracted to obtain a document comprising a plurality of different texts to be extracted.

Based on the second aspect, in a possible embodiment, the apparatus further comprises an extension unit; the extension unit is used for editing each text to be extracted in the document comprising the different texts to be extracted to realize the extension operation of each text to be extracted.

Based on the second aspect, in a possible implementation manner, before adding M space characters before the plurality of texts to be extracted, the deleting unit is further configured to receive a first instruction; the first instruction is to indicate a first number M; before adding N space characters after the plurality of texts to be extracted, the deleting unit is further configured to receive a second instruction; the second instruction is for indicating a second number N.

Each functional module in the apparatus provided in the embodiment of the present application is specifically configured to implement the method described in the first aspect.

In a third aspect, an embodiment of the present application provides a text extraction apparatus, including a processor, a communication interface, and a memory; the memory is used for storing instructions, the processor is used for executing the instructions, and the communication interface is used for receiving or sending document or text information; wherein the processor executes the instructions to perform the method as described in the first aspect or any specific implementation manner of the first aspect.

In a fourth aspect, embodiments of the present application provide a non-volatile storage medium for storing program instructions, which, when applied to a text extraction apparatus, can be used to implement the method described in the first aspect.

In a fifth aspect, the present application provides a computer program product, which includes program instructions, and when the computer program product is executed by a text extraction apparatus, the apparatus executes the method of the first aspect. The computer program product may be a software installation package, which may be downloaded and executed on a text extraction device to implement the method of the first aspect in case it is desired to use the method provided by any of the possible designs of the first aspect described above.

It can be seen that, in the present application, after a document is obtained, M space characters (M is greater than or equal to the maximum column number K of the document contents) are first added before each text to be extracted, so that the column where the text to be extracted is located lies after both the columns occupied by text in the rows that contain only text not to be extracted and the columns occupied by the text not to be extracted that precedes the text to be extracted in its own row. Then, M columns in the document are deleted starting from the first column, so that only space characters remain before the text to be extracted. Similarly, N space characters (N is greater than or equal to K) are added after the text to be extracted, so that the columns where the text not to be extracted is located lie after the maximum column where the text to be extracted is located. Then, all text contents in N columns of the document are deleted starting from the (K+1)th column, so that only space characters remain behind the text to be extracted. Finally, a document including only the text to be extracted can be obtained by deleting the space characters located before and after the text to be extracted. Of course, if repeated items exist among the plurality of texts to be extracted, a document comprising different texts to be extracted can be obtained by deleting the repeated items; since the text to be extracted comprises variable tags and variables, the variable tags in the plurality of texts to be extracted can be deleted to obtain a document comprising only the variables; and the variables may be edited to expand them. Therefore, the embodiments of the present application do not require a programming background, are simple to operate, can improve the extraction efficiency, and have strong applicability.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic diagram of a text extraction method provided in an embodiment of the present application;

Fig. 2 is a schematic diagram of a specific application scenario of a text extraction method provided in the present application;

Fig. 3 is a schematic diagram of a specific application scenario of another text extraction method provided in the present application;

Fig. 4 is a schematic diagram of a text extraction apparatus according to an embodiment of the present application;

Fig. 5 is a schematic structural diagram of another text extraction apparatus according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

It is to be understood that the terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only, and is not intended to be limiting of the application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

It is noted that, as used in this specification and the appended claims, the term "comprises" and any variations thereof are intended to cover non-exclusive inclusions. For example, a system, article, or apparatus that comprises a list of elements/components is not limited to only those elements/components but may alternatively include other elements/components not expressly listed or inherent to such system, article, or apparatus.

It should also be understood that the term "if" may be interpreted as "when", "upon", "in response to determining", "in response to detecting", or "in the case of …", depending on the context.

It should also be noted that the terms "first," "second," "third," "fourth," and the like in the description and in the claims, are used for distinguishing between different objects and not necessarily for describing a particular sequential or chronological order.

Referring to fig. 1, fig. 1 is a schematic diagram of a text extraction method provided in an embodiment of the present application, where the method includes, but is not limited to, the following steps:

101. Acquiring a document, wherein the document comprises a plurality of texts to be extracted and other texts except the plurality of texts to be extracted, which are distributed in rows and columns of the document; the text to be extracted comprises variable tags and variables, columns in the document extend infinitely, and at most one text to be extracted exists in each row in the document.

In this application, a document is obtained that has many rows and columns, and the columns extend infinitely. The document comprises a plurality of texts to be extracted and a plurality of texts not to be extracted, and at most one text to be extracted exists in each row. That is, a row of the document may include only a text to be extracted, or only texts not to be extracted, or both a text to be extracted and texts not to be extracted, in which case the text to be extracted and the texts not to be extracted are alternately distributed. The texts not to be extracted are the texts other than the texts to be extracted. A text to be extracted comprises a variable tag and a variable, and the variable tag is used for indicating the variable. For example, in the text to be extracted "name=Jhon", the variable tag is "name=" and the variable is "Jhon"; in the text to be extracted "fruit: orange", the variable tag is "fruit: " and the variable is "orange"; in the text to be extracted "result-98", the variable tag is "result-" and the variable is "98"; and so on.

102. M space characters are added before a plurality of texts to be extracted, wherein M is larger than or equal to K, K represents the maximum column number of the document contents, and K is a fixed value.

M space characters are added before the plurality of texts to be extracted, where M is greater than or equal to the maximum column number K of the document contents, so that the column numbers of the texts to be extracted in the document are all greater than K and a plurality of space characters exist between each text to be extracted and the text not to be extracted that precedes it.

It should be noted that before adding M space characters before a plurality of texts to be extracted, a first instruction of a user needs to be received; the first instruction is for indicating a first number M.

In one embodiment, a number of space characters less than or equal to K is added before the plurality of texts to be extracted, so that the column numbers where the plurality of texts to be extracted are located in the document are all greater than a first maximum value, wherein the first maximum value is the larger of the column number where text not to be extracted is located in the non-to-be-extracted text rows and the column number where text not to be extracted is distributed before the text to be extracted in the to-be-extracted text rows. Of course, before the space characters are added before the plurality of texts to be extracted, an instruction of the user needs to be received, wherein the instruction is used for indicating the number of space characters to be added.

The texts to be extracted are distributed in certain rows of the document, and each row has at most one text to be extracted. The same number of space characters is added before each of the plurality of texts to be extracted, so that the column number where each text to be extracted is located is greater than a first maximum value, where the first maximum value is the larger of the column number reached by text in the non-to-be-extracted text rows and the column number reached by the text not to be extracted that is distributed before the text to be extracted in the to-be-extracted text rows. That is to say, after the space characters are added before the plurality of texts to be extracted, the columns where the texts to be extracted are located lie not only behind the columns where the texts not to be extracted are located in the non-to-be-extracted text rows, but also behind the columns where the texts not to be extracted that precede the texts to be extracted are located in the to-be-extracted text rows, and a large number of space characters exist between the text to be extracted in each row and the text not to be extracted that precedes it in that row, which facilitates the subsequent operations. Here, a non-to-be-extracted text row is a row of the document that includes only text not to be extracted, and a to-be-extracted text row is a row of the document that includes a text to be extracted; a to-be-extracted text row may include only the text to be extracted, or both the text to be extracted and text not to be extracted.

In the present application, the space characters can be added before the plurality of texts to be extracted by using a search-and-replace function in combination with a regular expression.
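An editor-style search-and-replace of this kind can be imitated with Python's `re.sub`, using a replacement string that puts the space characters in front of the whole match (`\g<0>`). The helper name `pad_before` and the pattern in the usage note are assumptions made for illustration; the application does not prescribe a particular tool or expression.

```python
import re


def pad_before(document: str, pattern: str, m: int) -> str:
    """Insert m space characters in front of every match of the (hypothetical) pattern."""
    # "\g<0>" in the replacement string stands for the entire matched text.
    return re.sub(pattern, " " * m + r"\g<0>", document)
```

For example, `pad_before("x name=Jhon y", r"name=\w+", 3)` pushes `name=Jhon` three columns to the right and leaves the rest of the row unchanged.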

It should be noted that, if the document has no row that includes only text not to be extracted, that is, no non-to-be-extracted text row, the column number of the non-to-be-extracted text rows is taken to be 0 by default; if no text not to be extracted exists before the text to be extracted in any to-be-extracted text row, the column number of the text not to be extracted distributed before the text to be extracted is taken to be 0 by default. If both column numbers are 0, the first maximum value is 0 and the text to be extracted in each row of the document is located in front of any text not to be extracted; in this case, no space characters need to be added before the plurality of texts to be extracted.
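The first maximum value described above can be computed mechanically. The following Python sketch is one possible reading of the definition, with assumptions stated plainly: the helper name `first_maximum` is hypothetical, a text to be extracted is identified by a regular expression, and trailing space characters are not counted as text.

```python
import re
from typing import List


def first_maximum(rows: List[str], pattern: str) -> int:
    """Largest column reached by text not to be extracted that must be cleared in step 103."""
    target = re.compile(pattern)
    best = 0
    for row in rows:
        mo = target.search(row)
        if mo is None:
            # Row with only text not to be extracted: its last non-space column counts.
            best = max(best, len(row.rstrip()))
        else:
            # Row with a text to be extracted: only the text in front of it counts.
            best = max(best, len(row[:mo.start()].rstrip()))
    return best
```

A result of 0 corresponds to the case in which no space characters need to be added before the texts to be extracted.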

103. Deleting all text content in the M columns in the document from the first column so that only space characters are left before each text to be extracted.

A plurality of space characters exist between the text to be extracted in each row and the text not to be extracted that is distributed before it in that row. The text content of M columns in the document is deleted starting from the first column, so that the text not to be extracted located before the text to be extracted is deleted and only space characters are left before the text to be extracted in each row of the document.

When the text not to be extracted that is distributed before the texts to be extracted is deleted, the deleted columns are the same in every row, namely the M columns starting from the first column. For example, if columns 1, 2 and 3 are deleted from the first row of the document, three columns in total, then the same three columns, namely columns 1, 2 and 3, are deleted from every other row, so that whole columns are deleted across all rows of the document.
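Deleting the same whole columns from every row can be sketched with string slicing. The helper name `delete_columns` and the 1-based column numbering are assumptions for illustration; rows shorter than the deleted range simply lose whatever characters fall inside it.

```python
from typing import List


def delete_columns(rows: List[str], first_column: int, count: int) -> List[str]:
    """Delete `count` whole columns, starting at the 1-based `first_column`, from every row."""
    start = first_column - 1
    return [row[:start] + row[start + count:] for row in rows]
```

With `first_column=1` and `count=M` this corresponds to step 103; the same helper with `first_column=K+1` and `count=N` would correspond to the later deletion of the trailing columns.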

It should be noted that after the M columns in the document are deleted, only space characters are left before the text to be extracted in each row, and each original non-to-be-extracted text row becomes an empty row. Therefore, the document no longer contains any non-to-be-extracted text row, and each to-be-extracted text row includes either only the text to be extracted and space characters, or the text to be extracted together with text not to be extracted distributed after it.

In one embodiment of the present application, when non-to-be-extracted text rows exist in the document, deleting the M columns leaves empty rows in the resulting document, which otherwise contains only to-be-extracted text rows. These empty rows can be deleted, for example by using a search-and-replace function in combination with a regular expression, so that no empty row remains in the document.
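One possible regular expression for removing empty rows is shown below as a Python sketch. The helper name `delete_empty_rows` is hypothetical, and treating rows that contain only spaces or tabs as empty is an assumption that goes slightly beyond the text.

```python
import re


def delete_empty_rows(document: str) -> str:
    """Remove rows that are empty or contain only spaces or tabs."""
    # With re.MULTILINE, "^" matches at the start of every row.
    return re.sub(r"^[ \t]*(?:\n|\Z)", "", document, flags=re.MULTILINE)
```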

104. Adding N space characters after a plurality of texts to be extracted, wherein N is larger than or equal to K.

N space characters are added after the plurality of texts to be extracted (N is greater than or equal to K, and the value of N may or may not be equal to the value of M), so that the column numbers where the texts not to be extracted are located are all greater than the maximum column number where the texts to be extracted are located. At this point only to-be-extracted text rows exist in the document. After N space characters are added after the text to be extracted in each such row, the column numbers where the text not to be extracted is located in each row are greater than the maximum column number where the texts to be extracted are located, that is, greater than the column number where any text to be extracted is located, and a plurality of space characters exist between the text to be extracted and the text not to be extracted that follows it in each row. It should be noted that before the N space characters are added after the plurality of texts to be extracted, a second instruction of the user needs to be received; the second instruction is used for indicating a second number N.

105. And deleting all text contents in the N columns from the K +1 th column in the document, so that only space characters are left after each text to be extracted.

Columns in the document are deleted starting from the (K+1)th column, which is at least one column after the maximum column where the texts to be extracted are located, so that only space characters remain after the text to be extracted in each row. In other words, the text not to be extracted that is distributed after the text to be extracted in each row is deleted, and a document whose only remaining text is the texts to be extracted is obtained.

When the text not to be extracted that is distributed after the texts to be extracted is deleted, the deleted columns are the same in every row, and the total number of deleted columns is also the same. For example, if columns 6, 7 and 8 are deleted from the third row of the document, three columns in total, then the same three columns, namely columns 6, 7 and 8, are deleted from every other row, so that whole columns are deleted across all rows of the document.

It should be noted that when the text not to be extracted that is distributed after the text to be extracted is deleted, either only that text is deleted, or that text and a part of the space characters added in step 104 are deleted. Space characters therefore still exist after the text to be extracted, and only space characters remain after it.
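The column arithmetic behind steps 104 and 105 can be checked with a short derivation. It is a sketch under stated assumptions: columns are numbered from 1, and for a given row the symbols $e$ (the last column of the text to be extracted) and $L$ (the last column holding any text) are introduced here for illustration and do not appear in the application.

```latex
% Before step 104, every row fits within the first K columns:
1 \le e \le L \le K .
% Step 104 shifts the text that follows the text to be extracted right by N columns,
% so that it occupies the columns
e + 1 + N,\ \dots,\ L + N .
% Since N \ge K and e \ge 1:
e + 1 + N \;\ge\; K + 2 \;>\; K, \qquad L + N \;\le\; K + N .
% Hence the shifted text lies entirely within columns K+1, \dots, K+N,
% which are exactly the N columns deleted in step 105,
% while the text to be extracted (columns \le e \le K) is untouched.
```

Whenever $e < K$, the deleted range also contains some of the added space characters, which matches the remark that a part of the space characters may be deleted together with the text not to be extracted.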

In this application, in an embodiment, if there is no empty line in the document including only the text line to be extracted, after deleting a plurality of columns from at least one column after the text to be extracted, there is no empty line in the obtained document.

It should be noted that the operation of deleting the empty line in the document may be completed in any step after only space characters remain before the text to be extracted in each line, such as in the foregoing step 103, or in this step, or in the subsequent step 106, which is not limited herein.

106. All space characters in the document are deleted to obtain a document including only a plurality of texts to be extracted.

After the non-to-be-extracted text distributed before the texts to be extracted and the non-to-be-extracted text distributed after them have been deleted, a plurality of space characters remain in the document before and after each text to be extracted. Deleting all of these space characters yields a document including only the plurality of texts to be extracted, in which the texts to be extracted are distributed regularly: each line of the document contains only one text to be extracted, and each text to be extracted is located at the beginning of its line (that is, the characters of the text to be extracted are distributed in sequence starting from the first column of the line). The space characters can be deleted through a search-and-replace function, optionally in combination with a regular expression.
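Steps 102 to 106 can be sketched in Python as follows. This is a minimal sketch rather than the implementation of the application, which relies on the search-and-replace and column-mode functions of a document editor. The function name `extract_targets`, the regular expression, and the sample lines are assumptions made for illustration. The sketch also assumes that the pattern matches at most once per line and that the texts to be extracted contain no space characters, since step 106 deletes every space character.

```python
import re


def extract_targets(lines, pattern, m=None, n=None):
    """Apply steps 102-106 to `lines`; `pattern` must match at most once per line."""
    k = max((len(line) for line in lines), default=0)  # K: widest line, fixed up front
    m = k if m is None else m
    n = k if n is None else n
    if m < k or n < k:
        raise ValueError("M and N must both be at least K")
    target = re.compile(pattern)

    # Step 102: insert M spaces before each target, pushing it past column K.
    lines = [target.sub(lambda t: " " * m + t.group(0), ln, count=1) for ln in lines]
    # Step 103: delete columns 1..M, removing all text that preceded a target.
    lines = [ln[m:] for ln in lines]
    # Step 104: insert N spaces after each target, pushing the rest past column K.
    lines = [target.sub(lambda t: t.group(0) + " " * n, ln, count=1) for ln in lines]
    # Step 105: delete columns K+1..K+N, removing all text that followed a target.
    lines = [ln[:k] + ln[k + n:] for ln in lines]
    # Step 106: delete the remaining space characters.
    return [ln.replace(" ", "") for ln in lines]


# Hypothetical sample document; the second line contains no target.
doc = [
    '<xs:element name="speed" type="xs:int"/>',
    '<xs:annotation>',
    '  <xs:attribute name="rpm" use="required"/>',
]
print(extract_targets(doc, r'name="[^"]*"'))
# ['name="speed"', '', 'name="rpm"']
```

Lines without a text to be extracted end up empty, which matches the empty lines that the description removes in a separate operation.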

In the application, after the document including only a plurality of texts to be extracted is obtained, repeated items may exist in the plurality of texts to be extracted, and the document including a plurality of different texts to be extracted can be obtained by deleting the repeated items in the plurality of texts to be extracted.

In the present application, each text to be extracted includes a variable tag and a variable. After the document including only the plurality of texts to be extracted is obtained, the variable tags are first deleted to obtain a document including a plurality of variables and space characters; the space characters in the document are then deleted to obtain a document including the plurality of variables. Of course, there may be duplicate entries among the variables, and if necessary the duplicate entries may be deleted to obtain different variables.

In addition, in the method and the device, after the document comprising the plurality of different texts to be extracted is obtained, the plurality of different texts to be extracted can be edited, so that the expansion operation of each text to be extracted is realized; and different variables in the obtained variable documents including different variables can be edited, so that the expansion of each variable is realized. For example, macro definition may be performed on each variable, or calculation may be performed on each variable (data), or the like.

It should be noted that, in the present application, operations of deleting a plurality of columns may be completed at the same time, and operations of adding a plurality of space characters may also be completed at the same time. The order of the steps in this application is not fixed: the method may be completed in the order described in this application, or in the order of steps 101, 104, 105, 102, 103, and 106, or in the order of steps 101, 102, 104, 103, 105, and 106, or in the order of steps 101, 104, 102, 103, 105, and 106, and so on, or the order may be adjusted according to actual operating conditions. The present application does not limit the operation sequence.
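The interchangeability of the step orders listed above can be checked with a small Python sketch. The helper names, the target pattern, and the sample document are all hypothetical, and M and N are both set equal to K for simplicity. K is measured once on the original document, consistent with K being a fixed value.

```python
import re

TARGET = re.compile(r'name="[^"]*"')  # hypothetical pattern for the text to be extracted


def pad_before(lines, m):    # step 102
    return [TARGET.sub(lambda t: " " * m + t.group(0), ln, count=1) for ln in lines]


def cut_front(lines, m):     # step 103
    return [ln[m:] for ln in lines]


def pad_after(lines, n):     # step 104
    return [TARGET.sub(lambda t: t.group(0) + " " * n, ln, count=1) for ln in lines]


def cut_back(lines, k, n):   # step 105
    return [ln[:k] + ln[k + n:] for ln in lines]


def strip_spaces(lines):     # step 106
    return [ln.replace(" ", "") for ln in lines]


def run(lines, order):
    """Run the numbered steps in the given order, with M = N = K."""
    k = max(len(ln) for ln in lines)  # K is fixed before any step modifies the document
    steps = {
        102: lambda d: pad_before(d, k),
        103: lambda d: cut_front(d, k),
        104: lambda d: pad_after(d, k),
        105: lambda d: cut_back(d, k, k),
        106: strip_spaces,
    }
    for step in order:
        lines = steps[step](lines)
    return lines


doc = ['<a name="x" id="1"/>', '<!-- note -->', 'see name="y" here']
orders = [
    (102, 103, 104, 105, 106),
    (104, 105, 102, 103, 106),
    (102, 104, 103, 105, 106),
    (104, 102, 103, 105, 106),
]
results = [run(doc, order) for order in orders]
print(results[0])  # ['name="x"', '', 'name="y"']
```

For this sample, all four orders produce the same result; the sketch does not prove the property in general.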

It should be further noted that the present application is described by taking an example that there is at most one text to be extracted in each line of the document. If there are multiple texts to be extracted in each line, the multiple texts to be extracted in the document can also be extracted by operations similar to the above operations, such as adding multiple space characters in the same number, deleting multiple columns, deleting multiple space characters, deleting empty lines, and the like.

The following description is given by taking an example that at most two texts to be extracted exist in each line, and roughly includes, but is not limited to, the following steps:

1) acquiring a document, wherein the document comprises a plurality of texts to be extracted and a plurality of texts which are not to be extracted, and at most two texts to be extracted exist in each line;

2) adding a plurality of space characters before the plurality of texts to be extracted, wherein the space characters are used to make the column in which the first text to be extracted of each line is located greater than a first maximum value, the first maximum value being the larger of the maximum column occupied by non-to-be-extracted text in the non-to-be-extracted text lines and the maximum column occupied by non-to-be-extracted text distributed before the first text to be extracted in the to-be-extracted text lines; a non-to-be-extracted text line is a line of the document that contains only non-to-be-extracted text, and a to-be-extracted text line is a line of the document that contains text to be extracted;

3) receiving a third instruction of the user, and deleting a plurality of columns in the document from the first column so that only space characters are left before the first text to be extracted of each line;

it should be noted that, after the above three steps, in the obtained document only space characters exist before the first text to be extracted, with no non-to-be-extracted text before it, and the document no longer contains any non-to-be-extracted text line (the deletion in step 3) removes the content of those lines, so that each of them becomes a blank line);

4) adding a plurality of space characters after the first texts to be extracted, wherein the space characters are used to make the columns occupied by the first texts to be extracted in the document smaller than the minimum column occupied by the non-to-be-extracted texts;

it should be noted that, in this case, many space characters exist between the first text to be extracted and the non-to-be-extracted text located after it;

5) adding a plurality of space characters before the plurality of second texts to be extracted, wherein the space characters are used to make the column in which the second text to be extracted of each line is located greater than a second maximum value, the second maximum value being the maximum column occupied by any non-to-be-extracted text located after the first text to be extracted;

it should be noted that this step places many space characters between the second text to be extracted and the non-to-be-extracted text before it; from 4) and 5) it follows that, at this point, a plurality of space characters exist both between the second text to be extracted and the non-to-be-extracted text before it, and between the first text to be extracted and that same non-to-be-extracted text;

6) receiving a fourth instruction of the user, and deleting a plurality of columns starting from at least one column after the first text to be extracted, so that no non-to-be-extracted text exists before the second text to be extracted;

7) adding a plurality of space characters after the second texts to be extracted, wherein the space characters are used to make the columns occupied by the second texts to be extracted in the document smaller than the minimum column occupied by the remaining non-to-be-extracted texts;

8) receiving a fifth instruction of the user, and deleting a plurality of columns in the document from at least one column behind the column of the second text to be extracted so that only space characters are left behind the second text to be extracted in each line;

9) all space characters in the document are deleted to obtain a document including only a plurality of texts to be extracted.

Of course, after the document including only a plurality of texts to be extracted is obtained, the repeated items among the texts to be extracted may also be deleted to obtain a plurality of different texts to be extracted, the variable tags in the texts to be extracted may also be deleted to obtain a document including only a plurality of variables, and the variables may also be edited to extend them.

If three or more texts to be extracted exist in each line of the document, the principle and operation are similar, and for the simplicity of the description, the details are not repeated here.

It can be seen that, in the present application, after a document is obtained, M space characters (M being greater than or equal to the maximum column number K of the document contents) are first added before each text to be extracted, so that each text to be extracted lies after both the columns occupied by non-to-be-extracted text in the non-to-be-extracted text lines and the columns occupied by non-to-be-extracted text located before the text to be extracted in the to-be-extracted text lines. All text contents in the M columns starting from the first column of the document are then deleted, so that only space characters remain before each text to be extracted. Similarly, N space characters are added after each text to be extracted, so that all non-to-be-extracted text lies after the maximum column occupied by any text to be extracted; all text contents in the N columns starting from the (K+1)-th column of the document are then deleted, so that only space characters remain after each text to be extracted. Finally, a document including only the texts to be extracted is obtained by deleting the space characters located before and after the texts to be extracted. Of course, if repeated items exist among the texts to be extracted, a document including different texts to be extracted can be obtained by deleting the repeated items; since each text to be extracted includes a variable tag and a variable, the variable tags can be deleted to obtain a document including only the variables; and the variables may be edited to extend them. Therefore, the embodiments of the application require no programming background, are simple to operate, improve the efficiency of extracting the text to be extracted, and have strong applicability.

To make the present application clearer, the method is described below with reference to two specific application examples.

In one application embodiment, an xsd document includes a plurality of texts to be extracted, namely 'name="xxx"' (where xxx indicates that any characters may appear between the quotation marks), and a plurality of non-to-be-extracted texts, namely all characters other than the texts to be extracted. Referring to the document shown in (1) in fig. 2, the texts to be extracted 'name="xxx"' now need to be extracted from the document. First, the document can be acquired (opened) by a document editor.

Then, enough space characters are added before each text to be extracted 'name="xxx"' so that every text to be extracted lies after both the columns occupied by the non-to-be-extracted text in the non-to-be-extracted text lines and the columns occupied by the non-to-be-extracted text distributed before the text to be extracted in the to-be-extracted text lines, which yields the document shown in (2) in fig. 2. The same number of space characters can be added before every text to be extracted by searching for 'name="' and replacing it with 'name="' preceded by space characters (the number of space characters satisfying the above condition), or the space characters can be added by means of a regular expression.

Next, a plurality of columns before the texts to be extracted are deleted from the document, that is, the non-to-be-extracted text before each text to be extracted 'name="xxx"' is deleted, so that only (undeleted) space characters are left before the text to be extracted in each line. After the columns are deleted, each original non-to-be-extracted text line becomes an empty line, as shown in (3) in fig. 2. The deletion of the columns can be completed through the column deletion function in column mode.

Then, in order to obtain the texts to be extracted 'name="xxx"', the non-to-be-extracted text located after each text to be extracted must also be deleted. Before this deletion, a plurality of space characters are added after each text to be extracted, so that all non-to-be-extracted text in the document lies after the columns occupied by the texts to be extracted, as shown in (4) in fig. 2. The space characters can be added by search and replace or by means of a regular expression.

Then, starting from any column that holds a space character located after the texts to be extracted and before the non-to-be-extracted text, the non-to-be-extracted text in the document is deleted so that only space characters remain after each text to be extracted, as shown in (5) in fig. 2. This deletion can likewise be completed through the column deletion function in column mode.

Finally, all space characters before and after the texts to be extracted 'name="xxx"' are deleted to obtain the document shown in (6) in fig. 2. It should be noted that, after all space characters are deleted, each text to be extracted starts at the start position (first column) of its line. The space characters can be deleted by searching for a space character and replacing it with nothing.

It should be noted that, since there are empty lines in the document, they can be removed by means of a regular expression, for example by searching for "^r^n^r^n" and replacing it with "^r^n"; deleting all empty lines in this way yields a document that includes only the texts to be extracted and has no empty lines, as shown in (7) in fig. 2. It can be seen that there are duplicate items in this document. The duplicate items may be deleted to obtain a document including different texts to be extracted, as shown in (8) in fig. 2. A duplicate-removal function may be used for this purpose; for example, the texts to be extracted may be pasted into an Excel sheet and the duplicate items deleted with the "delete duplicate item" function of the Excel tool.
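As an alternative illustration of these two clean-up operations, the following Python sketch removes empty lines and then removes duplicate items while keeping the first occurrence of each. The function names and sample data are hypothetical; the application itself uses editor search-and-replace and the Excel duplicate-removal function.

```python
def drop_blank_lines(lines):
    """Keep only lines that contain at least one non-whitespace character."""
    return [ln for ln in lines if ln.strip()]


def drop_duplicates(lines):
    """Remove repeated items, keeping the first occurrence and the original order."""
    # dict preserves insertion order (Python 3.7+), so this is an ordered de-duplication.
    return list(dict.fromkeys(lines))


# Hypothetical extraction result containing empty lines and a repeated item.
extracted = ['name="speed"', '', 'name="rpm"', '', 'name="speed"']
unique = drop_duplicates(drop_blank_lines(extracted))
print(unique)  # ['name="speed"', 'name="rpm"']
```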

In this embodiment, the text to be extracted 'name="xxx"' includes a variable tag 'name=' and a variable "xxx", where xxx indicates that any characters may appear between the quotation marks. If the variables "xxx" in the document need to be extracted, the variable tags 'name=' in the document shown in (8) in fig. 2 can be deleted by searching for 'name=' and replacing it with nothing, or by using a regular expression, to obtain the document shown in (9) in fig. 2. In this embodiment, the variables in the document shown in (9) in fig. 2 may also be edited to implement an extension operation. For example, to perform macro definition on the variables, the variables may be copied into a table of the Excel tool, and columns may be added before the variables and their contents edited to complete the macro definitions, as shown in (10) in fig. 2.
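The tag deletion and a possible extension operation can be sketched in Python as follows. The helper names are hypothetical, and the C-style `#define` output format with sequential values is purely an assumption for illustration: the application does not specify the form of the macro definitions, which it builds by editing columns in Excel.

```python
import re


def strip_tag(lines, tag="name="):
    """Delete the variable tag at the start of each extracted text."""
    return [ln[len(tag):] if ln.startswith(tag) else ln for ln in lines]


def to_macros(variables, prefix="VAR_"):
    """Build one macro line per variable (hypothetical C-style format)."""
    macros = []
    for index, var in enumerate(variables):
        # Drop the surrounding quotation marks and make a valid identifier.
        ident = re.sub(r"\W", "_", var.strip('"')).upper()
        macros.append(f"#define {prefix}{ident} {index}")
    return macros


variables = strip_tag(['name="speed"', 'name="rpm"'])
print(variables)             # ['"speed"', '"rpm"']
print(to_macros(variables))  # ['#define VAR_SPEED 0', '#define VAR_RPM 1']
```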

In yet another application embodiment, an office document includes a plurality of texts to be extracted, namely "RMB xx yuan" (where xx may be any characters), and a plurality of non-to-be-extracted texts, namely all characters other than the texts to be extracted. Referring to the document shown in (1) in FIG. 3, the texts to be extracted "RMB xx yuan" now need to be extracted from the document. First, the document can be acquired (opened) by a document editor.

Then, enough space characters are added before each text to be extracted "RMB xx yuan" so that every text to be extracted lies after both the columns occupied by the non-to-be-extracted text in the non-to-be-extracted text lines and the columns occupied by the non-to-be-extracted text distributed before the text to be extracted in the to-be-extracted text lines, which yields the document shown in (2) in FIG. 3. This can be done by searching for "RMB" and replacing it with "RMB" preceded by space characters, the number of space characters satisfying the above condition.

Next, a plurality of columns before the texts to be extracted are deleted from the document, that is, the non-to-be-extracted text before each text to be extracted "RMB xx yuan" is deleted, so that only (undeleted) space characters are left before the text to be extracted in each line. After the columns are deleted, each original non-to-be-extracted text line becomes a blank line, as shown in (3) in FIG. 3 (the last 3 lines of the document are blank lines). The columns can be deleted through the column function.

Then, in order to obtain the texts to be extracted "RMB xx yuan", the non-to-be-extracted text located after each text to be extracted must also be deleted. Before this deletion, a plurality of space characters are added after each text to be extracted, so that all non-to-be-extracted text in the document lies after the columns occupied by the texts to be extracted, as shown in (4) in FIG. 3 (the last 3 lines of the document are blank lines).

Then, starting from any column that holds a space character located after the texts to be extracted and before the non-to-be-extracted text, the non-to-be-extracted text in the document is deleted so that only space characters remain after each text to be extracted, as shown in (5) in FIG. 3 (the last 3 lines of the document are blank lines).

Finally, all space characters before and after the texts to be extracted "RMB xx yuan" are deleted to obtain the document shown in (6) in FIG. 3 (the last 3 lines of the document are blank lines). It should be noted that, after all space characters are deleted, each text to be extracted starts at the start position (first column) of its line.

In this embodiment, since there are also empty lines in the document, they can be removed by means of a regular expression, for example by searching for "^p^p" and replacing it with "^p"; deleting all blank lines in this way yields a document including only the texts to be extracted, as shown in (7) in FIG. 2. If the variables "xx yuan" in the document need to be extracted, the variable tags "RMB" in the document shown in (7) in FIG. 2 can be deleted by searching for "RMB" and replacing it with nothing, or by using a regular expression, so as to obtain the document shown in (8) in FIG. 2.

Referring to fig. 4, fig. 4 is a schematic diagram of a text extraction apparatus 40 provided in an embodiment of the present application, where the apparatus 40 is applied to a method for text extraction, and includes:

an obtaining unit 401, configured to obtain a document, where the document includes multiple texts to be extracted and other texts except the multiple texts to be extracted, and the texts are distributed in rows and columns of the document; the text to be extracted comprises variable tags and variables, columns in the document extend infinitely, and at most one text to be extracted exists in each line in the document;

an adding unit 402, configured to add M space characters before a plurality of texts to be extracted, where M is greater than or equal to K, K represents the maximum column number of the document content, and K is a fixed value;

a deleting unit 403, configured to delete all text contents in M columns in the document from the first column so that only space characters remain before each text to be extracted;

the adding unit 402 is further configured to add N space characters after a plurality of texts to be extracted, where N is greater than or equal to K;

the deleting unit 403 is further configured to delete all text contents in the N columns starting from the (K+1)-th column of the document, so that only space characters are left after each text to be extracted;

the deleting unit 403 is further configured to delete all space characters in the document to obtain a document including only a plurality of texts to be extracted.

In a possible embodiment, after the document including only a plurality of texts to be extracted is obtained, the deleting unit 403 is further configured to delete the variable tags in the texts to be extracted to obtain a document including a plurality of variables.

In a possible embodiment, after the document including only a plurality of texts to be extracted is obtained, the deleting unit 403 is further configured to delete repeated items among the texts to be extracted to obtain a document including a plurality of different texts to be extracted.

In a possible embodiment, the apparatus further comprises an extension unit 404; the extension unit is used for editing each text to be extracted in the document comprising different texts to be extracted, so as to realize extension operation on each text to be extracted.

In a possible embodiment, before adding M space characters before a plurality of texts to be extracted, the deleting unit 403 is further configured to receive a first instruction; the first instruction is for indicating a first number M; before adding N space characters after a plurality of texts to be extracted, the deleting unit 403 is further configured to receive a second instruction; the second instruction is for indicating a second number N.

The functional modules of the apparatus 40 are used for implementing the method described in the embodiment of fig. 1, and specific contents may refer to the description in the related contents of the embodiment of fig. 1, and for brevity of the description, are not repeated here.

Referring to fig. 5, fig. 5 is a schematic diagram of another text extraction apparatus 600 provided in an embodiment of the present application. The apparatus 600 is configured to implement the text extraction method and includes at least a processor 610, a communication interface 620, and a memory 630, which are coupled by a bus 640. Wherein:

the processor 610 is configured to execute the acquiring unit 401, the adding unit 402, the deleting unit 403, and the extending unit 404 in fig. 4 by calling the program code in the memory 630. In practical applications, the processor 610 may include one or more general-purpose processors, where a general-purpose processor may be any type of device capable of processing electronic instructions, including a Central Processing Unit (CPU), a microprocessor, a microcontroller, a main processor, a controller, an Application Specific Integrated Circuit (ASIC), and the like. The processor 610 reads the program code stored in the memory 630 and cooperates with the communication interface 620 to perform some or all of the steps of the method performed by the text extraction apparatus 600 in the above-described embodiments of the present application.

The communication interface 620 may be a wired interface (e.g., an ethernet interface) for communicating with other computing nodes or devices. When communication interface 620 is a wired interface, communication interface 620 may employ a Protocol family over TCP/IP, such as RAAS Protocol, Remote Function Call (RFC) Protocol, Simple Object Access Protocol (SOAP) Protocol, Simple Network Management Protocol (SNMP) Protocol, Common Object Request Broker Architecture (CORBA) Protocol, and distributed Protocol, among others.

Memory 630 may store program code as well as data information. The program code includes a code of an acquisition unit 401, a code of an addition unit 402, a code of a deletion unit 403, and a code of an extension unit 404. The data information includes: the method comprises the steps of obtaining a document, a text to be extracted, a first maximum value, a document comprising only a plurality of texts to be extracted and the like. In practical applications, the Memory 630 may include a Volatile Memory (Volatile Memory), such as a Random Access Memory (RAM); the Memory may also include a Non-Volatile Memory (Non-Volatile Memory), such as a Read-Only Memory (ROM), a Flash Memory (Flash Memory), a Hard Disk Drive (HDD), or a Solid-State Drive (SSD) Memory, which may also include a combination of the above types of memories.

The embodiment of the present application also provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program is executed by hardware (for example, a processor, etc.) to implement part or all of the steps of any one of the methods performed by the text extraction apparatus in the embodiment of the present application.

The embodiment of the present application also provides a computer program product, which, when being read and executed by a computer, causes a text extraction apparatus to perform part or all of the steps of the method for extracting text in the embodiment of the present application.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented, in whole or in part, by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a storage disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others. The descriptions of the embodiments each have their own emphasis; for parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A text extraction method, comprising:

acquiring a document, wherein the document comprises a plurality of texts to be extracted distributed in lines and columns of the document and other texts except the plurality of texts to be extracted; the text to be extracted comprises variable labels and variables, columns in the document extend infinitely, and at most one text to be extracted exists in each row in the document;

adding M space characters before the texts to be extracted, wherein M is more than or equal to K, the K represents the maximum column number of the document contents, and the K is a fixed value;

deleting all text contents in M columns starting from the first column in the document, so that space characters are left before each text to be extracted;

adding N space characters after the texts to be extracted, wherein N is greater than or equal to K;

deleting all text content in the N columns starting from the (K+1)-th column of the document, so that space characters remain after each text to be extracted;

and deleting all space characters in the document to obtain a document comprising only the plurality of texts to be extracted.
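The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes that the document is held as a list of rows, that the texts to be extracted can be located with a regular expression (the `pattern` argument and the `SNREF="..."` sample are hypothetical), that the texts themselves contain no spaces, and that rows left empty at the end are dropped. The function name `extract_texts` is likewise illustrative.

```python
import re


def extract_texts(lines, pattern, k, m=None, n=None):
    """Column-based extraction; pattern locates the text to extract (assumption)."""
    m = k if m is None else m
    n = k if n is None else n
    if m < k or n < k:
        raise ValueError("M and N must each be >= K")
    if any(len(line) > k for line in lines):
        raise ValueError("K must be at least the width of the widest row")
    target = re.compile(pattern)

    def pad(rows, count, before):
        # Insert `count` spaces directly before or after the match in each row.
        out = []
        for row in rows:
            hit = target.search(row)
            if hit:
                pos = hit.start() if before else hit.end()
                row = row[:pos] + " " * count + row[pos:]
            out.append(row)
        return out

    rows = pad(lines, m, before=True)               # add M spaces before each text
    rows = [row[m:] for row in rows]                # delete columns 1..M
    rows = pad(rows, n, before=False)               # add N spaces after each text
    rows = [row[:k] + row[k + n:] for row in rows]  # delete columns K+1..K+N
    rows = [row.replace(" ", "") for row in rows]   # delete all space characters
    return [row for row in rows if row]             # assumption: drop empty rows


sample = [
    '<PARAM SNREF="EngineSpeed" TYPE="A"/>',
    '<!-- comment -->',
    '  <PARAM SNREF="Coolant_Temp"/>',
]
extracted = extract_texts(sample, r'SNREF="\w+"', k=40)
# extracted == ['SNREF="EngineSpeed"', 'SNREF="Coolant_Temp"']
```

Because M is at least K, the first column deletion removes every character that precedes a text to be extracted and empties every row that contains no such text. Because N is at least K, any trailing content is pushed into columns K+1 to K+N and removed by the second deletion.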

2. The method of claim 1, wherein after obtaining the document including only the plurality of texts to be extracted, the method further comprises:

deleting the variable tags from the plurality of texts to be extracted to obtain a document comprising only a plurality of variables.
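The tag-deletion step of claim 2 might be sketched as follows. The tag format is not specified here, so the pattern `SNREF="|"` and the helper name `strip_variable_tags` are hypothetical and serve only to illustrate the idea.

```python
import re


def strip_variable_tags(texts, tag_pattern):
    """Remove every match of tag_pattern (a hypothetical tag format) from each text."""
    tag = re.compile(tag_pattern)
    return [tag.sub("", text) for text in texts]


variables = strip_variable_tags(
    ['SNREF="EngineSpeed"', 'SNREF="Coolant_Temp"'],
    r'SNREF="|"',  # assumption: the tag is the leading SNREF=" plus the closing quote
)
# variables == ['EngineSpeed', 'Coolant_Temp']
```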

3. The method of claim 1, wherein after obtaining the document including only the plurality of texts to be extracted, the method further comprises:

deleting duplicate items among the plurality of texts to be extracted to obtain a document comprising a plurality of mutually different texts to be extracted.
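Deleting duplicate items as in claim 3 can be sketched in one line of Python. Keeping the first occurrence of each text and preserving the original order is an assumption, since the claim does not specify an ordering. The name `drop_duplicates` is illustrative.

```python
def drop_duplicates(texts):
    """Keep the first occurrence of each text, in original order (assumption)."""
    # dict keys are unique and keep insertion order in Python 3.7+.
    return list(dict.fromkeys(texts))


unique = drop_duplicates(['SNREF="A"', 'SNREF="B"', 'SNREF="A"'])
# unique == ['SNREF="A"', 'SNREF="B"']
```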

4. The method of claim 2, further comprising:

editing the plurality of variables in the document comprising only the plurality of variables to perform an expansion operation on each variable.

5. The method of claim 1, wherein:

before adding the M space characters before the plurality of texts to be extracted, the method further comprises: receiving a first instruction, the first instruction indicating a first number M; and

before adding the N space characters after the plurality of texts to be extracted, the method further comprises: receiving a second instruction, the second instruction indicating a second number N.

6. A text extraction apparatus, comprising:

the adding unit is further used for adding N space characters after the plurality of texts to be extracted, wherein N is larger than or equal to K;

7. The apparatus according to claim 6, wherein after the document comprising only the plurality of texts to be extracted is obtained, the deleting unit is further configured to: delete the variable tags from the plurality of texts to be extracted to obtain a document comprising only a plurality of variables.

8. The apparatus according to claim 6, wherein after the document comprising only the plurality of texts to be extracted is obtained, the deleting unit is further configured to: delete duplicate items among the plurality of texts to be extracted to obtain a document comprising a plurality of mutually different texts to be extracted.

9. The apparatus of claim 7, further comprising an expansion unit;

the expansion unit is configured to edit each text to be extracted in the document comprising the plurality of different texts to be extracted, so as to perform an expansion operation on each text to be extracted.

10. The apparatus of claim 6, wherein:

before adding the M space characters before the plurality of texts to be extracted, the adding unit is further configured to: receive a first instruction, the first instruction indicating a first number M; and

before adding the N space characters after the plurality of texts to be extracted, the adding unit is further configured to: receive a second instruction, the second instruction indicating a second number N.