CN116306527A - Text processing method, model training method, device, equipment and storage medium - Google Patents
- Publication number: CN116306527A
- Application number: CN202211599089.9A
- Authority: CN (China)
- Legal status: Pending (assumed; not a legal conclusion)
Classifications
- G06F40/174 — Handling natural language data; Text processing; Editing, e.g. inserting or deleting; Form filling; Merging
- G06F40/289 — Handling natural language data; Natural language analysis; Recognition of textual entities; Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/30 — Handling natural language data; Semantic analysis
Abstract
The application provides a text processing method, a model training method, a device, equipment and a storage medium, relating to the technical field of neural networks. A text processing model is obtained by training on training sample texts into which separation marks have been added. Each training sample text is annotated with the label information and position information of its separation marks, where the label information of a separation mark indicates whether the texts at the position of that mark need to be merged. Because the label information is generated from the real semantics of the text at the mark's position in the training sample text, the labeling accuracy is high, and the trained text processing model can therefore accurately merge a target processing text based on the label information and position information of the separation marks annotated in the training sample texts. Since the training sample texts can be obtained by concatenating multiple lines of text, the trained text processing model is suited to merging multiple lines of text at once, which improves the efficiency of multi-line text merging.
Description
Technical Field
The application relates to the technical field of neural networks, in particular to a text processing method, a model training method, a device, equipment and a storage medium.
Background
In text processing, complex table-merging problems are frequently encountered, especially for cross-page tables and borderless (lineless) tables, where the merging relation between lines of text cannot be determined simply from layout information such as the spacing between lines and the indentation.
In the prior art, deciding whether two lines of text should be merged is generally treated as a binary classification task: multiple lines of text are compared pairwise in sequence, and a merge decision is made separately for each pair to obtain the final merged result. Merging text in this way is therefore inefficient.
Disclosure of Invention
The present application aims to provide a text processing method, a model training method, a device, equipment and a storage medium that address the above defects in the prior art, so as to solve the problem of low text-merging efficiency in the prior art.
In order to achieve the above purpose, the technical solution adopted in the embodiment of the present application is as follows:
in a first aspect, an embodiment of the present application provides a text processing method, including:
Reading the text of at least one cell in the file to be processed;
adding a separation mark to the text of the at least one cell to obtain a target processing text;
inputting the target processing text into a pre-trained text processing model, identifying whether texts segmented by each separation mark in the target processing text need to be combined, and combining the target processing text according to the identification result to obtain at least one target text; the text processing model is obtained by training a training sample text with marking information, the marking information comprises label information of a separation mark added into the training sample text and the position of the separation mark, and the label information is generated based on the real semantics of the text of the position of the separation mark in the training sample text.
Optionally, the adding a separation mark to the text of the at least one cell to obtain target processing text includes:
and adding a separation mark between the texts of each pair of adjacent cells to obtain the target processing text.
Optionally, the adding a separation mark to the text of the at least one cell to obtain target processing text includes:
And adding separation marks between the texts of each pair of adjacent cells, and adding separation marks between the texts within each cell, to obtain the target processing text.
Optionally, the adding a separation mark between the texts in each cell includes:
and inserting a separation mark in at least one random position of the text in each cell to obtain the target processing text.
Optionally, the adding a separation mark between the texts in each cell includes:
word segmentation is carried out on the texts in the cells, and word segmentation processing results are obtained;
determining at least one complete word in the text in the cell according to the word segmentation processing result;
determining at least one target word from the at least one complete word;
and adding a separation mark in each target word.
In a second aspect, an embodiment of the present application provides a text processing model training method, including:
collecting a plurality of first initial sample texts, preprocessing the first initial sample texts to obtain a first sample training text set, wherein the first sample training text set comprises a plurality of first training sample texts, each first training sample text is provided with labeling information, and the labeling information comprises: tag information of a separation mark added to a first training sample text and the position of the separation mark, wherein the tag information is generated based on the real semantics of the text of the position of the separation mark in the first training sample text;
And training the text set by adopting the first sample to acquire a text processing model.
Optionally, the collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set includes:
extracting a plurality of first initial sample texts from at least one sample file with a preset format, wherein each first initial sample text comprises the text of at least one cell in the sample file;
noise reduction is carried out on each first initial sample text, non-text characters in each first initial sample text are deleted, and a first preprocessed sample text corresponding to each first initial sample text is obtained;
adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text;
and obtaining the first sample training text set according to each first training sample text.
Optionally, training the text set using the first sample, training to obtain the text processing model includes:
acquiring a second sample training text set corresponding to the target field, wherein labeling information of each second training sample text in the second sample training text set is labeled by a user;
And training to acquire the text processing model by adopting the first training sample text set and the second training sample text set.
Optionally, the extracting a plurality of first initial sample texts from at least one sample file with a preset format includes:
and sequentially extracting whole columns of cell texts, in the column direction, from a ruled table in at least one sample file with a preset format, and sequentially concatenating the extracted texts as a first initial sample text.
Optionally, the denoising the first initial sample text, and deleting the non-text characters in the first initial sample text to obtain a first preprocessed sample text corresponding to the first initial sample text, which includes:
performing full-width to half-width conversion on the first initial sample text, and deleting non-text characters in the first initial sample text to obtain a first preprocessed sample text corresponding to the first initial sample text, wherein the non-text characters comprise: preset separators, spaces, hypertext markup language tags, and Chinese garbled characters.
Optionally, the adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain the first training sample text includes:
And if the character length of the first preprocessed sample text with the currently inserted separation marks reaches a preset length, or the number of inserted separation marks reaches a preset number, deleting the text beyond the preset length from the first preprocessed sample text to obtain the first training sample text.
In a third aspect, an embodiment of the present application further provides a text processing apparatus, including: the device comprises a reading module, a marking module and a processing module;
the reading module is used for reading the text of at least one cell in the file to be processed;
the marking module is used for adding a separation mark to the text of the at least one cell to obtain a target processing text;
the processing module is used for inputting the target processing text into a pre-trained text processing model, identifying whether the texts segmented by each separation mark in the target processing text need to be combined or not, and combining the target processing text according to the identification result to obtain at least one target text; the text processing model is obtained by training a training sample text with labeling information, wherein the labeling information comprises labels of separation marks added into the training sample text and positions of the separation marks.
Optionally, the marking module is specifically configured to add a separation mark between the texts of each adjacent cell, so as to obtain the target processing text.
Optionally, the marking module is specifically configured to add a separation mark between the texts of each adjacent cell, and add a separation mark between the texts in each cell, so as to obtain the target processing text.
Optionally, the marking module is specifically configured to insert a separation mark in at least one random position of the text in each cell, so as to obtain the target processing text.
Optionally, the marking module is specifically configured to perform word segmentation on the text in the cell, so as to obtain a word segmentation result;
determining at least one complete word in the text in the cell according to the word segmentation processing result;
determining at least one target word from the at least one complete word;
and adding a separation mark in each target word.
In a fourth aspect, an embodiment of the present application further provides a text processing model training apparatus, including: the system comprises an acquisition module and a training module;
the collection module is used for collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set, the first sample training text set comprises a plurality of first training sample texts, each first training sample text has labeling information, and the labeling information comprises: tag information of a separation mark and a position of the separation mark are added into a first training sample text, wherein the tag information is generated based on real semantics of a text of the position of the separation mark in the training sample text;
And the training module is used for training the text set by adopting the first sample and training to acquire a text processing model.
Optionally, the acquisition module is specifically configured to
Extracting a plurality of first initial sample texts from at least one sample file with a preset format, wherein each first initial sample text comprises the text of at least one cell in the sample file;
noise reduction is carried out on each first initial sample text, non-text characters in each first initial sample text are deleted, and a first preprocessed sample text corresponding to each first initial sample text is obtained;
adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text;
and obtaining the first sample training text set according to each first training sample text.
Optionally, the training module is specifically configured to
Acquiring a second sample training text set corresponding to the target field, wherein labeling information of each second training sample text in the second sample training text set is labeled by a user;
and training to acquire the text processing model by adopting the first training sample text set and the second training sample text set.
Optionally, the acquisition module is specifically configured to
And sequentially extracting whole columns of cell texts, in the column direction, from a ruled table in at least one sample file with a preset format, and sequentially concatenating the extracted texts as a first initial sample text.
Optionally, the acquisition module is specifically configured to
Performing full-width to half-width conversion on the first initial sample text, and deleting non-text characters in the first initial sample text to obtain a first preprocessed sample text corresponding to the first initial sample text, wherein the non-text characters comprise: preset separators, spaces, hypertext markup language tags, and Chinese garbled characters.
Optionally, the acquisition module is specifically configured to
And if the character length of the first preprocessed sample text with the currently inserted separation marks reaches a preset length, or the number of inserted separation marks reaches a preset number, deleting the text beyond the preset length from the first preprocessed sample text to obtain the first training sample text.
In a fifth aspect, embodiments of the present application provide an electronic device, including: a processor, a storage medium, and a bus, the storage medium storing machine-readable instructions executable by the processor; when the electronic device is operating, the processor and the storage medium communicate over the bus, and the processor executes the machine-readable instructions to perform the steps of the method as provided in the first or second aspect.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as provided in the first or second aspect.
The beneficial effects of this application are:
the text processing model is obtained by training on training sample texts into which separation marks have been added. Each training sample text is annotated with the label information and position information of its separation marks, where the label information of a separation mark indicates whether the texts at the position of that mark need to be merged. Because the label information is generated from the real semantics of the text at the mark's position in the training sample text, the labeling accuracy is high, and the trained text processing model can therefore accurately merge a target processing text based on the label information and position information of the separation marks annotated in the training sample texts. Since the training sample texts can be obtained by concatenating multiple lines of text, the trained text processing model is suited to merging multiple lines of text at once, which improves the efficiency of multi-line text merging.
In addition, a sample training text set corresponding to a target field is obtained by manually labeling samples of that field, so that the automatically labeled sample training text set is expanded with these samples. This saves labeling cost while preserving the generality of the samples and improving accuracy in the specific scenario, and further improves the generalization of the trained text processing model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a first schematic flow chart of a text processing method according to an embodiment of the present application;
Fig. 2 is a second schematic flow chart of a text processing method according to an embodiment of the present application;
Fig. 3 is a first schematic flow chart of a text processing model training method according to an embodiment of the present application;
Fig. 4 is a second schematic flow chart of a text processing model training method according to an embodiment of the present application;
Fig. 5 is a third schematic flow chart of a text processing model training method according to an embodiment of the present application;
Fig. 6 is a network schematic diagram of a text processing model according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a text processing apparatus according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a text processing model training apparatus according to an embodiment of the present application;
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.
In addition, the described embodiments are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
In order to enable one skilled in the art to use the present disclosure, the following embodiments are presented in connection with a specific application scenario "form text merging in PDF files". It will be apparent to those having ordinary skill in the art that the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present application.
It should be noted that the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features stated hereinafter, but not to exclude the addition of other features.
The PDF (Portable Document Format) format is widely used for storing and transmitting files of all kinds, and information often needs to be extracted from PDF documents. During extraction, a large number of complex table-merging problems arise that can only be resolved by semantics, such as borderless tables and cross-page tables. Borderless tables frequently break content into multiple lines of text, and whether those lines belong together cannot be judged simply from information such as line spacing and indentation, but can be judged well from semantics; likewise, whether a cross-page table should be merged is judged from the semantic relation between its cell texts.
Based on the above, when table-merging problems are solved by semantic judgment, a text processing model can be trained by inserting separation marks into collected sample texts and using the real semantic information of each sample text to determine whether the texts at each inserted mark need to be merged. The text processing model trained on these mark-inserted sample texts can then accurately judge whether any number of lines of text extracted from a table should be merged.
In conventional approaches that use a trained model for text merging, deciding whether two of the acquired lines of text should be merged is generally treated as a binary classification task, which requires many pairwise comparisons between the lines. For example, given 4 lines of text, pairwise comparisons between lines 1 and 2, 2 and 3, and 3 and 4 are required, so the 2nd and 3rd lines are each input into the model twice; for models that are sensitive to sequence length, this significantly increases prediction time. On the other hand, text merging requires a large amount of low-cost, automatically labeled corpus, and a corpus-generation method based on pairwise merging is not suitable for multi-text merging.
Based on the above, the application provides a text processing method applicable to merging multiple lines of text and a corresponding text processing model training method. The training samples are improved: multiple lines of text are concatenated to obtain a training sample, which is then labeled automatically and used to train a text processing model suitable for multi-line merging. Compared with the binary-classification approach, each sample is input into the model fewer times and the total length of the training samples is smaller, so the model trains more efficiently and infers merging results more efficiently.
The relevant steps of the method are described below by means of specific examples.
Fig. 1 is a schematic flow chart of a text processing method according to an embodiment of the present application; the subject of execution of the method may be a computer device or a server. As shown in fig. 1, the method may include:
s101, reading text of at least one cell in the file to be processed.
The method can be applied to files of any format, including but not limited to rich-text formats such as PDF files and HTML (HyperText Markup Language) files. When a file to be processed contains multiple lines of text, problems such as erroneous line breaks introduced during writing may wrongly split text that originally belonged to one paragraph into several pieces, making the text hard to read and causing text information to be extracted incorrectly. Through text merging, the texts that should be merged are merged while the texts that should stay separate are kept independent, finally yielding text with complete information.
In this embodiment, the method is illustrated with the file to be processed being a PDF file that contains table text; in practical applications, the file to be processed may be of another type and need not contain a table.
The text of at least one cell in the file to be processed can be read in a mode of autonomous scanning of the file or a mode of image recognition of the file, and the at least one cell can comprise adjacent cells or non-adjacent cells.
S102, adding a separation mark to the text of at least one cell to obtain a target processing text.
Adding a separator mark to text of at least one cell may include adding a separator mark to text between two cells, or may include adding a separator mark to text within one cell.
In some embodiments, a separation mark may be added to the text of each cell as it is read, and the process finishes once the text of all cells to be read has been read, thereby obtaining the target processing text.
In other embodiments, the text of all the cells to be read may be read first, the texts of all the cells are connected in series according to the reading sequence, and then the separation mark is added to the texts obtained by the series connection in a unified manner, so as to obtain the target processing text.
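For illustration only (not part of the claimed embodiments), a minimal sketch of this reading-and-marking step, assuming the cell texts have already been extracted from the file to be processed and using "/" as the separation mark as in the examples below; all names are hypothetical:

```python
# A minimal sketch of S101-S102, assuming the cell texts have already
# been read from the file to be processed; "/" stands in for the
# separation mark used in the examples of this application.
SEP = "/"

def build_target_text(cell_texts: list[str]) -> str:
    """Concatenate cell texts in reading order, adding a separation mark
    between the texts of each pair of adjacent cells."""
    return SEP.join(cell_texts)

cells = ["Today the weather is good",
         "Where do you want to go to play",
         "Let's go to the bar together"]
print(build_target_text(cells))
# -> Today the weather is good/Where do you want to go to play/Let's go to the bar together
```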
S103, inputting the target processing text into a pre-trained text processing model, identifying whether the texts segmented by each separation mark in the target processing text need to be combined, and combining the target processing text according to the identification result to obtain at least one target text; the text processing model is obtained by training a training sample text with labeling information, the labeling information comprises label information of separation marks added to the training sample text and positions of the separation marks, and the label information is generated based on real semantics of the text where the separation marks are located in the training sample text.
Optionally, the input target processing text may be identified by using a text processing model obtained by training in advance, and the target processing text may be combined by identifying a text combining manner indicated by each separation mark added in the target processing text.
The merging mode indicated by a separation mark can be one of two: the texts need to be merged, or the texts do not need to be merged. When a mark indicates that the texts need to be merged, the texts at the position of that mark are merged; when a mark indicates that they do not, the texts at that position are not merged, the text before the mark being kept as one text and the text after it as another. In this way, at least one target text is obtained after the merging of the target processing text.
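As a sketch only (function and variable names are hypothetical, and the per-mark decisions are assumed to have already been produced by the model), the merging rule above can be expressed as:

```python
# Merge the segments of a target processing text according to one
# decision per separation mark (True = merge the texts at that mark).
def merge_by_marks(target_text: str, decisions: list[bool], sep: str = "/") -> list[str]:
    segments = target_text.split(sep)
    assert len(decisions) == len(segments) - 1  # one decision per mark
    merged = [segments[0]]
    for segment, need_merge in zip(segments[1:], decisions):
        if need_merge:
            merged[-1] += segment   # drop the mark and join (no space, as in Chinese text)
        else:
            merged.append(segment)  # the mark separates two independent texts
    return merged

print(merge_by_marks("Today the/weather is good/Let's go to the bar together",
                     [True, False]))
# -> ['Today theweather is good', "Let's go to the bar together"]
```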
Alternatively, the separation marks in the training sample texts can be added manually according to the semantic information of the texts, or added automatically according to the real semantic information obtained after semantic recognition of the training sample texts. Because different people may analyze the semantics of the same text differently, when marks are added manually, the mark-addition results of several people can be integrated for the same training sample text to obtain the final result.
Since the separation marks in a training sample text are added according to the semantic information of the text, and the semantic information can clearly determine whether the text needs to be split, the marks added to the training sample text are highly accurate, and so is the label information obtained for them.
The text processing model is obtained by training on training sample texts into which separation marks have been added. Each training sample text is also annotated with the label information and position information of its separation marks; the label information of a separation mark indicates whether the texts at the position of that mark need to be merged, and is generated from the real semantics of the text at that position, so the labeling accuracy is high. The trained text processing model can therefore accurately merge a target processing text based on the annotated label information and position information of the separation marks.
In summary, in the text processing method provided by this embodiment, because the training sample texts can be obtained by concatenating multiple lines of text, the trained text processing model is suited to merging multiple lines of text at once, which improves the efficiency of multi-line text merging.
Optionally, in step S102, adding a separation mark to the text of at least one cell to obtain the target processing text may include: adding a separation mark between the texts of each pair of adjacent cells to obtain the target processing text.
In one implementation, a separation mark may be added only between text of adjacent cells, while no separation mark is added between text within one cell, where adding a separation mark between text of adjacent cells may also be understood as adding a separation mark between text of independent cells.
For example, suppose the texts of 3 cells are extracted: cell 1 — "Today the weather is good"; cell 2 — "Where do you want to go to play"; cell 3 — "Let's go to the bar together". Then the target processing text obtained after adding a separation mark between the texts of each pair of adjacent cells may be: Today the weather is good/Where do you want to go to play/Let's go to the bar together.
In this embodiment "/" is used as the separation mark; in practical applications, the mark may be [sep], a symbol representing the end of a sentence, or any other symbol.
Optionally, in step S102, adding a separation mark to the text of at least one cell to obtain the target processing text may include: adding a separation mark between the texts of each pair of adjacent cells, and adding separation marks between the texts within each cell, to obtain the target processing text.
In another embodiment, a separator mark may be added between text within each cell along with a separator mark between text within each adjacent cell.
Optionally, in the step, adding a separation mark between the texts in each cell may include: and inserting a separation mark in at least one random position of the text in each cell to obtain the target processing text.
There can be two ways of adding separation marks between the texts within a cell. In the first way, separation marks are added at random positions within the cell's text.
For example, with the texts of the 3 cells above, the target processing text obtained after adding a separation mark between the texts of each pair of adjacent cells and inserting a separation mark at at least one random position within each cell's text may be: Today/the weather is good/where do you/want to go to play/let's go/to the bar together.
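A sketch of this random insertion, assuming character-level positions (the function name and the number of insertions are illustrative):

```python
import random

# Insert separation marks at random character positions inside one cell's text.
def insert_random_marks(text: str, n_marks: int = 1, sep: str = "/") -> str:
    k = min(n_marks, max(len(text) - 1, 0))
    positions = sorted(random.sample(range(1, len(text)), k)) if k else []
    pieces, prev = [], 0
    for pos in positions:
        pieces.append(text[prev:pos])
        prev = pos
    pieces.append(text[prev:])
    return sep.join(pieces)

print(insert_random_marks("Today the weather is good", n_marks=1))
# e.g. -> Today the wea/ther is good
```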
Fig. 2 is a second schematic flow chart of a text processing method according to an embodiment of the present application; optionally, the step of adding a separation mark between the texts in each cell may include:
S201, word segmentation is carried out on the texts in the cells, and word segmentation processing results are obtained.
In addition to the way of randomly adding the separation marks, the word segmentation process can be performed on the text in the single cell to obtain a word segmentation result, and the word segmentation process can be performed by extracting complete words in the text.
S202, determining at least one complete word in the text in the cell according to the word segmentation processing result.
Through word segmentation, the text in a cell can be divided into a number of tokens; some may be single characters while others are complete words, and in this embodiment the complete words are identified.
Continuing with the above example, suppose word segmentation is performed on the text in cell 1; the word segmentation result may include: today; weather; good. All three words obtained by segmentation are complete words.
Suppose word segmentation is performed on the text in cell 2; the word segmentation result may include: you; want to go; where; play. Among these, "where" is a complete word.
S203, determining at least one target word from at least one complete word.
Alternatively, one or more of the complete words obtained by word segmentation may be selected as the target word.
S204, adding a separation mark in each target word.
A separation mark is then added within each determined target word; adding a mark within a target word means inserting it between the characters of that word. For example, if the target word is "where" (a two-character word in the original Chinese text), it becomes "whe/re" after insertion.
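As an illustrative sketch of this word-segmentation-based insertion (the jieba tokenizer and the "at least two characters" criterion for a complete word are assumptions; the application does not prescribe a specific segmenter):

```python
import random
import jieba  # one possible Chinese word-segmentation tool (an assumption)

# Segment a cell's text, pick a target word among the complete words,
# and insert a separation mark inside it.
def insert_mark_in_word(cell_text: str, sep: str = "/") -> str:
    words = jieba.lcut(cell_text)                  # word segmentation result
    complete = [w for w in words if len(w) >= 2]   # multi-character words treated as complete
    if not complete:
        return cell_text
    target = random.choice(complete)               # determine a target word
    cut = random.randrange(1, len(target))         # position inside the word
    marked = target[:cut] + sep + target[cut:]
    return cell_text.replace(target, marked, 1)

print(insert_mark_in_word("你想去哪里玩"))  # e.g. -> 你想去哪/里玩
```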
Of course, the above is only some optional insertion ways of the separation mark, and other addition ways are also possible in practical application.
In one implementation, when there are multiple target processing texts, in order to avoid confusion of different target processing texts when inputting a text processing model, special marks may be inserted at the beginning and the end of the target processing texts respectively to distinguish different target processing texts.
For example, if the target processing text is: Today/the weather is good/where do you/want to go to play/let's go/to the bar together, then the target processing text input into the text processing model may be: [cls]Today/the weather is good/where do you/want to go to play/let's go/to the bar together[sep].
The above is a description of the relevant steps of the text processing model application process, and the following describes the training mode of the text processing model.
The above embodiment describes an application process of a text processing model, and applies the text processing model to a text merging scene, so as to accurately merge cell texts in a file to be processed. The training process of the text processing model will be described next by way of specific embodiments.
FIG. 3 is a flowchart illustrating a training method of a text processing model according to an embodiment of the present application; optionally, the method may further include:
s301, collecting a plurality of first initial sample texts, preprocessing the first initial sample texts to obtain a first sample training text set, wherein the first sample training text set comprises a plurality of first training sample texts, each first training sample text has labeling information, and the labeling information comprises: tag information of the separator mark added to the first training sample text and the position of the separator mark, the tag information being generated based on the true semantics of the text in which the separator mark is located in the first training sample text.
Similar to the application process of the text processing model, the collected first initial sample text can also be composed of multiple lines of text in the training process of the text processing model, so that the trained text processing model can accurately combine the multiple lines of text.
In some embodiments, the acquired first initial sample text may also be preprocessed to obtain first sample training text. The preprocessing may include noise reduction processing, so that the first initial sample text format is more standard, and special mark deletion processing, so as to avoid interference to the separation mark added subsequently.
Optionally, the first sample training texts may be labeled automatically. Similarly to the application process, separation marks are added to a first sample training text so that it carries labeling information, and the model can be trained to determine from this labeling information whether texts should be merged.
The labeling information comprises: the label of each separation mark added to the first training sample text, and the position of that mark. Here the labels of the separation marks are 0 and 1, where 0 indicates that the texts at the position of the mark do not need to be merged and 1 indicates that they do. The position of a separation mark refers to its character position in the first sample training text.
For example, if the first sample training text is: Today/the weather is good/where do you want to go/to play, then the position of the first separation mark is 3, that of the second is 8, and that of the third is 14 (character offsets in the original Chinese text).
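A sketch of extracting these positions automatically (names are hypothetical; whether offsets are 0- or 1-based is an assumption that only needs to be consistent between training and inference):

```python
# Return the character offsets of every separation mark in a sample text.
def mark_positions(sample_text: str, sep: str = "/") -> list[int]:
    return [i for i, ch in enumerate(sample_text) if ch == sep]

text = "Today/the weather is good/where do you want to go"
print(mark_positions(text))  # -> [5, 25] (0-based offsets of the two marks)
```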
S302, training a text set by adopting the first sample, and training to obtain a text processing model.
The text processing model is trained on the first sample training texts carrying the labeling information. Note that the above describes a single first sample training text; when training the text processing model, a large number of first sample training texts need to be collected to form the first sample training text set, and the model is trained on that set.
Fig. 4 is a second flow chart of a text processing model training method according to an embodiment of the present application; optionally, in step S301, collecting a plurality of first initial sample texts, and preprocessing the first initial sample texts to obtain a first sample training text set, which may include:
s401, extracting a plurality of first initial sample texts from at least one sample file with a preset format, wherein each first initial sample text comprises the text of at least one cell in the sample file.
The preset format here may be a PDF format, and a table is included in the PDF file. Of course, in practice, it is not limited to PDF format. A plurality of first initial sample text may be extracted from different PDF sample files.
Alternatively, the extracted first initial sample text may contain the text of at least one cell in the sample file, i.e., a first initial sample text may be composed of the texts extracted from at least one cell of one sample file.
S402, denoising each first initial sample text, deleting non-text characters in each first initial sample text, and obtaining a first preprocessed sample text corresponding to each first initial sample text.
The purpose of the noise reduction process here is to make the first initial sample text format more canonical, facilitating model learning. The purpose of deleting non-text characters is to remove the interference of special characters on text semantics, and specific processing modes can be referred to below.
S403, adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text.
Optionally, the manner adopted in adding the separation mark to the text of at least one cell in the first preprocessed sample text is consistent with adding the separation mark to the text of at least one cell read in the text to be processed in the step S102. And will not be described in detail herein.
S404, obtaining a first sample training text set according to each first training sample text.
The above description is directed to a first training sample text obtaining manner, and a plurality of obtained first training sample texts form a first training sample text set.
Fig. 5 is a flowchart of a text processing model training method according to an embodiment of the present application; optionally, in step S302, training the text set using the first sample to obtain a text processing model may include:
S501, a second sample training text set corresponding to the target field is obtained, and labeling information of each second training sample text in the second sample training text set is labeled by a user.
In some embodiments, for merging texts in certain specific fields (target fields), training sample texts are relatively scarce, so a second sample training text set can be extracted from the specific field. Unlike the first sample training text set, the labeling information of each second training sample text in the second set is labeled manually rather than by the automatic labeling described above; manual labeling improves the accuracy of the labeled samples in the specific field.
S502, training to obtain a text processing model by adopting the first training sample text set and the second training sample text set.
Alternatively, the first training sample text set and the second training sample text set may be used as training samples to train to obtain a text processing model.
Optionally, in step S401, extracting a plurality of first initial sample texts from at least one sample file having a preset format may include: sequentially extracting whole columns of cell texts, in the column direction, from a ruled table in at least one sample file with the preset format, and sequentially concatenating the extracted texts as a first initial sample text.
For example, assuming the table in a sample file has three rows and four columns, then for the first column, the cell texts can be extracted sequentially starting from the cell in the first row and concatenated in order to form one first initial sample text; the second, third and fourth columns each likewise yield a first initial sample text.
Of course, whole rows of cell text may instead be extracted sequentially from the ruled table in the row direction; the specific extraction manner is not limited.
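A sketch of the column-wise reading, assuming the table has already been parsed into a row-major grid of cell strings (the separation marks themselves are added later, in step S403):

```python
# One first initial sample text (here a list of cell texts, in reading
# order) per column of a parsed ruled table.
def extract_column_samples(grid: list[list[str]]) -> list[list[str]]:
    n_cols = len(grid[0])
    return [[row[col] for row in grid] for col in range(n_cols)]

grid = [["a1", "b1"], ["a2", "b2"], ["a3", "b3"]]  # 3 rows x 2 columns
print(extract_column_samples(grid))
# -> [['a1', 'a2', 'a3'], ['b1', 'b2', 'b3']]
```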
Optionally, in step S402, performing noise reduction on each first initial sample text and deleting the non-text characters in it to obtain the corresponding first preprocessed sample text may include: performing full-width to half-width conversion on the first initial sample text, and deleting the non-text characters in it to obtain the corresponding first preprocessed sample text, wherein the non-text characters comprise: preset separators, spaces, hypertext markup language tags, and Chinese garbled characters.
It should be noted that a full-width character occupies two character cells in the text, while a half-width character occupies one. The characters in the first initial sample text that occupy two character cells can be converted into characters that occupy one, so that every character in the text occupies a single character cell, which makes it easier to determine the positions of the separation marks in the text.
In addition, the non-text characters to be deleted from the first initial sample text may, in this embodiment, include the preset separator, spaces, hypertext markup language tags, and Chinese garbled characters, where the preset separator may be "/". Since the preset separator is the same symbol as the separation mark to be added to the training sample texts, it is a non-text character that must be deleted; the other characters, including spaces, hypertext markup language tags, and Chinese garbled characters, are not characters that must be deleted and may be deleted or kept.
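A sketch of this preprocessing under stated assumptions: NFKC normalization is used here as one common way to fold full-width characters to half-width, and the regular expressions are illustrative; the application itself names no specific API.

```python
import re
import unicodedata

# Noise-reduce one first initial sample text (S402 sketch).
def preprocess(raw: str) -> str:
    text = unicodedata.normalize("NFKC", raw)  # full-width -> half-width
    text = re.sub(r"<[^>]+>", "", text)        # delete HTML tags
    text = text.replace("/", "")               # delete the preset separator
    text = re.sub(r"\s+", "", text)            # delete spaces and other whitespace
    return "".join(ch for ch in text if ch.isprintable())  # drop unprintable debris

print(preprocess("Ｔｏｄａｙ <b>the/ weather</b> is good"))
# -> Todaytheweatherisgood
```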
Optionally, in the step S403, adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain the first training sample text may include: traversing the texts of at least one cell in sequence, and adding a separation mark between the texts of adjacent cells.
Adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text, and further comprising: adding separation marks between texts of adjacent cells; and a separator mark is added at least at one random position between the text within each cell.
Adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text, and further comprising: adding separation marks between texts of adjacent cells; word segmentation processing is carried out on the texts in each cell, and word segmentation processing results are obtained; determining at least one complete word in the text in each cell according to the word segmentation processing result; a separator mark is added to at least one complete word.
The manner of adding the separation mark to the training sample text is the same as that of adding the separation mark to the cell text extracted from the file to be processed.
Optionally, in step S403, adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain the first training sample text may include: if the character length of the first preprocessed sample text with the currently inserted separation marks reaches a preset length, or the number of inserted separation marks reaches a preset number, deleting the text beyond the preset length from the first preprocessed sample text to obtain the first training sample text.
In some embodiments, some constraints are also applied to the first training sample text to avoid that the first training sample text is too long beyond what the model can handle.
In one mode, whether the length of the text added with the separation mark reaches a preset length can be judged, and if so, the text after the preset length is discarded. The preset length here may be 512, for example.
Alternatively, it may be determined whether the number of separator marks currently added reaches a predetermined number, and if so, the text after the last separator mark is discarded. The preset number here may be, for example, 10.
For example, assume the first preprocessed sample text is: Today the weather is good/where do you want to go to play/let's go to the bar together. Its total length is 21 (counting the characters of the original Chinese text) and the preset length is 10; then the text from the 11th character onward is deleted, and the first training sample text obtained is: Today the weather is good/where do you.
Or, assume the first preprocessed sample text is: Today the weather is good/where do you/want to go to play/let's go to the bar together, the number of separation marks is 3, and the preset number is 2; then the text from the 3rd separation mark onward is discarded, and the first training sample text obtained is: Today the weather is good/where do you/want to go to play.
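A sketch of the two constraints, with the limits from the examples in this description treated as illustrative parameters:

```python
# Constrain a mark-inserted sample so it stays within what the model can handle.
MAX_LEN = 512    # preset length (example value from the description)
MAX_MARKS = 10   # preset number of separation marks (example value)

def truncate_sample(text: str, sep: str = "/") -> str:
    if len(text) > MAX_LEN:
        text = text[:MAX_LEN]                   # discard text beyond the preset length
    parts = text.split(sep)
    if len(parts) - 1 > MAX_MARKS:              # too many marks inserted
        text = sep.join(parts[:MAX_MARKS + 1])  # discard text from the (MAX_MARKS+1)-th mark on
    return text
```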
The training of the text processing model is described below:
Fig. 6 is a network schematic diagram of a text processing model according to an embodiment of the present application. As shown in Fig. 6, the input of the initial text processing model is a first training sample text from the first training sample text set, for example "[cls]reclassified into other comprehensive income/measured at fair value/with its changes included in[sep]" (a finance-domain example), together with the labels of its separation marks, for example [1,0,1], and the positions of the separation marks in the first training sample text, for example [7,16,23]. Here the sample text corresponds to input1 (first layer), the positions [7,16,23] to input2 (second layer), and the labels [1,0,1] to input3 (third layer). input1 is fed into a 6-layer BERT (Bidirectional Encoder Representations from Transformers) network, and the prediction results p1, p2, p3 for each separation mark are taken from the output layer of the last layer. Here p1 is the model's predicted merge probability for the separation mark at position 7, p2 for the mark at position 16, and p3 for the mark at position 23. Cross-entropy loss is computed from p1, p2, p3 and the true labels [1,0,1] of the separation marks and back-propagated, the parameters of the network are corrected continuously, and the text processing model is finally obtained by training.
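For illustration, a minimal training-step sketch under stated assumptions: a standard pretrained BERT checkpoint stands in for the 6-layer encoder, a linear head classifies the hidden state at each separation-mark position as merge (1) or no-merge (0), and the mark positions are assumed to already be token indices in the encoded sequence; none of the names below come from the application itself.

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")  # stand-in checkpoint
encoder = BertModel.from_pretrained("bert-base-chinese")            # the application uses 6 layers
head = torch.nn.Linear(encoder.config.hidden_size, 2)               # merge / no-merge
loss_fn = torch.nn.CrossEntropyLoss()

def training_step(text, mark_positions, mark_labels):
    enc = tokenizer(text, return_tensors="pt")
    hidden = encoder(**enc).last_hidden_state[0]         # (seq_len, hidden_size)
    mark_states = hidden[torch.tensor(mark_positions)]   # hidden state of every separation mark
    logits = head(mark_states)                           # (n_marks, 2)
    loss = loss_fn(logits, torch.tensor(mark_labels))    # cross entropy vs. true labels
    loss.backward()                                      # back-propagate to correct parameters
    return loss.item()

# Usage (cf. Fig. 6): input1 = text, input2 = mark positions, input3 = labels.
# loss = training_step(sample_text, [7, 16, 23], [1, 0, 1])
```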
In summary, in the text processing model training method provided by this embodiment, the text processing model is obtained by training on training sample texts into which separation marks have been added. Because each training sample text is annotated with the label information and position information of its separation marks, the label information indicating whether the texts at a mark's position need to be merged and being generated from the real semantics of the text at that position, the labeling accuracy is high; the trained text processing model can therefore accurately merge a target processing text based on the annotated label information and position information. Since the training sample texts can be obtained by concatenating multiple lines of text, the trained model is suited to merging multiple lines of text at once, which improves the efficiency of multi-line text merging.
In addition, a sample training text set corresponding to a target field is obtained by manually labeling samples of that field, expanding the automatically labeled sample training text set. This saves labeling cost while preserving sample generality and improving accuracy in the specific scenario, and further improves the generalization of the trained text processing model.
The following describes a device, equipment, a storage medium, etc. for executing the text processing method and the text processing model training method provided in the present application, and specific implementation processes and technical effects of the device and the equipment are referred to above, and are not described in detail below.
Fig. 7 is a schematic diagram of a text processing device according to an embodiment of the present application, where functions implemented by the text processing device correspond to steps performed by the text processing device. The apparatus may be understood as a server as described above, or a processor of a server, or may be understood as a component, which is independent from the server or the processor and performs the functions of the present application under the control of the server, as shown in fig. 7, where the apparatus may include: a reading module 710, a marking module 720, a processing module 730;
a reading module 710, configured to read text of at least one cell in the file to be processed;
a marking module 720, configured to add a separation mark to the text of at least one cell, so as to obtain a target processing text;
the processing module 730 is configured to input the target processing text into a pre-trained text processing model, identify whether the text segmented by each separation mark in the target processing text needs to be combined, and perform a combination process on the target processing text according to the identification result to obtain at least one target text; the text processing model is trained using a training sample text having annotation information comprising labels of separator markers added to the training sample text and locations of the separator markers.
Optionally, the marking module 720 is specifically configured to add a separation mark between the texts of each adjacent cell, so as to obtain the target processing text.
Optionally, the marking module 720 is specifically configured to add a separation mark between the texts of each adjacent cell, and add a separation mark between the texts in each cell, so as to obtain the target processing text.
Optionally, the marking module 720 is specifically configured to insert a separation mark in at least one random position of the text in each cell, so as to obtain the target processing text.
Optionally, the marking module 720 is specifically configured to perform word segmentation on the text in each cell to obtain a word segmentation result;
determine at least one complete word in the text in the cell according to the word segmentation result;
determine at least one target word from the at least one complete word;
and add a separation mark within each target word.
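A sketch of this word-level marking follows; jieba as the segmenter, "/" as the separation mark, and treating multi-character words as "complete" words are all assumptions for illustration, not fixed by the embodiment.

```python
# Hypothetical word-level separator insertion: segment the cell text,
# pick target words, and place a separation mark inside each of them.
import random
import jieba

SEP = "/"

def mark_within_words(cell_text, n_targets=1):
    words = list(jieba.cut(cell_text))            # word segmentation result
    complete = [w for w in words if len(w) >= 2]  # assumed "complete" words
    targets = set(random.sample(complete, min(n_targets, len(complete))))
    out = []
    for w in words:
        if w in targets:
            cut = random.randint(1, len(w) - 1)   # random split point in word
            out.append(w[:cut] + SEP + w[cut:])   # mark added within the word
            targets.discard(w)                    # mark each target only once
        else:
            out.append(w)
    return "".join(out)
```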
Fig. 8 is a schematic diagram of a text processing model training apparatus according to an embodiment of the present application; the functions implemented by the apparatus correspond to the steps of the text processing model training method described above. As shown in fig. 8, the apparatus may include: an acquisition module 810 and a training module 820;
The acquisition module 810 is configured to collect a plurality of first initial sample texts and preprocess the first initial sample texts to obtain a first sample training text set, where the first sample training text set includes a plurality of first training sample texts, each first training sample text has labeling information, and the labeling information includes: the label of each separation mark added to the first training sample text, and the position of the separation mark;
the training module 820 is configured to train with the first sample training text set to obtain the text processing model.
Optionally, the acquisition module 810 is specifically configured to:
extract a plurality of first initial sample texts from at least one sample file having a preset format, where each first initial sample text includes the text of at least one cell in the sample file;
perform noise reduction on each first initial sample text and delete the non-text characters in each first initial sample text to obtain a first preprocessed sample text corresponding to each first initial sample text;
add a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text;
and obtain the first sample training text set from the first training sample texts.
Optionally, the training module 820 is specifically configured to:
acquire a second sample training text set corresponding to the target field, where the labeling information of each second training sample text in the second sample training text set is labeled by a user;
and train with the first sample training text set and the second sample training text set to obtain the text processing model.
Optionally, the acquisition module 810 is specifically configured to extract whole columns of cell texts, column by column, from the ruled tables in at least one sample file having a preset format, and concatenate the texts of each column in order as a first initial sample text.
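As a sketch, this column-wise extraction could look like the following; .xlsx sample files, the openpyxl library, and the "/" concatenation mark are assumptions, and detecting whether a table is ruled is omitted.

```python
# Hypothetical column-wise extraction of first initial sample texts.
from openpyxl import load_workbook

def column_samples(path, sep="/"):
    wb = load_workbook(path)
    samples = []
    for ws in wb.worksheets:
        for col in ws.iter_cols(values_only=True):   # whole columns, in order
            cells = [str(v) for v in col if v is not None]
            if cells:
                samples.append(sep.join(cells))      # concatenate cell texts
    return samples
```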
Optionally, the acquisition module 810 is specifically configured to perform full-width to half-width conversion on the first initial sample text and delete the non-text characters in the first initial sample text to obtain the first preprocessed sample text corresponding to the first initial sample text, where the non-text characters include: preset separators, spaces, hypertext markup language tags, and garbled Chinese characters.
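A minimal sketch of this noise-reduction step is shown below; the NFKC-based full-width-to-half-width mapping and the regular expressions are assumptions about what counts as non-text characters here.

```python
# Hypothetical preprocessing: full-width -> half-width, then delete
# non-text characters (separators, spaces, HTML tags, garbled chars).
import re
import unicodedata

def preprocess(text):
    text = unicodedata.normalize("NFKC", text)          # full-width to half-width
    text = re.sub(r"<[^>]+>", "", text)                 # HTML markup tags
    text = re.sub(r"[|/\t]", "", text)                  # preset separators (assumed)
    text = re.sub(r"\s+", "", text)                     # spaces and whitespace
    text = re.sub(r"[\ufffd\ue000-\uf8ff]", "", text)   # garbled/replacement chars
    return text
```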
Optionally, the acquisition module 810 is specifically configured to insert separation marks into the first preprocessed sample text, and if the character length of the first preprocessed sample text after the current separation mark is inserted reaches a preset length, or the number of inserted separation marks reaches a preset number, delete the text beyond the preset length from the first preprocessed sample text to obtain a first training sample text.
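The length/count cap might be sketched as below; the concrete limits (510 characters to fit a BERT input budget, 30 marks) are illustrative assumptions, not values given by the embodiment.

```python
# Hypothetical truncation while building a first training sample text.
MAX_LEN = 510    # assumed preset character length
MAX_SEPS = 30    # assumed preset number of separation marks

def truncate_sample(text, sep="/"):
    if len(text) > MAX_LEN:
        text = text[:MAX_LEN]          # delete text beyond the preset length
    if text.count(sep) > MAX_SEPS:
        idx = -1
        for _ in range(MAX_SEPS):      # locate the MAX_SEPS-th separator
            idx = text.index(sep, idx + 1)
        text = text[:idx + 1]          # keep text up to that mark
    return text
```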
The foregoing apparatus is used for executing the method provided in the foregoing embodiment, and its implementation principle and technical effects are similar, and are not described herein again.
The above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (Application Specific Integrated Circuit, ASIC), one or more digital signal processors (Digital Signal Processor, DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, FPGA), or the like. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The modules may be connected or communicate with each other via wired or wireless connections. The wired connection may include a metal cable, optical cable, hybrid cable, or the like, or any combination thereof. The wireless connection may include a connection through a LAN, WAN, bluetooth, zigBee, or NFC, or any combination thereof. Two or more modules may be combined into a single module, and any one module may be divided into two or more units. It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the method embodiments, which are not described in detail in this application.
Fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application, where the device may be integrated in a terminal device or a chip of the terminal device, and the terminal may be a computing device with a data processing function.
The device comprises: a processor 801, and a memory 802.
The memory 802 is used for storing a program, and the processor 801 calls the program stored in the memory 802 to execute the above-described method embodiment. The specific implementation manner and the technical effect are similar, and are not repeated here.
Therein, the memory 802 stores program code that, when executed by the processor 801, causes the processor 801 to perform various steps in the methods according to various exemplary embodiments of the present application described in the above section of the description of the exemplary methods.
The processor 801 may be a general purpose processor such as a Central Processing Unit (CPU), digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in embodiments of the present application. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution.
Optionally, the present application also provides a program product, such as a computer readable storage medium, comprising a program for performing the above-described method embodiments when being executed by a processor.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
Claims (15)
1. A text processing method, comprising:
reading the text of at least one cell in the file to be processed;
adding a separation mark to the text of the at least one cell to obtain a target processing text;
inputting the target processing text into a pre-trained text processing model, identifying whether the texts segmented by each separation mark in the target processing text need to be combined, and combining the target processing text according to the identification result to obtain at least one target text; wherein the text processing model is obtained by training with training sample texts having labeling information, the labeling information comprises label information of the separation marks added to the training sample texts and the positions of the separation marks, and the label information is generated based on the real semantics of the texts at the positions of the separation marks in the training sample texts.
2. The method of claim 1, wherein adding a separator mark to the text of the at least one cell results in target processed text, comprising:
and adding a separation mark between texts of each adjacent cell to obtain target processing texts.
3. The method of claim 1, wherein adding a separator mark to the text of the at least one cell results in target processed text, comprising:
and adding separation marks between texts of each adjacent cell, and adding the separation marks between texts in each cell to obtain target processing texts.
4. A method according to claim 3, wherein said adding separator marks between text within each cell comprises:
and inserting a separation mark in at least one random position of the text in each cell to obtain the target processing text.
5. A method according to claim 3, wherein said adding separator marks between text within each cell comprises:
performing word segmentation on the texts in the cells to obtain a word segmentation result;
determining at least one complete word in the text in the cells according to the word segmentation result;
determining at least one target word from the at least one complete word;
and adding a separation mark within each target word.
6. A method of training a text processing model, the method comprising:
collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set, wherein the first sample training text set comprises a plurality of first training sample texts, each first training sample text has labeling information, and the labeling information comprises: label information of the separation marks added to the first training sample text and the positions of the separation marks, wherein the label information is generated based on the real semantics of the texts at the positions of the separation marks in the first training sample text;
and training with the first sample training text set to obtain a text processing model.
7. The method of claim 6, wherein collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set, comprises:
extracting a plurality of first initial sample texts from at least one sample file with a preset format, wherein each first initial sample text comprises the text of at least one cell in the sample file;
performing noise reduction on each first initial sample text and deleting the non-text characters in each first initial sample text to obtain a first preprocessed sample text corresponding to each first initial sample text;
adding a separation mark to the text of at least one cell in the first preprocessed sample text to obtain a first training sample text;
and obtaining the first sample training text set according to each first training sample text.
8. The method of claim 6, wherein the training with the first sample training text set to obtain the text processing model comprises:
acquiring a second sample training text set corresponding to the target field, wherein labeling information of each second training sample text in the second sample training text set is labeled by a user;
and training with the first sample training text set and the second sample training text set to obtain the text processing model.
9. The method of claim 7, wherein extracting a plurality of first initial sample texts from at least one sample file having a preset format comprises:
extracting whole columns of cell texts, column by column, from the ruled tables in the at least one sample file having the preset format, and concatenating the texts of each column in order as a first initial sample text.
10. The method of claim 7, wherein performing noise reduction on each first initial sample text and deleting the non-text characters in each first initial sample text to obtain the first preprocessed sample text corresponding to each first initial sample text comprises:
performing full-width to half-width conversion on the first initial sample text and deleting the non-text characters in the first initial sample text to obtain the first preprocessed sample text corresponding to the first initial sample text, wherein the non-text characters comprise: preset separators, spaces, hypertext markup language tags, and garbled Chinese characters.
11. The method of claim 7, wherein adding a separator mark to the text of at least one cell in the first preprocessed sample text to obtain the first training sample text, comprises:
and deleting the text with the preset length from the sample text after the first pretreatment if the character length of the sample text after the first pretreatment after the separation mark is inserted currently meets the preset length or the number of the inserted separation marks meets the preset number, so as to obtain the first training sample text.
12. A text processing apparatus, comprising: the device comprises a reading module, a marking module and a processing module;
the reading module is used for reading the text of at least one cell in the file to be processed;
the marking module is used for adding a separation mark to the text of the at least one cell to obtain a target processing text;
the processing module is used for inputting the target processing text into a pre-trained text processing model, identifying whether the texts segmented by each separation mark in the target processing text need to be combined, and combining the target processing text according to the identification result to obtain at least one target text; wherein the text processing model is obtained by training with training sample texts having labeling information, the labeling information comprises label information of the separation marks added to the training sample texts and the positions of the separation marks, and the label information is generated based on the real semantics of the texts at the positions of the separation marks in the training sample texts.
13. A text processing model training apparatus, the apparatus comprising: the system comprises an acquisition module and a training module;
the acquisition module is used for collecting a plurality of first initial sample texts and preprocessing the first initial sample texts to obtain a first sample training text set, wherein the first sample training text set comprises a plurality of first training sample texts, each first training sample text has labeling information, and the labeling information comprises: label information of the separation marks added to the first training sample text and the positions of the separation marks, wherein the label information is generated based on the real semantics of the texts at the positions of the separation marks in the first training sample text;
and the training module is used for training with the first sample training text set to obtain the text processing model.
14. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is running, and the processor executing the program instructions to perform the steps of the text processing method according to any one of claims 1 to 5 or the steps of the text processing model training method according to any one of claims 6 to 11.
15. A computer-readable storage medium, characterized in that the storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the text processing method according to any one of claims 1 to 5 or the steps of the text processing model training method according to any one of claims 6 to 11.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211599089.9A | 2022-12-12 | 2022-12-12 | Text processing method, model training method, device, equipment and storage medium |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
| --- | --- | --- | --- |
| CN202211599089.9A | 2022-12-12 | 2022-12-12 | Text processing method, model training method, device, equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
| --- | --- |
| CN116306527A | 2023-06-23 |
Family
ID=86831182
Family Applications (1)

| Application Number | Title | Priority Date | Filing Date | Status |
| --- | --- | --- | --- | --- |
| CN202211599089.9A | Text processing method, model training method, device, equipment and storage medium | 2022-12-12 | 2022-12-12 | Pending |

Country Status (1)

| Country | Link |
| --- | --- |
| CN | CN116306527A |
Cited By (1)

| Publication number | Priority date | Publication date | Assignee | Title |
| --- | --- | --- | --- | --- |
| WO2025130968A1 | 2023-12-22 | 2025-06-26 | 华为技术有限公司 | Data processing method and apparatus therefor |
Legal Events

| Date | Code | Title | Description |
| --- | --- | --- | --- |
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |