CN113205384A - Text processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113205384A
Authority
CN
China
Prior art keywords
text
processed
structured
structured text
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110507254.2A
Other languages
Chinese (zh)
Other versions
CN113205384B (en)
Inventor
沈广策 (Shen Guangce)
吴建伟 (Wu Jianwei)
熊健 (Xiong Jian)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110507254.2A priority Critical patent/CN113205384B/en
Publication of CN113205384A publication Critical patent/CN113205384A/en
Application granted granted Critical
Publication of CN113205384B publication Critical patent/CN113205384B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 - Commerce
    • G06Q30/06 - Buying, selling or leasing transactions
    • G06Q30/0601 - Electronic shopping [e-shopping]
    • G06Q30/0631 - Item recommendations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Finance (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Accounting & Taxation (AREA)
  • General Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The disclosure provides a text processing method, apparatus, device, and storage medium, relating to the fields of artificial intelligence, natural language processing, and big data. The specific implementation scheme is as follows: acquire a set of texts to be processed and a structured text set; construct a first template representation corresponding to each reference structured text; structure each text to be processed to obtain a candidate structured text of each text to be processed, and construct a second template representation corresponding to the candidate structured text; match the first template representations against the second template representations, determine the second template representations whose template matching results meet a preset condition, and add the candidate structured texts corresponding to the determined second template representations to the structured text set. The disclosed technique improves the processing efficiency of natural-language text and the extraction precision of structured information, and reduces labor cost.

Description

Text processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of data processing technology, and in particular, to the fields of artificial intelligence, natural language processing, and big data.
Background
In the related art, extracting structured information from natural-language text, especially in the application scenario of building a commodity category system from commodity information in the field of commercial promotion, is usually done by combining customer expressions with manual labeling. This approach suffers from high labeling cost and low processing efficiency when facing data volumes of hundreds of thousands to millions of items.
Disclosure of Invention
The present disclosure provides a text processing method, apparatus, device, and storage medium.
According to an aspect of the present disclosure, there is provided a text processing method including:
acquiring a set of texts to be processed and a structured text set; the set of texts to be processed comprises a plurality of texts to be processed, and the structured text set comprises a plurality of reference structured texts;
constructing a first template representation corresponding to each reference structured text; structuring each text to be processed to obtain a candidate structured text of each text to be processed, and constructing a second template representation corresponding to the candidate structured text; and
matching the first template representations and the second template representations, determining the second template representations whose template matching results meet a preset condition, and adding the candidate structured texts corresponding to the determined second template representations to the structured text set.
According to another aspect of the present disclosure, there is provided a training method of a text processing model, including:
determining a target structured text by using a text sample to be processed;
inputting the text sample to be processed into a text processing model to be trained to obtain a predicted structured text; and
training the text processing model to be trained according to the difference between the target structured text and the predicted structured text until the difference is within an allowable range.
According to another aspect of the present disclosure, there is provided a text processing apparatus including:
the text set acquisition module is used for acquiring a text set to be processed and a structured text set; the text set to be processed comprises a plurality of texts to be processed, and the structured text set comprises a plurality of reference structured texts;
the template representation construction module is used for constructing a first template representation corresponding to each reference structured text, and is further used for structuring each text to be processed to obtain a candidate structured text of each text to be processed and constructing a second template representation corresponding to the candidate structured text;
and the matching module is used for matching the first template representation and the second template representation, determining the second template representation corresponding to the template matching result meeting the preset condition, and adding the candidate structured text corresponding to the determined second template representation to the structured text set.
According to another aspect of the present disclosure, there is provided a training apparatus for a text processing model, including:
the target structured text determining module is used for determining a target structured text by utilizing the text sample to be processed;
the predicted structured text acquisition module is used for inputting the text sample to be processed into the text processing model to be trained to obtain a predicted structured text;
and the training module is used for training the text processing model to be trained according to the difference between the target structured text and the predicted structured text until the difference is within an allowable range.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method in any of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method in any of the embodiments of the present disclosure.
According to the disclosed technology, text processing efficiency on massive text data is improved, and the labor cost of manual processing is reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 shows a flowchart of a text processing method according to an embodiment of the present disclosure;
FIG. 2 shows a specific flowchart of template matching in the text processing method according to an embodiment of the present disclosure;
FIG. 3 shows a specific flowchart of determining a second template representation in the text processing method according to an embodiment of the present disclosure;
FIG. 4 shows a specific flowchart of template matching in the text processing method according to an embodiment of the present disclosure;
FIG. 5 shows a specific flowchart of constructing a first template representation in the text processing method according to an embodiment of the present disclosure;
FIG. 6 shows a specific flowchart of constructing a first template representation in the text processing method according to an embodiment of the present disclosure;
FIG. 7 shows a specific flowchart of obtaining candidate structured texts in the text processing method according to an embodiment of the present disclosure;
FIG. 8 shows a specific flowchart of constructing a second template representation in the text processing method according to an embodiment of the present disclosure;
FIG. 9 shows a specific flowchart of constructing a second template representation in the text processing method according to an embodiment of the present disclosure;
FIG. 10 shows a specific flowchart of constructing a second template representation in the text processing method according to an embodiment of the present disclosure;
FIG. 11 shows a specific flowchart of acquiring a set of texts to be processed and a structured text set in the text processing method according to an embodiment of the present disclosure;
FIG. 12 shows a flowchart of a method of training a text processing model according to an embodiment of the present disclosure;
FIG. 13 shows a schematic diagram of a text processing apparatus according to an embodiment of the present disclosure;
FIG. 14 shows a schematic diagram of a training apparatus for a text processing model according to an embodiment of the present disclosure;
FIG. 15 is a block diagram of an electronic device for implementing the text processing method of an embodiment of the present disclosure;
FIG. 16 is a diagram of a specific example of the training method of a text processing model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments are included to assist understanding and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Descriptions of well-known functions and constructions are omitted in the following for clarity and conciseness.
A text processing method according to an embodiment of the present disclosure is described below with reference to fig. 1 to 11.
As shown in fig. 1, a method according to an embodiment of the present disclosure includes the steps of:
s101: acquiring a text set to be processed and a structured text set; the text set to be processed comprises a plurality of texts to be processed, and the structured text set comprises a plurality of reference structured texts;
s102: constructing a first template representation corresponding to each reference structured text; carrying out structuralization processing on each text to be processed to obtain a candidate structuralization text of each text to be processed, and constructing a second template representation corresponding to the candidate structuralization text;
s103: and matching the first template representation and the second template representation, determining the second template representation corresponding to the template matching result meeting the preset condition, and adding the candidate structured text corresponding to the determined second template representation to the structured text set.
For example, in step S101, each text to be processed may be a natural-language text. The reference structured texts may be obtained by extracting structured information from each text to be processed in the set, taking the resulting structured text information as reference structured texts, and building the structured text set from them.
It will be appreciated that, on the basis of the above example, for any reference structured text in the structured text set, there is at least one text to be processed in the set of texts to be processed that corresponds to it.
Illustratively, in step S102, the first template representation corresponding to a reference structured text may be constructed by combining the reference structured text with a preceding-context prefix and/or a following-context suffix.
For example, if the reference structured text is "facial weight-loss liposuction", the first template representation may be "will facial weight-loss liposuction rebound", i.e., a construction combining the reference structured text with a following suffix; or it may combine a preceding prefix, the reference structured text, and a following suffix, as in "which hospital is good for facial weight-loss liposuction".
Similarly, the second template representation corresponding to a candidate structured text may be constructed in the same or a similar way as the first template representation, which is not repeated here.
It should be noted that the structuring process used to obtain candidate structured texts may differ from the process used to generate the reference structured texts, so that the structured text set can be expanded beyond the plurality of reference structured texts.
For example, a text to be processed may be segmented into a plurality of word units. N word units may then be selected from these and recombined to generate a language fragment, yielding a candidate structured text.
For example, in step S103, template matching is performed between the first template representations and the second template representations; matching and filtering may be carried out according to the correlation between them, so as to determine, among the multiple candidate structured texts, the second template representations whose template matching results meet the preset condition.
In a specific example, the structured text set can be supplemented and expanded by calculating the semantic similarity between the first template representations and a second template representation, determining the second template representations whose template matching results meet the preset condition based on that similarity, and adding the candidate structured text corresponding to each such second template representation to the structured text set as a target structured text.
The method of the embodiment of the present disclosure may be applied to a business promotion marketing scenario, and the method according to the embodiment of the present disclosure is described below with reference to a specific application scenario.
In this application scenario, the texts to be processed are service information issued by service owners, and the service information is natural-language text.
First, a service information set and a seed service set are obtained. The service information set, built from multiple pieces of service information issued by different service owners, serves as the set of texts to be processed. For each piece of service information, preliminary structured-information extraction is performed to obtain the seed service point corresponding to it, which is added to a target service point set.
Then, for each seed service point, a first template representation corresponding to it is constructed; the first template representation is a templated expression comprising the seed service point, constructed by combining the seed service point with a preceding prefix and/or a following suffix.
Each piece of service information is then structured to obtain its candidate service points, and a second template representation corresponding to each candidate service point is constructed. The second template representation may be constructed in the same or a similar way as the first.
Finally, for each candidate service point, its second template representation is template-matched against the first template representation of each seed service point; when the template matching result meets the preset condition, the candidate service point is taken as a target service point and added to the target service point set, completing the construction of a service category system for the business promotion marketing scenario.
According to the method of the embodiments of the present disclosure, further mining of the structured information in the texts to be processed is achieved, so that the structured text set is expanded and supplemented on the basis of the reference structured texts, improving the extraction precision of structured-information processing for natural-language text. Moreover, filtering and screening the candidate structured texts by template matching can keep up with text processing over massive text data and reduces the cost of manual labeling.
As shown in FIG. 2, in one embodiment, the template matching result includes semantic similarity and support; step S103 may include the steps of:
s201: calculating semantic similarity between a second template representation corresponding to the candidate structured text and each first template representation aiming at each candidate structured text;
s202: and calculating the number of first template representations corresponding to the semantic similarity meeting the semantic similarity threshold based on the semantic similarity to obtain the support of the candidate structured text.
A semantic similarity meeting the semantic similarity threshold can be understood as the semantic similarity being greater than or equal to the threshold. The semantic similarity threshold may be set according to actual conditions.
Illustratively, the semantic similarity between the second template representation and a first template representation may be obtained by calculating at least one of the cosine similarity, Euclidean distance, Manhattan distance, and Jaccard similarity coefficient between the two.
For each second template representation, based on its semantic similarity to the first template representation corresponding to each reference structured text, the number of first template representations whose semantic similarity exceeds the semantic similarity threshold is determined, giving the support of the candidate structured text. In other words, the support of a candidate structured text characterizes, among all first template representations, the number whose semantic similarity with the second template representation exceeds the threshold.
According to this embodiment, by calculating the semantic similarity between the second template representation corresponding to a candidate structured text and each first template representation, and computing the support of the candidate structured text from those similarities, the first and second template representations can be matched in terms of both semantic similarity and count, improving matching accuracy.
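As one hedged sketch of S201 and S202, the cosine variant of the similarity computation and the support count can be written as follows. Bag-of-words vectors stand in for whatever semantic representation an implementation actually uses, and the threshold values are illustrative:

```python
import math
from collections import Counter

def cosine_sim(a, b):
    """Cosine similarity over bag-of-words vectors, one stand-in for the
    semantic similarity measures named in the text (cosine, Euclidean,
    Manhattan, Jaccard)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def support(second_template, first_templates, sim_threshold=0.6):
    """S201-S202: count the first template representations whose similarity
    to the second template representation meets the threshold."""
    return sum(cosine_sim(second_template, f) >= sim_threshold
               for f in first_templates)
```

With a strict threshold only near-identical templates count toward the support; a looser threshold admits templates that merely share the slot marker and a few tokens.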
As shown in fig. 3, in an embodiment, determining a second template representation corresponding to the template matching result meeting the preset condition, step S103 may further include the following steps:
s301: and under the condition that the support degree of the candidate structured text meets the support degree threshold, determining the second template representation corresponding to the candidate structured text as the second template representation corresponding to the template matching result meeting the preset condition.
The support meeting the support threshold can be understood as the support being greater than or equal to the threshold; the support threshold may be set according to actual conditions. For example, with a support threshold of 5, a candidate structured text is added to the structured text set when the number of first template representations whose semantic similarity to its second template representation meets the semantic similarity threshold is greater than or equal to 5.
Through this embodiment, template matching accuracy can be improved, and the finally determined candidate structured texts are ensured to be highly correlated with the reference structured texts, so that the structured texts added to the structured text set are more standardized and uniform.
As shown in fig. 4, in one embodiment, step S102 includes the steps of:
s401: aiming at each reference structured text, acquiring a text to be processed which is matched with the reference structured text in a text set to be processed;
s402: and constructing a first template representation corresponding to the reference structured text based on the text to be processed matched with the reference structured text.
For example, the texts to be processed that match a reference structured text may be obtained by keyword matching: the set of texts to be processed is traversed, and the texts whose literal content contains the reference structured text are found.
For instance, for the reference structured text "facial weight-loss liposuction", keyword matching retrieves from the set several texts to be processed containing it, such as "will facial weight-loss liposuction rebound", "which facial weight-loss liposuction hospital is good", and "which Beijing facial weight-loss liposuction hospital is good"; these are the texts to be processed that match the reference structured text.
A first template representation corresponding to the reference structured text is then constructed from each matching text to be processed. That is, when multiple texts to be processed match the reference structured text, multiple first template representations are constructed for it.
According to the embodiment, the first template representation corresponding to the reference structured text is constructed by using the text to be processed matched with the reference structured text, the obtained first template representation is relatively consistent with the natural language expression mode of the text to be processed, and other corpora do not need to be referred to, so that the construction difficulty of the first template representation is reduced.
As shown in fig. 5, in one embodiment, step S402 includes the steps of:
s501: acquiring the page browsing amount of the text to be processed aiming at the text to be processed matched with the reference structured text;
s502: and under the condition that the page browsing amount meets a preset condition, constructing a first template representation corresponding to the reference structured text.
In step S501, in a service promotion marketing scenario, the text to be processed is historical service information issued by different service owners, and the page browsing volume of the text to be processed is the page browsing volume of the historical service information.
Illustratively, in step S502, the text to be processed matching the reference structured text is filtered based on the page browsing amount of the text to be processed matching the reference structured text. The preset condition may be a preset page browsing amount threshold, and when the page browsing amount of the text to be processed matched with the reference structured text is smaller than the page browsing amount threshold, the text to be processed is filtered, and the text to be processed corresponding to the page browsing amount larger than or equal to the page browsing amount threshold is reserved as a basis for constructing the first template representation corresponding to the reference structured text. The page browsing amount threshold value can be specifically set according to actual conditions.
By the embodiment, the text to be processed matched with the reference structured text is filtered based on the page browsing amount of the text to be processed, so that the text to be processed according to which the first template representation is constructed has a certain attention, and the commercial reference value of the first template representation corresponding to the reference structured text is improved.
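A minimal sketch combining the keyword matching of S401 with the page-view filtering of S501 and S502; the function name, the `page_views` mapping, and the default threshold are illustrative assumptions, not taken from the patent:

```python
def matching_texts(reference, corpus, page_views, min_views=100):
    """S401 + S501-S502 sketch: keep the texts to be processed that
    literally contain the reference structured text and whose page view
    count meets a preset threshold."""
    return [t for t in corpus
            if reference in t and page_views.get(t, 0) >= min_views]
```

The surviving texts then serve as the basis for constructing first template representations.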
As shown in fig. 6, step S402 may further include the steps of:
s601: taking the reference structured text as a first slot unit;
s602: determining a first text unit matched with the reference structured text in the text to be processed matched with the reference structured text;
s603: and replacing the first text unit with a first slot unit in the text to be processed matched with the reference structured text to obtain a first template representation corresponding to the reference structured text.
In a specific example, the reference structured text is "facial weight-loss liposuction", which is taken as the first slot unit. The matching texts to be processed may include "will facial weight-loss liposuction rebound", "which Beijing facial weight-loss liposuction hospital is good", "which facial weight-loss liposuction is good", and "which hospital is good at facial weight-loss liposuction"; in each, the first text unit literally matching the reference structured text, i.e., "facial weight-loss liposuction", and its specific position are determined. The first text unit is then replaced with the first slot unit [slot] in each text to be processed, yielding the first template representations corresponding to the reference structured text, namely: {w1: "will [slot] rebound", w2: "which [region word] [slot] hospital is good", w3: "which [slot] is good", w4: "which hospital is good at [slot]"}, where w1 through w4 are all first template representations corresponding to the reference structured text "facial weight-loss liposuction".
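The slot-replacement construction of S601 through S603 can be sketched as follows, using English stand-in strings; the helper name is hypothetical:

```python
def first_templates(reference, matched_texts, slot="[slot]"):
    """S601-S603: treat the reference structured text as the slot unit and
    replace its occurrence (the first text unit) in each matching text to
    be processed, producing one template representation per text."""
    return [t.replace(reference, slot) for t in matched_texts]
```

Each matching text contributes one template, so a reference structured text with many matching texts accumulates many first template representations.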
As shown in fig. 7, in one embodiment, step S102 includes the steps of:
s701: and inputting each text to be processed in the text set to be processed into the trained language model, and performing word segmentation and recombination to obtain a candidate structured text of the text to be processed.
It is to be understood that the language model in step S701 may adopt various language models capable of implementing word segmentation and recombination processing on the text to be processed.
Illustratively, the language model may be an N-gram language model (also called an N-element grammar model). Its basic idea is to recombine the multiple units (grams) contained in the literal content of a text to be processed into language fragments (n-grams) each containing N units, count the frequency of occurrence of each fragment over the resulting set of fragments, and output the fragment with the highest frequency, which can serve as the candidate structured text.
Exemplarily, when N = 1, that is, when a language fragment contains only one byte unit, the resulting language fragment is a unary language fragment (unigram); when N = 2, that is, when a language fragment contains two byte units, the resulting language fragment is a binary language fragment (bigram); when N = 3, that is, when a language fragment contains three byte units, the resulting language fragment is a ternary language fragment (trigram). N can be set according to actual conditions.
In one specific example, the text to be processed is "Beijing IELTS English training how much money", and the text to be processed is input into the trained N-gram language model. First, through word segmentation processing, a plurality of byte units are obtained, namely "Beijing", "IELTS", "English", "training", and "how much money". The byte units are then recombined into language fragments, the occurrence frequency of each fragment is counted, and the N-gram language fragment with the highest occurrence frequency is output as the candidate structured text corresponding to the text to be processed. For example, in the case where N = 1, a resulting unary language fragment is "IELTS"; in the case where N = 2, a resulting binary language fragment is "IELTS English" or "English training"; in the case where N = 3, a resulting ternary language fragment is "IELTS English training".
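A minimal sketch of the fragment-counting step, assuming the corpus has already been word-segmented (the function name and the toy corpus are illustrative):

```python
from collections import Counter

def ngram_fragments(token_lists, n):
    """Count every fragment of n consecutive byte units across a corpus
    of pre-segmented texts to be processed."""
    counts = Counter()
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

corpus = [
    ["Beijing", "IELTS", "English", "training", "how much money"],
    ["IELTS", "English", "training", "price"],
]
# The most frequent fragment is output as a candidate structured text.
fragment, freq = ngram_fragments(corpus, 3).most_common(1)[0]
print(fragment, freq)
# -> IELTS English training 2
```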
It should be noted that the N-gram language model can be trained by those skilled in the art using various methods currently known or developed in the future.
According to this embodiment, the candidate structured text of the text to be processed is constructed by using the language model. On one hand, the text to be processed can be further processed on the basis of the acquired reference structured texts to generate candidate structured texts, which serve as candidate supplementary texts for expanding the structured text set; on the other hand, the construction difficulty of the candidate structured text is reduced and its construction efficiency is improved.
As shown in fig. 8, in one embodiment, step S102 includes the steps of:
s801: aiming at each candidate structured text, acquiring a text to be processed which is matched with the candidate structured text in a text set to be processed;
s802: and constructing a second template representation corresponding to the candidate structured text based on the to-be-processed text matched with the candidate structured text in the to-be-processed text set.
For example, the text to be processed that matches the candidate structured text may be obtained by keyword matching. Specifically, the text set to be processed is traversed based on the candidate structured text, and the texts to be processed whose literal information matches the candidate structured text are found.
For example, the candidate structured text is "shoulder liposuction", and a plurality of texts to be processed containing "shoulder liposuction" are found from the text set to be processed through keyword matching, such as "will shoulder liposuction rebound" and "shoulder liposuction which hospital is good"; these texts to be processed match the candidate structured text.
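The keyword-matching traversal can be sketched as follows (the function and sample data are illustrative):

```python
def match_pending_texts(candidate, pending_texts):
    """Traverse the text set to be processed and keep the texts whose
    literal content contains the candidate structured text."""
    return [text for text in pending_texts if candidate in text]

pending = [
    "will shoulder liposuction rebound",
    "shoulder liposuction which hospital is good",
    "facial liposuction which is good",
]
print(match_pending_texts("shoulder liposuction", pending))
# -> ['will shoulder liposuction rebound', 'shoulder liposuction which hospital is good']
```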
A second template representation corresponding to the candidate structured text is then constructed based on each text to be processed that matches the candidate structured text. That is, when a plurality of texts to be processed matching the candidate structured text are acquired, a plurality of second template representations corresponding to the candidate structured text are constructed.
According to this embodiment, the second template representation corresponding to the candidate structured text is constructed by using the texts to be processed that match the candidate structured text, and the template construction form of the second template representation is the same as that of the first template representation. This provides a basis for subsequently performing template matching between the first template representation and the second template representation, no other corpora need to be referenced, and the construction difficulty of the second template representation is reduced.
As shown in fig. 9, in one embodiment, step S802 includes the following steps:
s901: acquiring relevant parameters of the text to be processed based on the text to be processed matched with the candidate structured text in the text set to be processed, wherein the relevant parameters comprise page browsing amount and/or transaction records of the text to be processed;
s902: and under the condition that the relevant parameters of the text to be processed meet the preset conditions, constructing a second template representation corresponding to the candidate structured text.
Exemplarily, in step S901, in a business promotion marketing scenario, the texts to be processed are historical business information issued by different business owners, and the relevant parameters of a text to be processed are the page browsing amount and/or the transaction records of the corresponding historical business information.
Illustratively, in step S902, the texts to be processed that match the candidate structured text are filtered based on their relevant parameters. The preset condition may be a preset page browsing amount threshold and/or transaction quantity threshold: a text to be processed is filtered out when its page browsing amount is smaller than the page browsing amount threshold, and/or when its transaction records are fewer than the transaction quantity threshold. The texts to be processed whose page browsing amount and/or transaction records reach the corresponding thresholds are retained as the basis for constructing the second template representation corresponding to the candidate structured text. The thresholds can be set according to actual conditions.
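One possible form of the filtering in steps S901 and S902; the threshold values and the dictionary-based parameter lookup are assumptions for illustration only:

```python
def filter_by_relevant_parameters(texts, params, view_threshold=1000, txn_threshold=10):
    """Keep only the matched texts whose page browsing amount and
    transaction count both reach their (illustrative) thresholds."""
    kept = []
    for text in texts:
        views, txns = params.get(text, (0, 0))
        if views >= view_threshold and txns >= txn_threshold:
            kept.append(text)
    return kept

params = {
    "will shoulder liposuction rebound": (2500, 40),          # retained
    "shoulder liposuction which hospital is good": (300, 2),  # filtered out
}
print(filter_by_relevant_parameters(list(params), params))
# -> ['will shoulder liposuction rebound']
```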
Through this embodiment, the texts to be processed that match the candidate structured text but have low attention and/or a low transaction amount can be filtered out, thereby ensuring that the finally determined candidate structured text has certain commercial value.
As shown in fig. 10, in one embodiment, step S802 includes the steps of:
s1001: taking the candidate structured text as a second slot unit;
s1002: acquiring a second text unit matched with the candidate structured text in the text to be processed matched with the candidate structured text
S1003: and replacing the second text unit with a second slot unit in the text to be processed matched with the candidate structured text to obtain a second template representation corresponding to the candidate structured text.
In one specific example, the candidate structured text is "shoulder liposuction", which is taken as the second slot unit. The texts to be processed that match the candidate structured text may include "will shoulder liposuction rebound", "Beijing shoulder liposuction which hospital is good", "shoulder liposuction which is good", and "want to do shoulder liposuction which hospital is good". Based on these texts to be processed, the second text unit that literally matches the candidate structured text, namely "shoulder liposuction", and its specific position are determined. Then, in each text to be processed, the second text unit is replaced with the second slot unit [slot] to obtain the second template representations corresponding to the candidate structured text, namely: { w1: "will [slot] rebound", w2: "[region word] [slot] which hospital is good", w3: "[slot] which is good", w4: "want to do [slot] which hospital is good" }, where w1 to w4 are all second template representations corresponding to the candidate structured text "shoulder liposuction".
As shown in fig. 11, in one embodiment, step S101 includes the steps of:
s1101: acquiring a text set to be processed;
s1102: extracting a reference structured text corresponding to the text to be processed by utilizing the trained text processing model based on each text to be processed;
s1103: based on the reference structured text, a structured text set is constructed.
For example, in step S1101, for a structured text extraction task in a business promotion marketing scenario, a text set to be processed may be constructed by acquiring a plurality of pieces of business information issued by different business owners.
Illustratively, in step S1102, the text processing model may be a sequence labeling model; more specifically, the sequence labeling model may employ a named entity recognition model. The text to be processed is input into the trained named entity recognition model, named entity recognition is performed on the text to be processed, and the output named entity is taken as the reference structured text corresponding to the text to be processed. The named entity recognition model may specifically adopt a maximum entropy model, a conditional random field (CRF) model, a hidden Markov model (HMM), a neural network, or another model.
In a specific example, taking the conditional random field model as an example, after the text to be processed is input into the trained conditional random field model, each byte unit in the text to be processed is labeled by the BIO labeling method. According to the labeling result, the byte units labeled "O" are filtered out, and the byte units labeled "B" and "I" are combined into a byte segment and output. Here, "B" and "I" denote the beginning and the middle of a noun phrase, respectively, and "O" denotes a byte unit that does not belong to a noun phrase.
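The B/I/O post-processing described above can be sketched as follows (a generic BIO span merger, not the disclosure's exact code):

```python
def decode_bio(tokens, tags):
    """Filter out byte units tagged "O" and merge consecutive
    "B"/"I" byte units into output byte segments."""
    segments, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                segments.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments

# Byte units are joined without a separator, as they would be for Chinese text.
print(decode_bio(["Beijing", "TOEFL", "exam", "where", "register"],
                 ["O", "B", "I", "O", "O"]))
# -> ['TOEFLexam']
```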
Through the embodiment, the trained text processing model is utilized, the reference structured text corresponding to the text to be processed can be automatically acquired, the text to be processed does not need to be manually marked, the labor labeling cost is saved, and the acquisition efficiency and the extraction accuracy of the reference structured text are improved.
According to an embodiment of the present disclosure, the present disclosure further provides a training method of a text processing model.
As shown in fig. 12, the training method specifically includes the following steps:
s1201: determining a target structured text by using a text sample to be processed;
s1202: inputting a text sample to be processed into a text processing model to be trained to obtain a prediction structured text;
s1203: and training the text processing model to be trained according to the difference between the target structured text and the prediction structured text until the difference is within an allowable range.
In the embodiment of the present disclosure, the text processing model may be a sequence labeling model; more specifically, the sequence labeling model may adopt a named entity recognition model. The named entity recognition model may specifically adopt a maximum entropy model, a conditional random field (CRF) model, a hidden Markov model (HMM), a neural network, or another model.
For example, in step S1201, the text sample to be processed may be matched with a pre-established encyclopedia library and labeled in combination with the BIO labeling method, so as to determine the target structured text.
Illustratively, in steps S1202 and S1203, the difference between the predicted structured text and the target structured text may be obtained by means of manual evaluation. For a predicted structured text whose manual evaluation result is not confirmed, the target structured text is constructed by manual labeling, and the corresponding text sample to be processed is input into the text processing model again for retraining. The trained text processing model is obtained through multiple rounds of iterative tuning.
The following describes a training method of a text processing model according to an embodiment of the present disclosure in a specific example with reference to fig. 16, in this example, the text processing model is a CRF model.
As shown in fig. 16, the text sample to be processed is matched with the encyclopedia entries in the encyclopedia library, and the training corpus is obtained by the BIO labeling method, so as to determine the target structured text. For example, the matched encyclopedia entry is "TOEFL exam", and the text sample to be processed (i.e., a bid keyword) is "Beijing TOEFL exam where to register". The text sample to be processed is labeled by the BIO labeling method, and the obtained labeling result is "Beijing: O", "TOEFL: B", "exam: I", "where: O", and "register: O", where "TOEFL" and "exam" constitute the target structured text.
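The dictionary-matching labeling step can be sketched as follows, assuming the bid keyword has already been segmented into byte units (the function name and sample tokens are illustrative):

```python
def bio_label(tokens, entry_tokens):
    """Label a segmented bid keyword against one encyclopedia entry:
    the first matched byte unit gets "B", the rest of the entry "I",
    and every other byte unit stays "O"."""
    tags = ["O"] * len(tokens)
    n = len(entry_tokens)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == entry_tokens:
            tags[i] = "B"
            for j in range(i + 1, i + n):
                tags[j] = "I"
    return tags

print(bio_label(["Beijing", "TOEFL", "exam", "where", "register"],
                ["TOEFL", "exam"]))
# -> ['O', 'B', 'I', 'O', 'O']
```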
And then, inputting the text sample to be processed into a CRF model to be trained to obtain a prediction structured text.
Finally, the predicted structured text is evaluated manually or by machine to obtain the difference between the predicted structured text and the target structured text. If the difference is not within the preset range, the untrusted predicted structured text is re-labeled by model labeling or manual labeling and input into the text processing model for training; the trained text processing model is obtained through multiple iterations.
Based on the above example, for the acquisition of the target structured text, manual labeling is not needed, so that the labor cost is saved, the labeling time of the training corpus is shortened, and the training efficiency of the text processing model is improved.
According to an embodiment of the present disclosure, the present disclosure also provides a text processing apparatus.
As shown in fig. 13, the apparatus includes:
a text set obtaining module 1301, configured to obtain a text set to be processed and a structured text set; the text set to be processed comprises a plurality of texts to be processed, and the structured text set comprises a plurality of reference structured texts;
a template representation construction module 1302, configured to construct a first template representation corresponding to each reference structured text, and further configured to perform structuralization processing on each text to be processed to obtain a candidate structured text of each text to be processed and construct a second template representation corresponding to the candidate structured text;
and the matching module 1303 is configured to match the first template representation and the second template representation, determine a second template representation corresponding to a template matching result meeting a preset condition, and add a candidate structured text corresponding to the determined second template representation to the structured text set.
In one embodiment, the template matching result comprises semantic similarity and support; the matching module 1303 includes:
the semantic similarity operator module is used for calculating the semantic similarity between the second template representation and each first template representation corresponding to each candidate structured text;
and the support degree operator module is used for calculating the number of first template representations corresponding to the semantic similarity meeting the semantic similarity threshold based on the semantic similarity to obtain the support degree of the candidate structured text.
In one embodiment, the matching module 1303 further includes:
and the second template representation determining unit is used for determining the second template representation corresponding to the candidate structured text as the second template representation corresponding to the template matching result meeting the preset condition under the condition that the support degree of the candidate structured text meets the support degree threshold value.
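The similarity-and-support computation performed by these sub-modules can be sketched as follows; the character-overlap similarity is only a stand-in, since the disclosure does not fix a particular semantic similarity metric, and the thresholds are illustrative:

```python
def similarity(a, b):
    """Toy stand-in for semantic similarity: Jaccard overlap of the
    character sets of two template representations."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def support(second_template, first_templates, sim_threshold=0.5):
    """Support = number of first template representations whose similarity
    to the second template representation meets the similarity threshold."""
    return sum(1 for t in first_templates
               if similarity(second_template, t) >= sim_threshold)

firsts = ["will [slot] rebound", "[slot] which hospital is good"]
second = "will [slot] rebound at all"
# The candidate structured text is added to the structured text set
# when its support meets the support threshold.
print(support(second, firsts) >= 1)
# -> True
```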
In one embodiment, the template representation building module 1302 includes:
the text matching sub-module is used for acquiring the texts to be processed which are matched with the reference structured texts in the text set to be processed aiming at each reference structured text;
and the first template representation construction sub-module is used for constructing a first template representation corresponding to the reference structured text based on the text to be processed matched with the reference structured text.
In one embodiment, the first template representation construction sub-module comprises:
the page browsing amount acquisition unit is used for acquiring the page browsing amount of the text to be processed aiming at the text to be processed matched with the reference structured text;
and the first template representation construction unit is used for constructing a first template representation corresponding to the reference structured text under the condition that the page browsing amount meets a preset condition.
In one embodiment, the first template representation construction sub-module comprises:
a first slot unit determining unit, configured to use the reference structured text as a first slot unit;
the first text unit determining unit is used for determining a first text unit matched with the reference structured text in the text to be processed matched with the reference structured text;
and the first template representation construction unit is used for replacing the first text unit with a first slot unit in the text to be processed matched with the reference structured text to obtain a first template representation corresponding to the reference structured text.
In one embodiment, the template representation building module 1302 includes:
and the candidate structured text construction sub-module is used for inputting each text to be processed in the text set to be processed into the trained language model, and performing word segmentation and recombination processing to obtain a candidate structured text of the text to be processed.
In one embodiment, the template representation building module 1302 includes:
the text matching sub-module is used for acquiring the texts to be processed which are matched with the candidate structured texts in the text set to be processed aiming at each candidate structured text;
and the second template representation construction sub-module is used for constructing a second template representation corresponding to the candidate structured text based on the to-be-processed text matched with the candidate structured text in the to-be-processed text set.
In one embodiment, the second template representation construction sub-module comprises:
the parameter acquisition unit is used for acquiring relevant parameters of the text to be processed based on the text to be processed matched with the candidate structured text in the text set to be processed, wherein the relevant parameters comprise the page browsing amount and/or the transaction record of the text to be processed;
and the second template representation construction unit is used for constructing a second template representation corresponding to the candidate structured text under the condition that the relevant parameters of the text to be processed meet the preset conditions.
In one embodiment, the second template representation construction sub-module comprises:
a second slot determining unit, configured to use the candidate structured text as a second slot unit;
the second text unit determining unit is used for acquiring a second text unit matched with the candidate structured text in the text to be processed matched with the candidate structured text;
and the second template representation construction unit is used for replacing the second text unit with a second slot unit in the text to be processed matched with the candidate structured text to obtain a second template representation corresponding to the candidate structured text.
In one embodiment, the text set obtaining module 1301 includes:
the to-be-processed text set acquisition submodule is used for acquiring a to-be-processed text set;
the reference structured text extraction submodule is used for extracting a reference structured text corresponding to the text to be processed by utilizing a trained text processing model based on each text to be processed;
and the structured text set constructing submodule is used for constructing a structured text set based on the reference structured text.
According to an embodiment of the present disclosure, the present disclosure also provides a text processing apparatus.
As shown in fig. 14, the apparatus includes:
a target structured text determining module 1401, configured to determine a target structured text by using the text sample to be processed;
a predicted structured text obtaining module 1402, configured to input a text sample to be processed into a text processing model to be trained, so as to obtain a predicted structured text;
a training module 1403, configured to train the text processing model to be trained according to the difference between the target structured text and the predicted structured text until the difference is within an allowable range.
The functions of each unit, module or sub-module in each apparatus in the embodiments of the present disclosure may refer to the corresponding description in the above method embodiments, and are not described herein again.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 15 shows a schematic block diagram of an example electronic device 1500 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 15, the electronic device 1500 includes a computing unit 1501, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1502 or a computer program loaded from a storage unit 1508 into a Random Access Memory (RAM) 1503. In the RAM 1503, various programs and data necessary for the operation of the electronic device 1500 can also be stored. The computing unit 1501, the ROM 1502, and the RAM 1503 are connected to each other by a bus 1504. An input/output (I/O) interface 1505 is also connected to the bus 1504.
Various components in the electronic device 1500 connect to the I/O interface 1505, including: an input unit 1506 such as a keyboard, a mouse, and the like; an output unit 1507 such as various types of displays, speakers, and the like; a storage unit 1508, such as a magnetic disk, optical disk, or the like; and a communication unit 1509 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1509 allows the electronic device 1500 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1501 may be various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The computing unit 1501 executes the respective methods and processes described above, such as the text processing method. For example, in some embodiments, the text processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1500 via the ROM 1502 and/or the communication unit 1509. When the computer program is loaded into the RAM 1503 and executed by the computing unit 1501, one or more steps of the text processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1501 may be configured to perform the text processing method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, and that receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (27)

1. A text processing method, comprising:
acquiring a text set to be processed and a structured text set; the text set to be processed comprises a plurality of texts to be processed, and the structured text set comprises a plurality of reference structured texts;
constructing a first template representation corresponding to each reference structured text; carrying out structuralization processing on each text to be processed to obtain a candidate structuralization text of each text to be processed, and constructing a second template representation corresponding to the candidate structuralization text;
and matching the first template representation and the second template representation, determining a second template representation corresponding to a template matching result meeting a preset condition, and adding a candidate structured text corresponding to the determined second template representation to the structured text set.
2. The method of claim 1, wherein the template matching result comprises semantic similarity and support;
said matching the first template representation and the second template representation comprises:
for each candidate structured text, calculating semantic similarity between a second template representation corresponding to the candidate structured text and each first template representation;
and calculating the number of first template representations corresponding to the semantic similarity meeting a semantic similarity threshold based on the semantic similarity to obtain the support of the candidate structured text.
3. The method according to claim 2, wherein the determining the second template representation corresponding to the template matching result meeting the preset condition includes:
and under the condition that the support degree of the candidate structured text meets a support degree threshold, determining the second template representation corresponding to the candidate structured text as the second template representation corresponding to the template matching result meeting the preset condition.
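The matching logic of claims 2 and 3 can be illustrated with a minimal sketch: compare a candidate's second template representation against every first template representation by semantic similarity, count the matches that meet the similarity threshold as the candidate's support, and accept the candidate when its support meets the support threshold. The vector embeddings, cosine metric, and both threshold values below are illustrative assumptions, not details specified by the claims.

```python
import math

def cosine(a, b):
    # Cosine similarity between two template-representation vectors
    # (the vectors themselves are an assumed stand-in for a real
    # semantic encoding of the template representations).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def support(candidate_vec, reference_vecs, sim_threshold=0.8):
    # Support = number of first template representations whose similarity
    # to the candidate's second template representation meets the threshold.
    return sum(1 for ref in reference_vecs
               if cosine(candidate_vec, ref) >= sim_threshold)

def accept(candidate_vec, reference_vecs, sim_threshold=0.8, support_threshold=2):
    # The candidate is added to the structured text set only when its
    # support meets the support threshold (claim 3's preset condition).
    return support(candidate_vec, reference_vecs, sim_threshold) >= support_threshold
```

Using two thresholds this way filters out candidates that resemble only an isolated reference template, keeping those consistent with several of them.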
4. The method of claim 1, wherein said constructing a first template representation corresponding to each of said reference structured texts comprises:
aiming at each reference structured text, acquiring a text to be processed which is matched with the reference structured text in the text set to be processed;
and constructing a first template representation corresponding to the reference structured text based on the text to be processed matched with the reference structured text.
5. The method of claim 4, wherein the constructing a first template representation corresponding to the reference structured text based on the text to be processed that matches the reference structured text comprises:
acquiring the page browsing amount of the text to be processed aiming at the text to be processed matched with the reference structured text;
and under the condition that the page browsing amount meets a preset condition, constructing a first template representation corresponding to the reference structured text.
6. The method of claim 4, wherein the constructing a first template representation corresponding to the reference structured text based on the text to be processed that matches the reference structured text comprises:
taking the reference structured text as a first slot unit;
determining a first text unit matched with the reference structured text in the text to be processed matched with the reference structured text;
and replacing the first text unit with the first slot unit in the text to be processed matched with the reference structured text to obtain a first template representation corresponding to the reference structured text.
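Claim 6's slot-replacement step can be sketched as follows: treat the reference structured text as a slot unit, locate the matching text unit in the to-be-processed text, and substitute the slot for it, yielding a template representation. The `[SLOT]` marker and the simple substring match are assumptions for illustration only.

```python
def build_template(text_to_process: str, structured_text: str, slot: str = "[SLOT]") -> str:
    # Replace the first text unit that matches the structured text with
    # the slot unit, turning a concrete text into a reusable template.
    if structured_text not in text_to_process:
        raise ValueError("no matching text unit found")
    return text_to_process.replace(structured_text, slot, 1)
```

For example, `build_template("buy red running shoes online", "red running shoes")` yields `"buy [SLOT] online"`, a template that any other structured text could later fill.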
7. The method according to claim 1, wherein the structuring each text to be processed to obtain a candidate structured text of each text to be processed comprises:
and inputting each text to be processed in the text set to be processed into a trained language model, and performing word segmentation and recombination to obtain a candidate structured text of the text to be processed.
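The segmentation-and-recombination step of claim 7 can be shown with a toy sketch. A real system would use a trained language model for segmentation; here a whitespace tokenizer and an assumed stop-word list stand in as loudly hypothetical placeholders.

```python
STOP_WORDS = {"a", "the", "for", "to", "of"}  # assumed list, for illustration only

def candidate_structured_text(text: str) -> str:
    tokens = text.lower().split()                      # word segmentation (toy)
    kept = [t for t in tokens if t not in STOP_WORDS]  # drop non-content tokens
    return " ".join(kept)                              # recombine into a candidate
```

The output is a compact candidate structured text whose second template representation is then built and matched as in claim 1.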
8. The method of claim 1, wherein said constructing a second template representation to which the candidate structured text corresponds comprises:
aiming at each candidate structured text, acquiring a text to be processed which is matched with the candidate structured text in the text set to be processed;
and constructing a second template representation corresponding to the candidate structured text based on the text to be processed matched with the candidate structured text in the text set to be processed.
9. The method of claim 8, wherein the constructing a second template representation corresponding to the candidate structured text based on the to-be-processed text in the to-be-processed text set that matches the candidate structured text comprises:
acquiring relevant parameters of the text to be processed based on the text to be processed matched with the candidate structured text in the text set to be processed, wherein the relevant parameters comprise page browsing amount and/or transaction records of the text to be processed;
and under the condition that the relevant parameters of the text to be processed meet preset conditions, constructing a second template representation corresponding to the candidate structured text.
10. The method of claim 8, wherein the constructing a second template representation corresponding to the candidate structured text based on the to-be-processed text in the to-be-processed text set that matches the candidate structured text comprises:
taking the candidate structured text as a second slot unit;
acquiring a second text unit matched with the candidate structured text in the text to be processed matched with the candidate structured text;
and replacing the second text unit with the second slot unit in the text to be processed matched with the candidate structured text to obtain a second template representation corresponding to the candidate structured text.
11. The method of any of claims 1 to 10, wherein the obtaining the set of text to be processed and the set of structured text comprises:
acquiring the text set to be processed;
extracting a reference structured text corresponding to the text to be processed by utilizing a trained text processing model based on each text to be processed;
and constructing the structured text set based on the reference structured text.
12. A training method of a text processing model, comprising:
determining a target structured text by using a text sample to be processed;
inputting the text sample to be processed into a text processing model to be trained to obtain a prediction structured text;
and training the text processing model to be trained according to the difference between the target structured text and the predicted structured text until the difference is within an allowable range.
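Claim 12's stopping criterion, training until the difference between the target and predicted structured text is within an allowable range, can be sketched with a generic loop. The one-parameter toy model and the update rule are assumptions for illustration, not the patent's actual text processing model or loss.

```python
def train_until_converged(predict, update, target, tolerance, max_iters=10000):
    """Repeat predict/update until |target - prediction| is within tolerance."""
    for _ in range(max_iters):
        pred = predict()
        diff = abs(target - pred)
        if diff <= tolerance:        # claim 12's "difference within allowable range"
            return pred
        update(pred)                 # adjust the model toward the target
    raise RuntimeError("did not converge within max_iters")

class ToyModel:
    # Hypothetical one-parameter model: a scalar nudged toward the target.
    def __init__(self):
        self.w = 0.0
    def predict(self):
        return self.w
    def update(self, pred, target, lr=0.5):
        self.w += lr * (target - pred)
```

With `lr=0.5` the prediction error halves each step, so the loop reaches any positive tolerance in a bounded number of iterations.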
13. A text processing apparatus comprising:
the text set acquisition module is used for acquiring a text set to be processed and a structured text set; the text set to be processed comprises a plurality of texts to be processed, and the structured text set comprises a plurality of reference structured texts;
the template representation construction module is used for constructing a first template representation corresponding to each reference structured text; and is further used for performing structuring processing on each text to be processed to obtain a candidate structured text of each text to be processed, and constructing a second template representation corresponding to the candidate structured text;
and the matching module is used for matching the first template representation and the second template representation, determining a second template representation corresponding to a template matching result meeting a preset condition, and adding a candidate structured text corresponding to the determined second template representation to the structured text set.
14. The apparatus of claim 13, wherein the template matching result comprises semantic similarity and support;
the matching module includes:
the semantic similarity calculation sub-module is used for calculating, for each candidate structured text, the semantic similarity between a second template representation corresponding to the candidate structured text and each first template representation;
and the support degree calculation sub-module is used for calculating, based on the semantic similarity, the number of first template representations whose semantic similarity meets the semantic similarity threshold, to obtain the support degree of the candidate structured text.
15. The apparatus of claim 14, wherein the matching module further comprises:
and the second template representation determining unit is used for determining the second template representation corresponding to the candidate structured text as the second template representation corresponding to the template matching result meeting the preset condition under the condition that the support degree of the candidate structured text meets the support degree threshold value.
16. The apparatus of claim 13, wherein the template representation construction module comprises:
the text matching sub-module is used for acquiring the texts to be processed which are matched with the reference structured texts in the text set to be processed aiming at each reference structured text;
and the first template representation construction sub-module is used for constructing a first template representation corresponding to the reference structured text based on the text to be processed matched with the reference structured text.
17. The apparatus of claim 16, wherein the first template representation construction sub-module comprises:
the page browsing amount acquisition unit is used for acquiring the page browsing amount of the text to be processed aiming at the text to be processed matched with the reference structured text;
and the first template representation construction unit is used for constructing a first template representation corresponding to the reference structured text under the condition that the page browsing amount meets a preset condition.
18. The apparatus of claim 16, wherein the first template representation construction sub-module comprises:
a first slot unit determining unit, configured to use the reference structured text as a first slot unit;
the first text unit determining unit is used for determining a first text unit matched with the reference structured text in the text to be processed matched with the reference structured text;
and the first template representation construction unit is used for replacing the first text unit with the first slot unit in the text to be processed matched with the reference structured text to obtain a first template representation corresponding to the reference structured text.
19. The apparatus of claim 13, wherein the template representation construction module comprises:
and the candidate structured text construction sub-module is used for inputting each text to be processed in the text set to be processed into the trained language model, and performing word segmentation and recombination processing to obtain a candidate structured text of the text to be processed.
20. The apparatus of claim 13, wherein the template representation construction module comprises:
the text matching sub-module is used for acquiring the texts to be processed which are matched with the candidate structured texts in the text set to be processed aiming at each candidate structured text;
and the second template representation construction sub-module is used for constructing a second template representation corresponding to the candidate structured text based on the text to be processed matched with the candidate structured text in the text set to be processed.
21. The apparatus of claim 20, wherein the second template representation construction sub-module comprises:
the parameter acquisition unit is used for acquiring relevant parameters of the text to be processed based on the text to be processed matched with the candidate structured text in the text set to be processed, wherein the relevant parameters comprise the page browsing amount and/or the transaction record of the text to be processed;
and the second template representation construction unit is used for constructing a second template representation corresponding to the candidate structured text under the condition that the relevant parameters of the text to be processed meet preset conditions.
22. The apparatus of claim 20, wherein the second template representation construction sub-module comprises:
a second slot unit determining unit, configured to use the candidate structured text as a second slot unit;
the second text unit determining unit is used for acquiring a second text unit matched with the candidate structured text in the text to be processed matched with the candidate structured text;
and the second template representation construction unit is used for replacing the second text unit with the second slot unit in the text to be processed matched with the candidate structured text to obtain a second template representation corresponding to the candidate structured text.
23. The apparatus of any of claims 13 to 22, wherein the text set acquisition module comprises:
the to-be-processed text set acquisition submodule is used for acquiring the to-be-processed text set;
the reference structured text extraction sub-module is used for extracting a reference structured text corresponding to the text to be processed by utilizing a trained text processing model based on each text to be processed;
and the structured text set constructing sub-module is used for constructing the structured text set based on the reference structured text.
24. A training apparatus for a text processing model, comprising:
the target structured text determining module is used for determining a target structured text by utilizing the text sample to be processed;
the prediction structured text acquisition module is used for inputting the text sample to be processed into a text processing model to be trained to obtain a prediction structured text;
and the training module is used for training the text processing model to be trained according to the difference between the target structured text and the prediction structured text until the difference is within an allowable range.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-12.
CN202110507254.2A 2021-05-10 2021-05-10 Text processing method, device, equipment and storage medium Active CN113205384B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110507254.2A CN113205384B (en) 2021-05-10 2021-05-10 Text processing method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113205384A (en) 2021-08-03
CN113205384B CN113205384B (en) 2024-02-06

Family

ID=77030608

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110507254.2A Active CN113205384B (en) 2021-05-10 2021-05-10 Text processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113205384B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2002087135A2 (en) * 2001-04-25 2002-10-31 Novarra, Inc. System and method for adapting information content for an electronic device
CN103761337A (en) * 2014-02-18 2014-04-30 上海锦恩信息科技有限公司 Method and system for processing unstructured data
WO2018108059A1 (en) * 2016-12-12 2018-06-21 Tencent Technology (Shenzhen) Co Ltd Method and device for processing template data, requesting for template data and presenting template data
CN109740126A (en) * 2019-01-04 2019-05-10 平安科技(深圳)有限公司 Text matching technique, device and storage medium, computer equipment
CN111709248A (en) * 2020-05-28 2020-09-25 北京百度网讯科技有限公司 Training method and device of text generation model and electronic equipment
US20210073257A1 (en) * 2019-09-09 2021-03-11 Syntexys Inc. Logical document structure identification


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZHAO YUE: "A semi-supervised structured Markov psychological model of second-language learning", Bulletin of Science and Technology, no. 09 *
CHEN JUNYAN; TAO FEIFAN; ZHANG YUAN: "A structured extraction method for vulnerability information based on sequence labeling", Computer Applications and Software, no. 02 *
MA NING; LI YACHAO; YU HUAI; JIA YANGJI: "Research on Internet-oriented template acquisition for Tibetan entity relations", Journal of Minzu University of China (Natural Science Edition), no. 01 *

Also Published As

Publication number Publication date
CN113205384B (en) 2024-02-06

Similar Documents

Publication Publication Date Title
US11544459B2 (en) Method and apparatus for determining feature words and server
CN112560501B (en) Semantic feature generation method, model training method, device, equipment and medium
Zeng et al. Domain-specific Chinese word segmentation using suffix tree and mutual information
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
KR20160121382A (en) Text mining system and tool
CN113836314B (en) Knowledge graph construction method, device, equipment and storage medium
CN110874532A (en) Method and device for extracting keywords of feedback information
JP2018010514A (en) Parallel translation dictionary creation device, parallel translation dictionary creation method, and parallel translation dictionary creation program
Avasthi et al. Processing large text corpus using N-gram language modeling and smoothing
CN114579104A (en) Data analysis scene generation method, device, equipment and storage medium
CN114692628A (en) Sample generation method, model training method, text extraction method and text extraction device
CN112989235A (en) Knowledge base-based internal link construction method, device, equipment and storage medium
CN112560425B (en) Template generation method and device, electronic equipment and storage medium
CN116383412A (en) Functional point amplification method and system based on knowledge graph
CN114818736B (en) Text processing method, chain finger method and device for short text and storage medium
US9104755B2 (en) Ontology enhancement method and system
CN113204613B (en) Address generation method, device, equipment and storage medium
CN113205384B (en) Text processing method, device, equipment and storage medium
CN115292506A (en) Knowledge graph ontology construction method and device applied to office field
CN114417862A (en) Text matching method, and training method and device of text matching model
CN112926297A (en) Method, apparatus, device and storage medium for processing information
CN113033205A (en) Entity linking method, device, equipment and storage medium
CN113377922B (en) Method, device, electronic equipment and medium for matching information
CN113377921B (en) Method, device, electronic equipment and medium for matching information
CN114706956A (en) Classification information obtaining method, classification information obtaining device, classification information classifying device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant