CN114328800A - Text processing method and device, electronic equipment and computer readable storage medium - Google Patents


Publication number
CN114328800A
Authority
CN
China
Prior art keywords
text
sample
word
keyword
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111351720.9A
Other languages
Chinese (zh)
Inventor
陈震鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111351720.9A priority Critical patent/CN114328800A/en
Publication of CN114328800A publication Critical patent/CN114328800A/en
Pending legal-status Critical Current

Abstract

The embodiment of the invention discloses a text processing method and device, electronic equipment and a computer readable storage medium. After a text word sample and a text sample pair are obtained, a preset text processing model is adopted to segment the text samples in the text sample pair into words, and feature extraction is performed on the segmented target text words and the text word sample to obtain text word features and text word sample features. Keyword category recognition is performed on the target text words and the text word sample based on the text word features and the text word sample features. The text word features are then weighted according to the recognized first keyword category to obtain the text features of each text sample, and the feature distances between the text features are calculated. The preset text processing model is then converged based on the recognized second keyword category, the labeled keyword category, the feature distances and the labeled semantic matching relationship, and the trained text processing model is adopted to retrieve a target text. The scheme can improve the accuracy of text processing.

Description

Text processing method and device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a text processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In recent years, with the rapid development of internet technology, a large amount of text has appeared on the network, and a desired target text needs to be retrieved from this large amount of text. In the process of retrieving these texts, the texts often need to be processed so as to realize online retrieval. The existing text processing method usually extracts text features of texts through a Bert model (used in a double-tower structure), and retrieves the target text from the massive texts according to the text features.
In the process of research and practice of the prior art, the inventor of the present invention finds that the text features extracted through a Bert model often contain some information noise, so the accuracy of the extracted text features is insufficient, and therefore the accuracy of text processing is insufficient.
Disclosure of Invention
The embodiment of the invention provides a text processing method and device, electronic equipment and a computer readable storage medium, which can improve the accuracy of text processing.
A text processing method, comprising:
acquiring a text word sample and a text sample pair, wherein the text word sample comprises a text word labeled with a keyword category, and the text sample pair comprises a text pair labeled with a semantic matching relationship;
segmenting words of text samples in the text sample pairs by adopting a preset text processing model, and extracting characteristics of segmented target text words and text word samples to obtain text word characteristics of the target text words and text word sample characteristics of the text word samples;
performing keyword category identification on the target text word and the text word sample based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text word and a second keyword category of the text word sample;
according to the first keyword category, weighting the text word characteristics to obtain the text characteristics of each text sample in the text sample pair, and calculating the characteristic distance between the text characteristics;
and converging a preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieving a target text by adopting the trained text processing model.
Correspondingly, an embodiment of the present invention provides a text processing apparatus, including:
the acquisition unit is used for acquiring a text word sample and a text sample pair, wherein the text word sample comprises a text word labeled with a keyword category, and the text sample pair comprises a text pair labeled with a semantic matching relationship;
the word segmentation unit is used for performing word segmentation on a text sample in the text sample pair by adopting a preset text processing model, and performing feature extraction on a segmented target text word and a text word sample to obtain a text word feature of the target text word and a text word sample feature of the text word sample;
the recognition unit is used for carrying out keyword category recognition on the target text words and the text word samples based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text words and a second keyword category of the text word samples;
the weighting unit is used for weighting the text word features according to the first keyword category to obtain the text features of each text sample in the text sample pair, and calculating the feature distance between the text features;
and the retrieval unit is used for converging a preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieving the target text by adopting the trained text processing model.
Optionally, in some embodiments, the weighting unit may be specifically configured to determine a text weight of the text word feature according to the first keyword category; and weighting the text word features based on the text weight, and fusing the weighted text word features to obtain the text features of each text sample in the text sample pair.
Optionally, in some embodiments, the weighting unit may be specifically configured to identify a category probability of each keyword category in the first keyword category to obtain a first category probability; screening out the category probability of at least one preset keyword category from the first category probability to obtain a basic category probability; and fusing the basic category probability to obtain the text weight of the text word characteristics.
Optionally, in some embodiments, the weighting unit may be specifically configured to fuse the weighted text word features to obtain fused text word features; extracting query text features corresponding to the query text samples and at least one field text feature corresponding to the target text sample from the fused text features; and fusing the field text features to obtain target field text features, and taking the target field text features and the query text features as the text features of each text sample in the text sample pair.
Optionally, in some embodiments, the weighting unit may be specifically configured to perform association feature extraction on the field text feature to obtain an association feature of the field text feature; determining an association weight of the field text features based on the association features, wherein the association weight is used for indicating an association relation between the field text features; and weighting the field text features according to the association weight, and fusing the weighted field text features to obtain target field text features.
Optionally, in some embodiments, the retrieving unit may be specifically configured to determine keyword loss information of the text word sample based on the second keyword category and the labeled keyword category; determining text loss information of the text sample pair according to the labeled semantic matching relation and the characteristic distance; and converging the preset text processing model based on the keyword loss information and the text loss information to obtain a trained text processing model.
Optionally, in some embodiments, the retrieving unit may be specifically configured to identify a category probability of each keyword category in the second keyword category to obtain a second category probability; screening out category probabilities corresponding to the labeled keyword categories from the second category probabilities to obtain target category probabilities; and fusing the target category probability and the labeled keyword category, and calculating the mean value of the fused keyword category to obtain the keyword loss information of the text word sample.
Optionally, in some embodiments, the retrieving unit may be specifically configured to determine a matching parameter of the text sample pair according to the labeled semantic matching relationship; and when the matching parameters are preset matching parameters and the characteristic distance is smaller than a preset distance threshold, fusing the matching parameters and the characteristic distance to obtain text loss information of the text sample pair.
Optionally, in some embodiments, the retrieving unit may be specifically configured to calculate a distance difference between the feature distance and the preset distance threshold; calculating a parameter difference value between the matching parameter and a preset parameter threshold value, and fusing the distance difference value and the parameter difference value; and fusing the fused difference, the matching parameters and the characteristic distance to obtain text loss information of the text sample.
Optionally, in some embodiments, the retrieving unit may be specifically configured to obtain a loss weight, and weight the keyword loss information and the text loss information based on the loss weight respectively; fusing the weighted keyword loss information and the weighted text loss information to obtain target loss information; adopting the weighted keyword loss information to converge the keyword recognition network to obtain a trained keyword recognition network; and converging the feature extraction network by adopting target loss information to obtain a trained feature extraction network, and taking the trained keyword recognition network and the trained feature extraction network as a trained text processing model.
Optionally, in some embodiments, the recognition unit may be specifically configured to perform normalization processing on the text word features and the text word sample features respectively by using the keyword recognition network; mapping the category probability of the target text word belonging to each keyword category according to the normalized text word characteristics to obtain a first keyword category of the target text word; and mapping the category probability of the text word sample belonging to each keyword category based on the normalized text word sample characteristics to obtain a second keyword category of the text word sample.
Optionally, in some embodiments, the retrieval unit may be specifically configured to obtain a candidate text set, and perform feature extraction on each candidate text in the candidate text set by using the trained text processing model to obtain a candidate text feature set; constructing index information corresponding to the candidate text feature set according to the candidate text features in the candidate text feature set; and when a query text is received, screening at least one candidate text in the candidate text set as a target text according to the index information and the query text.
Optionally, in some embodiments, the retrieval unit may be specifically configured to perform feature extraction on the query text by using the trained text processing model to obtain a query text feature of the query text; based on the index information, at least one candidate text feature corresponding to the query text feature is retrieved from the candidate text feature set to obtain a target candidate text feature; and screening out candidate texts corresponding to the target candidate text characteristics from the candidate text set to obtain target texts corresponding to the query texts.
Optionally, in some embodiments, the obtaining unit may be specifically configured to obtain a text sample set, and screen at least one text sample and a semantic text sample corresponding to the text sample from the text sample set, where the semantic text sample is a text sample having a semantic relationship with the text sample; performing word segmentation on the text sample by adopting the preset text processing model, and marking a keyword category in the text word after word segmentation to obtain a text word sample; and according to the semantic relation between the text sample and the semantic text sample, labeling a semantic matching relation in a text pair consisting of the text sample and the semantic text sample to obtain a text sample pair.
In addition, an electronic device is further provided in an embodiment of the present invention, and includes a processor and a memory, where the memory stores an application program, and the processor is configured to run the application program in the memory to implement the text processing method provided in the embodiment of the present invention.
In addition, the embodiment of the present invention further provides a computer-readable storage medium, where a plurality of instructions are stored, and the instructions are suitable for being loaded by a processor to perform the steps in any one of the text processing methods provided by the embodiment of the present invention.
After a text word sample and a text sample pair are obtained, a preset text processing model is adopted to perform word segmentation on the text samples in the text sample pair, and feature extraction is performed on the segmented target text words and the text word sample to obtain the text word features of the target text words and the text word sample features of the text word sample. Keyword category recognition is then performed on the target text words and the text word sample based on the text word features and the text word sample features to obtain a first keyword category of the target text words and a second keyword category of the text word sample. The text word features are weighted according to the first keyword category to obtain the text features of each text sample in the text sample pair, and the feature distances among the text features are calculated. The preset text processing model is then converged based on the second keyword category, the labeled keyword category, the feature distances and the labeled semantic matching relationship to obtain a trained text processing model, and the trained text processing model is adopted to retrieve a target text. According to the scheme, the keyword category recognition task and the semantic matching task are trained simultaneously through a multi-task framework: the first keyword category is recognized and used to weight the text word features, and the word weight recognition capability of the text processing model in the semantic matching task is enhanced in an explicit manner, so that information noise is effectively reduced and the accuracy of text processing can be improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a scene schematic diagram of a text processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a text processing method according to an embodiment of the present invention;
FIG. 3 is a retrieval diagram of text retrieval according to an embodiment of the present invention;
FIG. 4 is a diagram of a core framework of a text processing flow provided by an embodiment of the present invention;
FIG. 5 is a diagram of a multi-task learning framework in a text processing flow provided by an embodiment of the present invention;
FIG. 6 is a flowchart illustrating a keyword recognition task according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating a semantic matching task provided by an embodiment of the present invention;
FIG. 8 is a schematic flow chart of a text processing method according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a text processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a text processing method and device, electronic equipment and a computer readable storage medium. The text processing apparatus may be integrated in an electronic device, and the electronic device may be a server or a terminal.
The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN) acceleration, big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
For example, referring to fig. 1, taking as an example that the text processing apparatus is integrated in an electronic device: after obtaining a text word sample and a text sample pair, the electronic device performs word segmentation on the text samples in the text sample pair by using a preset text processing model, and performs feature extraction on the segmented target text words and the text word sample to obtain the text word features of the target text words and the text word sample features of the text word sample. It then performs keyword category recognition on the target text words and the text word sample based on the text word features and the text word sample features to obtain a first keyword category of the target text words and a second keyword category of the text word sample, weights the text word features according to the first keyword category to obtain the text features of each text sample in the text sample pair, and calculates the feature distances between the text features. The electronic device then converges the preset text processing model based on the second keyword category, the labeled keyword category, the feature distances and the labeled semantic matching relationship to obtain a trained text processing model, and retrieves the target text by using the trained text processing model, so as to improve the accuracy of text processing.
The following are detailed below. It should be noted that the following description of the embodiments is not intended to limit the preferred order of the embodiments.
The embodiment will be described from the perspective of a text processing apparatus, where the text processing apparatus may be specifically integrated in an electronic device, and the electronic device may be a server or a terminal; the terminal may include a tablet Computer, a notebook Computer, a Personal Computer (PC), a wearable device, a virtual reality device, or other intelligent devices capable of performing text processing.
A text processing method, comprising:
obtaining a text word sample and a text sample pair, wherein the text word sample comprises a text word labeled with a keyword category, the text sample pair comprises a text pair labeled with a semantic matching relationship, segmenting the text sample in the text sample pair by adopting a preset text processing model, extracting the characteristics of a segmented target text word and the text word sample to obtain the text word characteristics of the target text word and the text word sample characteristics of the text word sample, identifying the keyword categories of the target text word and the text word sample based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text word and a second keyword category of the text word sample, weighting the text word characteristics according to the first keyword category to obtain the text characteristics of each text sample in the text sample pair, and calculating the characteristic distance between the text characteristics, and converging the preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieving the target text by adopting the trained text processing model.
As shown in fig. 2, the specific flow of the text processing method is as follows:
101. a text word sample and text sample pair is obtained.
The text word sample comprises text words labeled with keyword categories. A keyword category indicates which category of keyword a text word belongs to, and there may be multiple keyword categories; for example, the keyword categories can be divided into three categories: non-keywords, general keywords and important keywords. The three categories can be represented by different identifiers, for example, non-keywords by 0, general keywords by 1 and important keywords by 2; other identifiers can also be used, but the identifiers of different keyword categories are different. For example, if the text sequence is "nurse-moon-sao appointment, customized screening professional profile", the corresponding keyword category labeling result may be "nurse/2, month/2, sao/2, pre/1, about/1, custom/0, system/0, screen/0, select/0, special/0, industry/0, lean/0, and profile/0".
The text sample pair comprises a text pair marked with a semantic matching relationship, wherein the semantic matching relationship can be semantic correlation between two sections of texts in the text pair, and if the correlation is high, the two sections of texts are considered to be matched. Many natural language processing tasks may be converted to semantic matching questions, for example, web searching may be abstracted as a user's query text (query) to web content correlation matching questions, auto-questioning may be abstracted as a question of satisfaction of the question with candidate answers, text deduplication may be abstracted as a question of similarity between text and text.
The method for obtaining the text word sample and the text sample pair may be various, and specifically may be as follows:
for example, a text sample set can be obtained, at least one text sample and a semantic text sample corresponding to the text sample are screened from the text sample set, a preset text processing model is adopted to perform word segmentation on the text sample, a keyword category is marked in the segmented text word to obtain a text word sample, and a semantic matching relationship is marked in a text pair formed by the text sample and the semantic text sample according to the semantic relationship between the text sample and the semantic text sample to obtain a text sample pair.
The semantic text sample is a text sample having a semantic relationship with a text sample, where the semantic relationship is either semantic matching or semantic mismatching. The semantic text sample corresponding to a text sample can be screened out in various ways. For example, a retrieval system can screen out, from the text sample set, a text sample that semantically matches the text sample as the semantic text sample; in this case the semantic relationship between the screened-out semantic text sample and the text sample is semantic matching, and the resulting text sample pair is a positive text sample pair. Alternatively, a text sample in an offline library can be randomly extracted as the semantic text sample corresponding to the text sample; in this case the semantic relationship between the extracted semantic text sample and the text sample is semantic mismatching, and the resulting text sample pair is a negative text sample pair.
For example, a token segmentation method of the Bert network in the preset text processing model may be adopted to segment each text sample into individual tokens, where a token refers to a single Chinese character, an English word, a word root, or the like. The keyword category of a text word can then be identified directly and labeled on the text word according to the identification result; alternatively, the segmented text words can be sent to a labeling server, the keyword categories of the text words returned by the labeling server are received, and the corresponding keyword categories are labeled on the text words, so as to obtain the text word samples.
For example, the semantic matching relationship between the text sample and the semantic text sample can be determined according to the semantic relationship between them, the text sample and the corresponding semantic text sample form a text pair, and the semantic matching relationship is labeled in the text pair, so as to obtain a text sample pair. The text sample pairs may include positive text sample pairs and negative text sample pairs: the text sample in a positive text sample pair matches its semantic text sample, and the text sample in a negative text sample pair does not match its semantic text sample. In addition, when the text sample is a query text sample, the corresponding semantic text sample may be a target text sample.
102. And performing word segmentation on the text samples in the text sample pair by adopting a preset text processing model, and performing feature extraction on the segmented target text words and the text word samples to obtain the text word features of the target text words and the text word sample features of the text word samples.
The method for segmenting the text samples in the text sample pairs by adopting the preset text processing model can be various, and specifically can be as follows:
for example, each text sample may be segmented into a single token by using a token segmentation method in a Bert network in a preset text processing model, where the token may be a single Chinese word, an English word or a root word, and the like, so as to obtain a target text word after each text sample in a text sample pair is segmented, or each text sample in a text sample pair may be directly subjected to character segmentation, and segmented into a single Chinese word, an English word or a root word, so as to obtain a target text word after each text sample in a text sample pair is segmented.
When the text sample in the text sample pair is a query text sample, the corresponding semantic text sample can be a target text sample. The target text sample can first be split into a plurality of fields, and each field is then segmented into words, so as to obtain the target text words of each field after word segmentation.
After the text samples in the text sample pair are segmented, feature extraction can be performed on the segmented target text words and text word samples. The feature extraction can be done in various ways: for example, the Bert network in the preset text processing model, or a model such as XLNet or ELECTRA, can be used to perform feature extraction on the target text words and the text word samples respectively, so as to obtain a vector representation (q1, q2, q3) of each target text word and a vector representation (t1, t2, t3) of each text word sample. The vector representation of the target text word is taken as the text word feature of the target text word, and the vector representation of the text word sample is taken as the text word sample feature of the text word sample.
When the characteristics of the target text words and the text word samples are extracted, the network parameters of the Bert network can be shared.
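As a concrete illustration of this step, the following is a minimal sketch of token-level feature extraction with a shared Bert encoder, assuming the Hugging Face transformers library and a Chinese Bert checkpoint; the checkpoint name and the helper function encode_tokens are illustrative, not part of the original description.

# Sketch only: token segmentation plus per-token feature extraction with a shared Bert encoder.
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")   # shared by the query and doc branches

def encode_tokens(text: str):
    """Segment the text into tokens and return one feature vector per token."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = encoder(**inputs)
    # last_hidden_state: [1, seq_len, hidden]; one vector (q_i or t_i) per token
    return inputs, outputs.last_hidden_state.squeeze(0)

query_inputs, query_token_feats = encode_tokens("nurse appointment")          # target text words
doc_inputs, doc_token_feats = encode_tokens("professional nurse service")     # text word sample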
103. And performing keyword category identification on the target text word and the text word sample based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text word and a second keyword category of the text word sample.
For example, a keyword recognition network in a preset text processing model may be used to respectively perform normalization processing on text word features and text word sample features, and according to the normalized text word features, a category probability that a target text word belongs to each keyword category is mapped to obtain a first keyword category of the target text word, and based on the normalized text word sample features, a category probability that a text word sample belongs to each keyword category is mapped to obtain a second keyword category of the text word sample.
The text word features and the text word sample features may be normalized in various ways, for example, the text word features and the text word sample features may be normalized by using a Fully-Connected neural network (FC).
After the text word features are normalized, the first keyword category of the target text word can be calculated based on the normalized text word features. The first keyword category can be calculated in various ways, for example, by using a Softmax function to calculate the category probability that each target text word belongs to each keyword category. Taking three keyword categories as an example, the calculated category probabilities are p_qi^0, p_qi^1 and p_qi^2, where p_qi^0 is the probability that the i-th target text word belongs to the category-0 keywords, p_qi^1 is the probability that the i-th target text word belongs to the category-1 keywords, and p_qi^2 is the probability that the i-th target text word belongs to the category-2 keywords. The probabilities (p_qi^0, p_qi^1, p_qi^2) are taken as the first keyword category of the target text word.
After the text word sample features are normalized, the second keyword category of the text word sample can be calculated based on the normalized text word sample features. There are various ways to calculate the second keyword category, for example, using a Softmax function to calculate the category probability that each text word sample belongs to each keyword category. Taking three keyword categories as an example, the calculated category probabilities are: the probability that the i-th text word sample belongs to category 0 is p_ti^0, the probability that it belongs to category 1 is p_ti^1, and the probability that it belongs to category 2 is p_ti^2. The probabilities (p_ti^0, p_ti^1, p_ti^2) are taken as the second keyword category of the text word sample.
The keyword category recognition of the target text words and the text word samples is performed by the keyword recognition network in the preset text processing model. The network structure of the keyword recognition network can be an FC-Softmax network, and in the process of recognizing the keyword categories of the target text words and the text word samples, the FC-Softmax network and its network parameters are shared.
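A possible form of this FC-Softmax keyword recognition network, sketched in PyTorch under the assumption of three keyword categories (0 non-keyword, 1 general keyword, 2 important keyword); the class name KeywordHead and the hidden size are illustrative.

import torch
import torch.nn as nn

class KeywordHead(nn.Module):
    """FC-Softmax sketch: map each token feature to three keyword-category probabilities."""
    def __init__(self, hidden_size: int = 768, num_categories: int = 3):
        super().__init__()
        self.fc = nn.Linear(hidden_size, num_categories)

    def forward(self, token_feats: torch.Tensor) -> torch.Tensor:
        # token_feats: [seq_len, hidden] -> [seq_len, 3] probabilities (p^0, p^1, p^2)
        return torch.softmax(self.fc(token_feats), dim=-1)

keyword_head = KeywordHead()                                   # shared between both branches
first_keyword_category = keyword_head(torch.randn(12, 768))    # per-token category probabilities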
104. And according to the first keyword category, weighting the text word characteristics to obtain the text characteristics of each text sample in the text sample pair, and calculating the characteristic distance between the text characteristics.
For example, the text weight of the text word features may be determined according to the first keyword category, the text word features are weighted based on the text weight, the weighted text word features are fused to obtain the text features of each text sample in the text sample pair, and the feature distance between the text features is calculated; specifically, this may be as follows:
and S1, determining the text weight of the text word feature according to the first keyword category.
The text weight is used for indicating the importance degree of the target text word corresponding to the text word characteristic in each target text word in the text sample pair. Based on the text weight, the text features of each text sample in the text sample pair can be more accurately represented.
The text weight determining method for determining the text word features according to the first keyword category may be various, and specifically may be as follows:
for example, the category probability of each keyword category is identified in the first keyword category to obtain a first category probability, the category probability of at least one preset keyword category is screened out from the first category probability to obtain a basic category probability, and the basic category probabilities are fused to obtain the text weight of the text word features.
The first category probabilities may be p_qi^0, p_qi^1 and p_qi^2. There are various ways to screen out the category probability of at least one preset keyword category from the first category probabilities: for example, the category probabilities whose keyword categories are general keywords and important keywords can be screened out of the first category probabilities to obtain the basic category probabilities. Taking category 0 as non-keywords, category 1 as general keywords and category 2 as important keywords as an example, p_qi^1 and p_qi^2 are screened out of (p_qi^0, p_qi^1, p_qi^2) as the basic category probabilities.
After the basic category probabilities are screened out, they may be fused to obtain the text weight of the text word feature. There are multiple ways of fusion: for example, a fusion parameter of each basic category probability is obtained, each fusion parameter is fused with the corresponding basic category probability, the fused basic category probabilities are added to obtain a target basic category probability, and the mean value of the target basic category probability is calculated, giving the text weight of the text word feature as shown in formula (1):

w_qi = (λ1 · p_qi^1 + λ2 · p_qi^2) / 2    (1)

where w_qi is the text weight of the i-th text word feature, p_qi^1 and p_qi^2 are the basic category probabilities, and λ1 and λ2 are the corresponding fusion parameters.
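Formula (1) can be expressed directly in code; the sketch below assumes the three-category probabilities produced above and treats the fusion parameters lambda1 and lambda2 as configurable values, since their exact values are not fixed by the description.

import torch

def text_weights(token_probs: torch.Tensor, lambda1: float = 1.0, lambda2: float = 1.0) -> torch.Tensor:
    """Formula (1) sketch: fuse the general-keyword (class 1) and important-keyword (class 2)
    probabilities of each token into a scalar text weight w_qi."""
    p1, p2 = token_probs[:, 1], token_probs[:, 2]    # basic category probabilities
    return (lambda1 * p1 + lambda2 * p2) / 2.0

probs = torch.softmax(torch.randn(6, 3), dim=-1)     # dummy per-token category probabilities
w = text_weights(probs)                              # one text weight per token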
S2, based on the text weight, the text word features are weighted, and the weighted text word features are fused to obtain the text features of each text sample in the text sample pair.
The text sample pair comprises a query text sample and a target text sample, wherein the query text sample can be a text sample, and the target text sample can be a semantic text sample corresponding to the text sample.
The text word features are weighted based on the text weights to obtain weighted text word features, and the weighted text word features can be fused in various ways, specifically as follows:
For example, the text word features are weighted based on the text weights, and the weighted text word features are summed to obtain the fused text word features, vec_q = Σ_i w_qi · q_i. Query text features corresponding to the query text sample and at least one field text feature corresponding to the target text sample are extracted from the fused text features, the field text features are fused to obtain the target field text feature, and the target field text feature and the query text feature are taken as the text features of each text sample in the text sample pair.
The method for extracting the query text features corresponding to the query text sample and the at least one field text feature corresponding to the target text sample from the fused text features may be various, for example, the text features belonging to the query text sample may be screened out from the fused text features, so that the query text features corresponding to the query text sample may be obtained, and the text features belonging to each field of the target text sample may be screened out from the fused text features, so that the at least one field text feature may be obtained.
After the field text features are screened out, the field text features can be fused to obtain target field text features, various ways of fusing the field text features can be provided, for example, associated feature extraction is performed on the field text features to obtain associated features of the field text features, associated weights of the field text features are determined based on the associated features, the associated weights are used for indicating associated relations among the field text features, the field text features are weighted according to the associated weights, the weighted field text features are fused to obtain target field text features, and the target field text features can be understood as text features corresponding to target text samples.
The association weight may be determined in various ways: for example, an Attention network may be adopted to extract the association features of the field text features, and the association weight of each field text feature is then calculated based on the association features.
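A minimal sketch of this attention-based fusion of field text features, in PyTorch; the single-layer scorer is one simple way to realize the association-weight idea described above, not necessarily the exact network used.

import torch
import torch.nn as nn

class FieldAttentionPool(nn.Module):
    """Sketch: score each field text feature, normalise the scores into association
    weights, and return the weighted sum as the target field text feature."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)      # association feature -> scalar score

    def forward(self, field_feats: torch.Tensor) -> torch.Tensor:
        # field_feats: [num_fields, hidden]
        weights = torch.softmax(self.scorer(field_feats).squeeze(-1), dim=0)   # association weights
        return (weights.unsqueeze(-1) * field_feats).sum(dim=0)                # target field text feature

pool = FieldAttentionPool()
target_field_feat = pool(torch.randn(4, 768))        # e.g. four fields of a target text sample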
And S3, calculating the feature distance between the text features.
The feature distance is used to indicate a semantic matching relationship between text samples corresponding to text features, and the type of the feature distance may be various, and for example, the feature distance may include a plurality of distance forms such as a euclidean distance or a cosine distance.
The method for calculating the feature distance between the text features may be various, and specifically may be as follows:
for example, cosine distances between text features may be directly calculated to obtain feature distances between text features, or euclidean distances between text features may be calculated to obtain feature distances between text features.
105. And converging the preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieving the target text by adopting the trained text processing model.
For example, the keyword loss information of the text word sample may be determined based on the second keyword class and the labeled keyword class, the text loss information of the text sample pair may be determined according to the labeled semantic matching relationship and the characteristic distance, the preset text processing model may be converged based on the keyword loss information and the text loss information to obtain a trained text processing model, and the trained text processing model is used to retrieve the target text, which may specifically be as follows:
and C1, determining the keyword loss information of the text word sample based on the second keyword category and the labeled keyword category.
The keyword loss information may be loss information generated by a preset text processing model in a keyword category identification task.
Based on the second keyword category and the labeled keyword category, there may be various ways of determining the keyword loss information of the text word sample, and specifically the following ways may be used:
for example, the category probability of each keyword category is identified in the second keyword categories to obtain second category probabilities, the category probabilities corresponding to the labeled keyword categories are screened from the second category probabilities to obtain target category probabilities, the target category probabilities and the labeled keyword categories are fused, the mean value of the fused keyword categories is calculated, and keyword loss information of the text word sample is obtained.
Taking three keyword categories as an example, the second category probabilities identified in the second keyword category may be p_ti^0, p_ti^1 and p_ti^2. There are various ways to screen the target category probability out of the second category probabilities: for example, when the labeled keyword category of the i-th text word sample is category 1, p_ti^1 is screened out of (p_ti^0, p_ti^1, p_ti^2) as the target category probability.
After the target category probability is screened out, the target category probability and the labeled keyword category can be fused in various ways. For example, a keyword category parameter of the text word sample can be determined according to the labeled keyword category: when the i-th text word sample belongs to the category-c keywords, the keyword parameter y_ti^c is 1, and when it does not belong to the category-c keywords, the keyword parameter is 0. The target category probability is preprocessed (its logarithm is taken, as in a cross-entropy loss), the preprocessed target category probability is multiplied by the keyword parameter to obtain the basic keyword loss of the i-th text word sample, the basic keyword losses of the text word samples are accumulated, and the mean value of the accumulated loss is calculated, so as to obtain the keyword loss information of the text word samples, as shown in formula (2):

Loss_keyword = -(1/N) · Σ_i Σ_c y_ti^c · log(p_ti^c)    (2)

where Loss_keyword is the keyword loss information of the text word samples, y_ti^c is the labeled keyword category indicator (1 when the i-th text word sample is labeled as category c, and 0 otherwise), p_ti^c is the target category probability (the category probability of belonging to category c), and N is the number of text word samples.
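The cross-entropy form of formula (2) can be written compactly; the sketch below assumes per-token softmax probabilities and integer labels (0/1/2) and is one straightforward reading of the description.

import torch

def keyword_loss(token_probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Formula (2) sketch: average cross-entropy over N text word samples.
    token_probs: [N, 3] softmax outputs; labels: [N] labeled keyword categories."""
    target_probs = token_probs.gather(1, labels.unsqueeze(1)).squeeze(1)   # p_ti^c for the labeled c
    return -torch.log(target_probs + 1e-12).mean()

probs = torch.softmax(torch.randn(5, 3), dim=-1)
labels = torch.tensor([2, 2, 1, 0, 0])
loss_keyword = keyword_loss(probs, labels)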
And C2, determining text loss information of the text sample pairs according to the labeled semantic matching relationship and the characteristic distance.
The text loss information may be loss information generated by a preset text processing model in a semantic matching task. The semantic matching task can be understood to calculate semantic matching relationships between text samples in a text sample pair.
The text loss information of the text sample pair can be determined in various ways according to the labeled semantic matching relationship and the characteristic distance, and the ways can be specifically as follows:
for example, according to the labeled semantic matching relationship, determining matching parameters of the text sample pair, and when the matching parameters are preset matching parameters and the characteristic distance is smaller than a preset distance threshold, fusing the matching parameters and the characteristic distance to obtain text loss information of the text sample pair.
For example, when the semantic matching relationship of the text samples in the text sample pair is matching, the corresponding matching parameter may be 1, and when the semantic matching relationship is not matching, the corresponding matching parameter may be 0; of course, the matching parameter may also take other values, but different semantic matching relationships correspond to different matching parameters. Taking the matching parameter as 0 or 1 as an example, when the matching parameter is 0 and the characteristic distance is greater than the preset distance threshold, the text loss information of the text sample pair may be 0; when the matching parameter is 1 and the characteristic distance is less than the preset distance threshold, text loss information exists for the text sample pair. Therefore, the condition under which the text sample pair has text loss information is that the matching parameter is the preset matching parameter and the characteristic distance is smaller than the preset distance threshold.
When the matching parameter is the preset matching parameter and the characteristic distance is smaller than the preset distance threshold, the matching parameter and the characteristic distance can be fused in various ways. For example, the distance difference between the characteristic distance and the preset distance threshold is calculated, the parameter difference between the matching parameter and a preset parameter threshold is calculated, the distance difference and the parameter difference are fused, and the fused difference, the matching parameter and the characteristic distance are then fused to obtain the text loss information of the text sample pair, as shown in formula (3):

Loss_match = (1 / 2N) · Σ [ y · d² + (1 − y) · max(margin − d, 0)² ]    (3)

where Loss_match is the text loss information of the text sample pairs, N is the number of text sample pairs, y is the matching parameter, d is the characteristic distance, and margin is a hyperparameter indicating the preset distance threshold. The loss function used to calculate the text loss information is a contrastive loss function, which focuses on learning the parameters from related samples and ignores unrelated samples whose distance exceeds the margin; this alleviates the under-recall problem well, and using the cosine distance makes it convenient for the online retrieval module to calculate similarity.
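A sketch of this contrastive loss in PyTorch, written in the standard form that matches formula (3); the margin value shown is only a placeholder.

import torch

def contrastive_loss(d: torch.Tensor, y: torch.Tensor, margin: float = 0.4) -> torch.Tensor:
    """Formula (3) sketch: matched pairs (y=1) are pulled together via d^2, unmatched
    pairs (y=0) contribute only while the distance is still below the margin."""
    positive = y * d.pow(2)
    negative = (1.0 - y) * torch.clamp(margin - d, min=0.0).pow(2)
    return (positive + negative).mean() / 2.0

d = torch.tensor([0.1, 0.8, 0.3])     # characteristic distances of three text sample pairs
y = torch.tensor([1.0, 0.0, 0.0])     # matching parameters
loss_match = contrastive_loss(d, y)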
And C3, converging the preset text processing model based on the keyword loss information and the text loss information to obtain the trained text processing model.
The preset text processing model comprises a feature extraction network and a keyword recognition network.
The convergence mode of the preset text processing model may be various, and specifically may be as follows:
for example, loss weights are obtained, the keyword loss information and the text loss information are weighted based on the loss weights, the weighted keyword loss information and the weighted text loss information are fused to obtain target loss information, the weighted keyword loss information is adopted to converge the keyword recognition network to obtain a trained keyword recognition network, the target loss information is adopted to converge the feature extraction network to obtain a trained feature extraction network, and the trained keyword recognition network and the trained feature extraction network are used as the trained processing model.
The method for fusing the weighted keyword loss information and the weighted text loss information may be various, for example, the weighted keyword loss information and the weighted text loss information may be directly added, so that target loss information corresponding to a preset text processing model may be obtained, and may be specifically as shown in formula (4):
Loss_total = α · Loss_match + β · Loss_keyword    (4)

where Loss_total is the target loss information, α and β are the loss weights of the text loss information and the keyword loss information respectively, Loss_match is the text loss information, and Loss_keyword is the keyword loss information.
After the target loss information is obtained, the network parameters of the preset text processing model can be updated through back propagation, and multiple iterations are then performed until convergence. In the convergence process, it should be noted that for the keyword recognition network in the preset text processing model, only the weighted keyword loss information is used to update its network parameters; the weighted text loss information is not used to update the network parameters of the keyword recognition network. For the feature extraction network other than the keyword recognition network in the preset text processing model, the network parameters can be updated by using the target loss information. The keyword recognition network and the feature extraction network are iteratively trained in this way until convergence, so as to obtain the trained text processing model. It should be noted that the keyword recognition task is an auxiliary task for enhancing the noise reduction capability of the semantic matching model, so the weight of its loss value is relatively low.
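The following sketch ties the pieces above together into one multi-task update step, reusing the helper functions and modules sketched earlier. Detaching the keyword probabilities on the matching path is one way to honour the rule that the keyword recognition network is updated only by the weighted keyword loss; the loss weights alpha and beta are illustrative.

import torch

alpha, beta = 1.0, 0.1    # illustrative loss weights; the keyword task is auxiliary, so beta is small

def train_step(encoder_feats_q, encoder_feats_t, keyword_head, kw_labels, match_label, optimizer):
    """One multi-task update: keyword loss (formula 2) plus weighted contrastive loss (formula 3)."""
    q_probs = keyword_head(encoder_feats_q)             # first keyword category, per query token
    t_probs = keyword_head(encoder_feats_t)             # second keyword category, per sample token
    loss_kw = keyword_loss(t_probs, kw_labels)

    # Detach so the matching loss does not back-propagate into the keyword head.
    w_q = text_weights(q_probs.detach())
    w_t = text_weights(t_probs.detach())
    vec_q = (w_q.unsqueeze(-1) * encoder_feats_q).sum(dim=0)
    vec_t = (w_t.unsqueeze(-1) * encoder_feats_t).sum(dim=0)
    d = cosine_distance(vec_q, vec_t).unsqueeze(0)
    loss_match = contrastive_loss(d, match_label)

    loss_total = alpha * loss_match + beta * loss_kw    # formula (4)
    optimizer.zero_grad()
    loss_total.backward()
    optimizer.step()
    return loss_total.item()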
And C4, searching the target text by adopting the trained text processing model.
The target text is a text which is inquired through the inquiry text and has a semantic matching relationship with the inquiry text.
The method for searching the target text by adopting the trained text processing model can be various, and specifically can be as follows:
for example, a candidate text set can be obtained, a trained text processing model is adopted to perform feature extraction on each candidate text in the candidate text set to obtain a candidate text feature set, index information corresponding to the candidate text feature set is constructed according to candidate text features in the candidate text feature set, and when a query text is received, at least one candidate text is screened out from the candidate text set according to the index information and the query text.
The candidate text set is mainly processed in an offline manner: the text features of all candidate texts are calculated offline in advance by the trained text processing model, so as to obtain the candidate text feature set corresponding to the candidate text set, and an index database for the candidate text feature set is built with an index building tool and provided to the online retrieval system for retrieval. The type of index building tool may vary, and may include, for example, the Faiss or nmslib indexing tools.
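As an illustration of the offline stage, the sketch below builds a Faiss index over pre-computed candidate text vectors; the vector dimension, the inner-product index type and the file name are assumptions.

import numpy as np
import faiss

dim = 768
doc_vectors = np.random.rand(10000, dim).astype("float32")   # stand-in for the candidate text feature set
faiss.normalize_L2(doc_vectors)        # with normalised vectors, inner product equals cosine similarity
index = faiss.IndexFlatIP(dim)
index.add(doc_vectors)
faiss.write_index(index, "candidate_texts.index")             # the offline vector library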
After the index information corresponding to the candidate text feature set is constructed, online retrieval can be performed. The online retrieval process mainly consists of deploying the trained text processing model to an online module; when a user inputs a query text, at least one candidate text is screened out of the candidate text set as the target text according to the index information and the query text. The screening can be done in various ways: for example, the trained text processing model is used to perform feature extraction on the query text to obtain the query text feature of the query text; based on the index information, at least one candidate text feature corresponding to the query text feature is retrieved from the candidate text feature set to obtain the target candidate text feature; and the candidate text corresponding to the target candidate text feature is screened out of the candidate text set, so as to obtain the target text corresponding to the query text.
The method for retrieving at least one candidate text feature corresponding to the query text feature in the candidate text feature set based on the index information may be various, for example, feature similarity between the query text feature and the candidate text feature may be calculated through the index information, and then Top K candidate text features with the highest similarity are retrieved in the candidate text feature set based on the feature similarity, so as to obtain the target candidate text feature.
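The corresponding online step, continuing the sketch above: embed the query with the trained model (a stand-in vector here) and retrieve the Top-K nearest candidate text features from the offline index.

import numpy as np
import faiss

index = faiss.read_index("candidate_texts.index")
query_vector = np.random.rand(1, 768).astype("float32")   # would come from the trained text processing model
faiss.normalize_L2(query_vector)
scores, ids = index.search(query_vector, 10)               # Top-K target candidate text features
# `ids` indexes back into the candidate text set to recover the target texts.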
After the trained text processing model is obtained, the candidate text features of the candidate texts need to be predicted offline and the index information corresponding to the candidate text features is constructed; the candidate texts corresponding to a query text are then retrieved online, so as to obtain at least one target text corresponding to the query text. The retrieved target text and its relevant features are returned to the client through a downstream module for display; the process by which a user performs text retrieval at the client and the retrieved target text is returned can be as shown in fig. 3. The user can input the text information of the service to be searched through a search control of the application platform, the client sends the query text (query) input by the user to the server, the server retrieves at least one target text (doc) related to the query in the candidate text set and returns the retrieved doc information to the client, and the client displays the returned doc information.
The whole core framework of text processing can be mainly divided into three stages, namely a multi-task learning stage, an offline vector library generation stage and an online vector retrieval stage, as shown in fig. 4.
The multi-task learning stage is mainly used for training a preset text processing model, in the process of training the preset text processing model, a keyword recognition task and a semantic matching task are adopted to train the preset text processing model, and the two tasks can share an important module to perform parallel training to form a multi-task learning framework, as shown in fig. 5.
In the process of the keyword recognition task, the text sample is segmented into tokens, and the text feature of each segmented token is obtained through the bert model, where the text feature may be a text vector (t1, t2, t3). Then, the category probability that each token belongs to each keyword category is calculated by FC-Softmax (the keyword recognition network), and the keyword loss information is calculated by the CE loss (keyword loss function), which may be specifically shown in fig. 6.
The text vector refers to a fixed-length numeric vector converted, in a certain way, from a piece of text of indefinite length. Vectors can take two forms. One is the high-dimensional sparse vector: the length of the word list is usually taken as the length of the vector, each dimension represents a word, only the dimensions corresponding to the text words have nonzero values, and most dimensions are zero. The other is the low-dimensional dense vector: the text is input into a model such as a neural network and the vector representation is output through training; each dimension of the vector is basically nonzero and has no clear physical meaning, but the effect is generally better than that of the high-dimensional sparse vector.
In the semantic matching task, a plurality of queries are randomly extracted from the text sample set, docs with similar semantics are retrieved through a retrieval system as positive examples (label is 1), and docs in the offline library are randomly extracted as negative examples (label is 0); the tuple data consisting of the query, the doc and the label is used as a text sample pair. In the context of text matching, one such tuple includes two texts and a label (represented by 0 or 1): assuming the two texts are A and B, if the two texts match, the tuple is (A, B, 1); if not, the tuple is (A, B, 0). The double-tower model is trained through the text sample pairs, where the double-tower model can be understood as generating the text vectors of the query and the doc respectively by adopting the keyword recognition network and the feature extraction network, so as to obtain the text feature of each text sample in the text sample pair. Then, the text loss information between the text vector of the query and the text vector of the doc is calculated by the contrastive loss function, which may be specifically shown in fig. 7.
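The tuple construction can be sketched as below; retrieved_docs and offline_docs are hypothetical inputs, and the sampling is reduced to a single random draw per query for illustration:

import random

def build_pairs(queries, retrieved_docs, offline_docs):
    pairs = []
    for q in queries:
        pos = retrieved_docs[q]                # doc with similar semantics -> positive example
        neg = random.choice(offline_docs)      # doc drawn at random from the offline library -> negative example
        pairs.append((q, pos, 1))              # matched tuple (A, B, 1)
        pairs.append((q, neg, 0))              # unmatched tuple (A, B, 0)
    return pairs

Each side of a tuple is then encoded by its tower, and the two resulting text vectors feed the contrastive loss.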
In the multi-task learning training process, the two tasks propagate forward simultaneously (sharing the bert model and the FC-Softmax network), the keyword loss information and the text loss information are calculated, and weighted summation is then performed to obtain the overall loss information; the bert model is converged based on the overall loss information, and the FC-Softmax network is converged based on the weighted keyword loss information, so that the trained text processing model is obtained. In addition, other auxiliary tasks can be added in the process of training the preset text processing model to improve the precision and generalization capability of the preset text processing model in the semantic matching scene.
The bert model is adopted to segment the text samples, and the segmentation granularity is the token, so Chinese word segmentation is not needed and precision errors caused by word segmentation tools are avoided. In addition, the bert model can make full use of text sequence information and has a stronger feature extraction capability, so the extracted text features are more accurate. Moreover, a three-class keyword recognition task is designed, and the text weight of each token can be effectively obtained by fusing the predicted category probabilities of the keyword categories, which remarkably improves the keyword recognition capability of the preset text processing model, further improves the semantic relevance calculation precision between query and doc, and can effectively relieve the problems of insufficient recall and reverse ordering.
As can be seen from the above, in the embodiments of the present application, after a text word sample and a text sample pair are obtained, a preset text processing model is used to perform word segmentation on the text samples in the text sample pair, and feature extraction is performed on the segmented target text words and text word samples to obtain the text word features of the target text words and the text word sample features of the text word samples; then, based on the text word features and the text word sample features, keyword category recognition is performed on the target text words and the text word samples to obtain a first keyword category of the target text words and a second keyword category of the text word samples; then, the text word features are weighted according to the first keyword category to obtain the text feature of each text sample in the text sample pair, and the feature distances between the text features are calculated; the preset text processing model is then converged based on the second keyword category, the labeled keyword category, the feature distance and the labeled semantic matching relation to obtain a trained text processing model, and the trained text processing model is used to retrieve the target text. According to the scheme, the keyword category identification task and the semantic matching task are trained simultaneously through a multi-task framework, the first keyword category is identified to weight the text word features, and the word weight identification capability of the text processing model in the semantic matching task is enhanced in an explicit mode, so that information noise is effectively reduced and the accuracy of text processing can be improved.
The method described in the above examples is further illustrated in detail below by way of example.
In this embodiment, a description will be given by taking as an example a case in which the text processing apparatus is specifically integrated in an electronic device, the electronic device is a server, and the keyword categories include three types: non-keywords (denoted by 0), general keywords (denoted by 1), and important keywords (denoted by 2).
As shown in fig. 8, a text processing method specifically includes the following steps:
201. the server obtains text word samples and text sample pairs.
For example, the server may obtain a text sample set and screen out at least one text sample from the text sample set. A text sample that semantically matches the text sample may be screened out from the text sample set through the retrieval system as a semantic text sample; at this time, the semantic relationship between the screened semantic text sample and the text sample may be semantic matching, and the text sample pair at this time may be a positive text sample pair. A text sample in the offline library may also be randomly extracted as the semantic text sample corresponding to the text sample; at this time, the semantic relationship between the extracted semantic text sample and the text sample may be semantic mismatching, and the text sample pair at this time may be a negative text sample pair.
The server can use a token segmentation method in a Bert network in a preset text processing model to segment each text sample into a single token, directly identify the keyword category of the token, label the keyword category on the token according to an identification result, or send the segmented token to a label server, receive the keyword category of the token returned by the label server, and label the corresponding keyword category on the token, thereby obtaining the text word sample.
The server can determine the semantic matching relationship between the text sample and the semantic text sample according to the semantic relationship between them, form a text pair from the text sample and the corresponding semantic text sample, and label the semantic matching relationship in the text pair, thereby obtaining the text sample pair. The text sample pairs may include positive text sample pairs and negative text sample pairs: the text sample in a positive text sample pair matches its semantic text sample, and the text sample in a negative text sample pair does not match its semantic text sample. In addition, when the text sample is a query text sample, the corresponding semantic text sample may be the target text sample.
202. The server performs word segmentation on the text samples in the text sample pair by adopting a preset text processing model, and performs feature extraction on the segmented target text words and the text word samples to obtain text features of the target text words and text word sample features of the text word samples.
For example, the server may segment each text sample into single tokens by using the token segmentation method in the Bert network of the preset text processing model, so as to obtain the target text words after each text sample in the text sample pair is segmented; alternatively, each text sample in the text sample pair may be directly segmented into characters, that is, into single Chinese characters, English words or word roots, so as to obtain the target text words after each text sample in the text sample pair is segmented.
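A hedged sketch of the token-granularity segmentation using the Hugging Face tokenizer interface as one possible realisation; the checkpoint name is an assumption for the example and is not specified by this application:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")   # assumed checkpoint
# Chinese text falls apart into single characters and English into word/word-root pieces,
# so no separate Chinese word-segmentation tool is involved.
tokens = tokenizer.tokenize("文本检索 text retrieval")
print(tokens)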
The server can adopt the Bert network in the preset text processing model, or a model such as XLNet or ELECTRA, to respectively perform feature extraction on the target text words and the text word samples, so as to obtain the vector representation (q1, q2, q3) of each target text word and the vector representation (t1, t2, t3) of each text word sample; the vector representation of the target text word is taken as the text word feature of the target text word, and the vector representation of the text word sample is taken as the text word sample feature of the text word sample.
203. The server identifies the keyword categories of the target text words and the text word samples based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text words and a second keyword category of the text word samples.
For example, the server may use a fully-connected neural network to normalize the text word features and the text word sample features separately. The category probability that each target text word belongs to each keyword category is calculated through a Softmax function, giving the probabilities (p_qi^0, p_qi^1, p_qi^2) for the i-th target text word, and (p_qi^0, p_qi^1, p_qi^2) is taken as the first keyword category of the target text word. Likewise, the category probability that each text word sample belongs to each keyword category is calculated through the Softmax function, giving (p_ti^0, p_ti^1, p_ti^2) for the i-th text word sample, and (p_ti^0, p_ti^1, p_ti^2) is taken as the second keyword category of the text word sample.
A keyword recognition network in the preset text processing model can be used for performing keyword category recognition on the target text words and the text word samples. The network structure of the keyword recognition network can be an FC-Softmax network, and in the process of recognizing the keyword categories of the target text words and the text word samples, the FC-Softmax network and its network parameters are shared.
204. And the server determines the text weight of the text word characteristic according to the first keyword category.
For example, the server identifies the category probability of each keyword category in the first keyword category, i.e. the first category probabilities (p_qi^0, p_qi^1, p_qi^2), screens the category probabilities of the general-keyword category and the important-keyword category from the first category probabilities to obtain the basic category probabilities (p_qi^1, p_qi^2), acquires a fusion parameter for each basic category probability, fuses each fusion parameter with the corresponding basic category probability, adds the fused basic category probabilities to obtain a target basic category probability, and then calculates the mean value of the target basic category probabilities, so as to obtain the text weight of the text word features, which may be specifically shown in formula (1).
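Formula (1) is not reproduced in this excerpt; assuming it averages the fused general-keyword and important-keyword probabilities as described above, one possible sketch of the text weight is the following, where a1 and a2 stand in for the fusion parameters:

def text_weight(p_general, p_important, a1=1.0, a2=2.0):
    # p_general, p_important: basic category probabilities of one token (classes 1 and 2)
    # a1, a2: assumed fusion parameters; a larger a2 lets important keywords dominate the weight
    fused_sum = a1 * p_general + a2 * p_important   # fuse and add the basic category probabilities
    return fused_sum / 2.0                          # mean of the target basic category probability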
205. And the server weights the text word features based on the text weight, and fuses the weighted text word features to obtain the text features of each text sample in the text sample pair.
For example, the server weights the text word features based on the text weights to obtain weighted text word features, and fuses the weighted text word features to obtain fused text features (vec_q = Σ_i w_qi · q_i). Text features belonging to the query text sample are screened out from the fused text features to obtain the query text feature corresponding to the query text sample, and the text features belonging to each field of the target text sample are screened out from the fused text features to obtain at least one field text feature. Associated feature extraction is performed on the field text features to obtain the associated features of the field text features, and based on the associated features, the association weight of the field text features is determined, wherein the association weight is used for indicating the association relationship among the field text features; the field text features are weighted according to the association weight, and the weighted field text features are fused to obtain the target field text feature, which can be understood as the text feature corresponding to the target text sample. The target field text feature and the query text feature are taken as the text feature of each text sample in the text sample pair.
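The fusion vec_q = Σ_i w_qi · q_i is a weighted sum over token vectors; a sketch with assumed tensor shapes (the field-level association weighting is left out):

import torch

def pool_text_feature(token_vecs, weights):
    # token_vecs: (seq_len, hidden) token features q_i; weights: (seq_len,) text weights w_qi
    return (weights.unsqueeze(-1) * token_vecs).sum(dim=0)   # vec_q = sum_i w_qi * q_i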
206. The server calculates feature distances between the text features.
For example, the server may directly calculate cosine distances between text features to obtain feature distances between the text features, or may also calculate euclidean distances between the text features to obtain feature distances between the text features.
207. And the server converges the preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain the trained text processing model.
For example, the server may determine keyword loss information of a text word sample based on the second keyword class and the labeled keyword class, determine text loss information of a text sample pair according to the labeled semantic matching relationship and the characteristic distance, and converge the preset text processing model based on the keyword loss information and the text loss information to obtain the trained text processing model, which may specifically be as follows:
(1) and the server determines the keyword loss information of the text word sample based on the second keyword category and the labeled keyword category.
For example, the server identifies the category probability of each keyword category in the second keyword category to obtain the second category probabilities (p_ti^0, p_ti^1, p_ti^2). When the labeled keyword category of the ith text word sample is the 1st category, p_ti^1 can be screened out of (p_ti^0, p_ti^1, p_ti^2) as the target category probability. A keyword category parameter of the text word sample is determined according to the labeled keyword category: when the ith text word sample belongs to the class-c keyword, the keyword parameter can be 1, otherwise the keyword parameter can be 0. After the target category probability is preprocessed, the preprocessed target category probability is multiplied by the keyword parameter to obtain the basic keyword loss information of the ith text word sample; the basic keyword loss information of the text word samples is then accumulated, and the mean value of the accumulated keyword loss information is calculated, so that the keyword loss information of the text word samples is obtained, which may be specifically shown in formula (2).
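Formula (2) is not shown in this excerpt; if it is the usual averaged cross-entropy (the preprocessing being a logarithm), a hedged rendering is:

import math

def keyword_loss(probs, labels, num_classes=3):
    # probs[i][c]: predicted probability that text word sample i belongs to keyword class c
    # labels[i]: labeled keyword category of text word sample i (0, 1 or 2)
    total = 0.0
    for p_i, y_i in zip(probs, labels):
        for c in range(num_classes):
            y_ic = 1.0 if y_i == c else 0.0              # keyword category parameter
            total += -y_ic * math.log(p_i[c] + 1e-12)    # preprocess (log), multiply by the parameter, accumulate
    return total / len(labels)                           # mean over the text word samples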
(2) And the server determines the text loss information of the text sample pair according to the labeled semantic matching relationship and the characteristic distance.
For example, when the semantic matching relationship of the text samples in the text sample pair is matching, the server may determine that the corresponding matching parameter may be 1, and when the semantic matching relationship of the text samples in the text sample pair is not matching, the server may determine that the corresponding matching parameter may be 0. When the matching parameter is 1 and the characteristic distance is smaller than the preset distance threshold, calculating a distance difference between the characteristic distance and the preset distance threshold, calculating a parameter difference between the matching parameter and the preset parameter threshold, fusing the distance difference and the parameter difference, and fusing the fused difference, the matching parameter and the characteristic distance to obtain text loss information of the text sample pair, which can be specifically shown in formula (3).
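Formula (3) is likewise not reproduced here; the description matches the shape of a standard contrastive loss, which under that assumption can be sketched as follows, with margin standing in for the preset distance threshold and y for the matching parameter:

def contrastive_loss(distance, y, margin=1.0):
    # y = 1 for a matched text sample pair, y = 0 for an unmatched one
    pos_term = y * distance ** 2                             # pull matched pairs together
    neg_term = (1 - y) * max(margin - distance, 0.0) ** 2    # push unmatched pairs beyond the margin
    return pos_term + neg_term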
(3) And the server converges the preset text processing model based on the keyword loss information and the text loss information to obtain the trained text processing model.
For example, the server obtains the loss weights, weights the keyword loss information and the text loss information respectively based on the loss weights, and directly adds the weighted keyword loss information and the weighted text loss information, so as to obtain the target loss information corresponding to the preset text processing model, which may be specifically shown in formula (4). The keyword recognition network is converged by using the weighted keyword loss information to obtain the trained keyword recognition network, the feature extraction network is converged by using the target loss information to obtain the trained feature extraction network, and the trained keyword recognition network and the trained feature extraction network are taken as the trained text processing model.
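Assuming formula (4) is a plain weighted sum of the two losses, a simplified training step could look like the following; compute_keyword_loss and compute_text_loss are hypothetical helpers, and a single optimizer is used here although the application routes the weighted keyword loss to the keyword recognition network and the target loss to the feature extraction network:

import torch

def train_step(model, optimizer, batch, w_kw=0.5, w_text=0.5):
    kw_loss = compute_keyword_loss(model, batch)       # hypothetical helper for the keyword recognition task
    text_loss = compute_text_loss(model, batch)        # hypothetical helper for the semantic matching task
    target_loss = w_kw * kw_loss + w_text * text_loss  # formula (4): weighted sum of both losses
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item()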
208. And the server searches the target text by adopting the trained text processing model.
For example, the server may obtain a candidate text set, calculate text features of all candidate texts offline in advance through a trained text processing model to obtain a candidate text feature set corresponding to the candidate text set, construct an index library of the candidate text feature set by using an indexing tool such as Faiss or nmslib, and provide the index library to the online retrieval system for retrieval. Deploying the trained text processing model to an online module, when a user inputs a query text, performing feature extraction on the query text by using the trained text processing model to obtain query text features of the query text, calculating feature similarity between the query text features and candidate text features through index information, and then retrieving Top K candidate text features with highest similarity from a candidate text feature set based on the feature similarity so as to obtain target candidate text features. And screening candidate texts corresponding to the target candidate text characteristics from the candidate text set to obtain target texts corresponding to the query texts. And returning the retrieved target text and the relevant characteristics thereof to the client through a downstream module for displaying.
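An illustrative sketch of the offline index construction and the online lookup with Faiss; the dimension, the value of K and the encode() helper are assumptions, not details taken from this application:

import numpy as np
import faiss

dim = 768
candidate_vecs = encode(candidate_texts).astype(np.float32)   # hypothetical helper: trained model -> candidate features
faiss.normalize_L2(candidate_vecs)                            # normalise so inner product behaves like cosine similarity
index = faiss.IndexFlatIP(dim)
index.add(candidate_vecs)                                     # offline: build the vector library

query_vec = encode([query_text]).astype(np.float32)           # online: embed the incoming query text
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 10)                     # Top K candidate text features (K = 10 assumed)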
As can be seen from the above, after obtaining the text word sample and the text sample pair, the server in this embodiment performs word segmentation on the text samples in the text sample pair by using the preset text processing model, performs feature extraction on the segmented target text words and text word samples to obtain the text word features of the target text words and the text word sample features of the text word samples, and performs keyword category identification on the target text words and the text word samples based on the text word features and the text word sample features to obtain the first keyword category of the target text words and the second keyword category of the text word samples; the server then weights the text word features according to the first keyword category to obtain the text feature of each text sample in the text sample pair and calculates the feature distances between the text features, converges the preset text processing model based on the second keyword category, the labeled keyword category, the feature distance and the labeled semantic matching relation to obtain the trained text processing model, and retrieves the target text by using the trained text processing model. According to the scheme, the keyword category identification task and the semantic matching task are trained simultaneously through a multi-task framework, the first keyword category is identified to weight the text word features, and the word weight identification capability of the text processing model in the semantic matching task is enhanced in an explicit mode, so that information noise is effectively reduced and the accuracy of text processing can be improved.
In order to better implement the above method, the embodiment of the present invention further provides a text processing apparatus, which may be integrated in an electronic device, such as a server or a terminal, and the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 9, the text processing apparatus may include an acquisition unit 301, a segmentation unit 302, a recognition unit 303, a weighting unit 304, and a retrieval unit 305 as follows:
(1) an acquisition unit 301;
an obtaining unit 301, configured to obtain a text word sample and a text sample pair, where the text word sample includes a text word labeled with a keyword category, and the text sample pair includes a text pair labeled with a semantic matching relationship.
For example, the obtaining unit 301 may be specifically configured to obtain a text sample set, screen at least one text sample and a semantic text sample corresponding to the text sample from the text sample set, perform word segmentation on the text sample by using a preset text processing model, mark a keyword category in the text word after the word segmentation to obtain a text word sample, and mark a semantic matching relationship in a text pair composed of the text sample and the semantic text sample according to a semantic relationship between the text sample and the semantic text sample to obtain a text sample pair.
(2) A word segmentation unit 302;
the word segmentation unit 302 is configured to perform word segmentation on a text sample in the text sample pair by using a preset text processing model, and perform feature extraction on the segmented target text word and the text word sample to obtain a text word feature of the target text word and a text word sample feature of the text word sample.
For example, the word segmentation unit 302 may be specifically configured to segment each text sample into single Chinese characters, English words or word roots, so as to obtain the target text words after each text sample in the text sample pair is segmented, and to respectively perform feature extraction on the target text words and the text word samples by adopting the Bert network in the preset text processing model, or a model such as XLNet or ELECTRA, so as to obtain the vector representation (q1, q2, q3) of each target text word and the vector representation (t1, t2, t3) of each text word sample; the vector representation of the target text word is taken as the text word feature of the target text word, and the vector representation of the text word sample is taken as the text word sample feature of the text word sample.
(3) An identification unit 303;
the identifying unit 303 is configured to perform keyword category identification on the target text word and the text word sample based on the text word feature and the text word sample feature, so as to obtain a first keyword category of the target text word and a second keyword category of the text word sample.
For example, the identifying unit 303 may be specifically configured to perform normalization processing on the text word features and the text word sample features respectively by using a keyword identification network in a preset text processing model, map a category probability that a target text word belongs to each keyword category according to the normalized text word features, obtain a first keyword category of the target text word, map a category probability that a text word sample belongs to each keyword category based on the normalized text word sample features, and obtain a second keyword category of the text word sample.
(4) A weighting unit 304;
and the weighting unit 304 is configured to weight the text word features according to the first keyword category to obtain text features of each text sample in the text sample pair, and calculate a feature distance between the text features.
For example, the weighting unit 304 may be specifically configured to determine a text weight of a text word feature according to the first keyword category, weight the text word feature based on the text weight, fuse the weighted text word features to obtain a text feature of each text sample in a text sample pair, and calculate a feature distance between the text features.
(5) A retrieval unit 305;
and the retrieving unit 305 is configured to converge the preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance, and the labeled semantic matching relationship, obtain a trained text processing model, and retrieve the target text by using the trained text processing model.
For example, the retrieving unit 305 may be specifically configured to determine keyword loss information of a text word sample based on the second keyword class and the labeled keyword class, determine text loss information of a text sample pair according to the labeled semantic matching relationship and the characteristic distance, and converge the preset text processing model based on the keyword loss information and the text loss information to obtain the trained text processing model. Acquiring a candidate text set, extracting the characteristics of each candidate text in the candidate text set by adopting a trained text processing model to obtain a candidate text characteristic set, constructing index information corresponding to the candidate text characteristic set according to the candidate text characteristics in the candidate text characteristic set, and screening at least one candidate text in the candidate text set according to the index information and a query text when the query text is received.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
As can be seen from the above, in this embodiment, after the obtaining unit 301 obtains a text word sample and a text sample pair, the word segmentation unit 302 performs word segmentation on the text sample in the text sample pair by using a preset text processing model, and performs feature extraction on a target text word and the text word sample after word segmentation to obtain a text word feature of the target text word and a text word sample feature of the text word sample, then the recognition unit 303 performs keyword category recognition on the target text word and the text word sample based on the text word feature and the text word sample feature to obtain a first keyword category of the target text word and a second keyword category of the text word sample, then the weighting unit 304 performs weighting on the text word features according to the first keyword category to obtain a text feature of each text sample in the text sample pair, and calculates a feature distance between the text features, then, the retrieval unit 305 converges the preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance, and the labeled semantic matching relationship to obtain a trained text processing model, and retrieves the target text by using the trained text processing model; according to the scheme, the keyword category identification task and the semantic matching task are trained simultaneously through a multi-task framework, the first keyword category is identified, the text word features are weighted, and the word weight identification capability of the text processing model in the semantic matching task is enhanced in an explicit mode, so that the information noise is effectively reduced, and the accuracy of text processing can be improved.
An embodiment of the present invention further provides an electronic device, as shown in fig. 10, which shows a schematic structural diagram of the electronic device according to the embodiment of the present invention, specifically:
the electronic device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 10 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the whole electronic device by various interfaces and lines, performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby performing overall monitoring of the electronic device. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by operating the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 through a power management system, so that functions of managing charging, discharging, and power consumption are realized through the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The electronic device may further include an input unit 404, and the input unit 404 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the electronic device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the electronic device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application program stored in the memory 402, thereby implementing various functions as follows:
obtaining a text word sample and a text sample pair, wherein the text word sample comprises a text word labeled with a keyword category, the text sample pair comprises a text pair labeled with a semantic matching relationship, segmenting the text sample in the text sample pair by adopting a preset text processing model, extracting the characteristics of a segmented target text word and the text word sample to obtain the text word characteristics of the target text word and the text word sample characteristics of the text word sample, identifying the keyword categories of the target text word and the text word sample based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text word and a second keyword category of the text word sample, weighting the text word characteristics according to the first keyword category to obtain the text characteristics of each text sample in the text sample pair, and calculating the characteristic distance between the text characteristics, and converging the preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieving the target text by adopting the trained text processing model.
For example, a text sample set is obtained, at least one text sample and a semantic text sample corresponding to the text sample are screened out from the text sample set, the preset text processing model is used to perform word segmentation on the text sample, a keyword category is marked in the segmented text words to obtain a text word sample, and a semantic matching relationship is marked in the text pair consisting of the text sample and the semantic text sample according to the semantic relationship between the text sample and the semantic text sample, so as to obtain a text sample pair. Each text sample is segmented into single Chinese characters, English words or word roots and the like, so that the target text words after each text sample in the text sample pair is segmented are obtained. The Bert network in the preset text processing model, or a model such as XLNet or ELECTRA, is adopted to respectively perform feature extraction on the target text words and the text word samples, so as to obtain the vector representation (q1, q2, q3) of each target text word and the vector representation (t1, t2, t3) of each text word sample; the vector representation of the target text word is taken as the text word feature of the target text word, and the vector representation of the text word sample is taken as the text word sample feature of the text word sample. The keyword recognition network in the preset text processing model is adopted to respectively normalize the text word features and the text word sample features, the category probability that the target text word belongs to each keyword category is mapped according to the normalized text word features to obtain the first keyword category of the target text word, and the category probability that the text word sample belongs to each keyword category is mapped based on the normalized text word sample features to obtain the second keyword category of the text word sample. The text weight of the text word features is determined according to the first keyword category, the text word features are weighted based on the text weight, the weighted text word features are fused to obtain the text feature of each text sample in the text sample pair, and the feature distances between the text features are calculated. The keyword loss information of the text word sample is determined based on the second keyword category and the labeled keyword category, the text loss information of the text sample pair is determined according to the labeled semantic matching relationship and the feature distance, and the preset text processing model is converged based on the keyword loss information and the text loss information to obtain the trained text processing model. A candidate text set is acquired, the trained text processing model is adopted to perform feature extraction on each candidate text in the candidate text set to obtain a candidate text feature set, index information corresponding to the candidate text feature set is constructed according to the candidate text features in the candidate text feature set, and when a query text is received, at least one candidate text is screened out from the candidate text set according to the index information and the query text.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
As can be seen from the above, in the embodiments of the present invention, after a text word sample and a text sample pair are obtained, a preset text processing model is adopted to segment the text samples in the text sample pair, and feature extraction is performed on the segmented target text words and text word samples to obtain the text word features of the target text words and the text word sample features of the text word samples; then, based on the text word features and the text word sample features, keyword category recognition is performed on the target text words and the text word samples to obtain a first keyword category of the target text words and a second keyword category of the text word samples; then, the text word features are weighted according to the first keyword category to obtain the text feature of each text sample in the text sample pair, and the feature distances between the text features are calculated; the preset text processing model is then converged based on the second keyword category, the labeled keyword category, the feature distance and the labeled semantic matching relation to obtain a trained text processing model, and the trained text processing model is used to retrieve the target text. According to the scheme, the keyword category identification task and the semantic matching task are trained simultaneously through a multi-task framework, the first keyword category is identified to weight the text word features, and the word weight identification capability of the text processing model in the semantic matching task is enhanced in an explicit mode, so that information noise is effectively reduced and the accuracy of text processing can be improved.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the embodiment of the present invention provides a computer-readable storage medium, in which a plurality of instructions are stored, and the instructions can be loaded by a processor to execute the steps in any one of the text processing methods provided by the embodiment of the present invention. For example, the instructions may perform the steps of:
obtaining a text word sample and a text sample pair, wherein the text word sample comprises a text word labeled with a keyword category, the text sample pair comprises a text pair labeled with a semantic matching relationship, segmenting the text sample in the text sample pair by adopting a preset text processing model, extracting the characteristics of a segmented target text word and the text word sample to obtain the text word characteristics of the target text word and the text word sample characteristics of the text word sample, identifying the keyword categories of the target text word and the text word sample based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text word and a second keyword category of the text word sample, weighting the text word characteristics according to the first keyword category to obtain the text characteristics of each text sample in the text sample pair, and calculating the characteristic distance between the text characteristics, and converging the preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieving the target text by adopting the trained text processing model.
For example, a text sample set is obtained, at least one text sample and a semantic text sample corresponding to the text sample are screened out from the text sample set, a preset text processing model is adopted to perform word segmentation on the text sample, a keyword category is marked in the segmented text words to obtain a text word sample, and a semantic matching relationship is marked in the text pair consisting of the text sample and the semantic text sample according to the semantic relationship between the text sample and the semantic text sample, so as to obtain a text sample pair. Each text sample is segmented into single Chinese characters, English words or word roots and the like, so that the target text words after each text sample in the text sample pair is segmented are obtained. The Bert network in the preset text processing model, or a model such as XLNet or ELECTRA, is adopted to respectively perform feature extraction on the target text words and the text word samples, so as to obtain the vector representation (q1, q2, q3) of each target text word and the vector representation (t1, t2, t3) of each text word sample; the vector representation of the target text word is taken as the text word feature of the target text word, and the vector representation of the text word sample is taken as the text word sample feature of the text word sample. The keyword recognition network in the preset text processing model is adopted to respectively normalize the text word features and the text word sample features, the category probability that the target text word belongs to each keyword category is mapped according to the normalized text word features to obtain the first keyword category of the target text word, and the category probability that the text word sample belongs to each keyword category is mapped based on the normalized text word sample features to obtain the second keyword category of the text word sample. The text weight of the text word features is determined according to the first keyword category, the text word features are weighted based on the text weight, the weighted text word features are fused to obtain the text feature of each text sample in the text sample pair, and the feature distances between the text features are calculated. The keyword loss information of the text word sample is determined based on the second keyword category and the labeled keyword category, the text loss information of the text sample pair is determined according to the labeled semantic matching relationship and the feature distance, and the preset text processing model is converged based on the keyword loss information and the text loss information to obtain the trained text processing model. A candidate text set is acquired, the trained text processing model is adopted to perform feature extraction on each candidate text in the candidate text set to obtain a candidate text feature set, index information corresponding to the candidate text feature set is constructed according to the candidate text features in the candidate text feature set, and when a query text is received, at least one candidate text is screened out from the candidate text set according to the index information and the query text.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the computer-readable storage medium may include: read Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the computer-readable storage medium can execute the steps in any text processing method provided in the embodiment of the present invention, the beneficial effects that can be achieved by any text processing method provided in the embodiment of the present invention can be achieved, which are detailed in the foregoing embodiments and will not be described again here.
According to an aspect of the present application, a computer program product or a computer program is provided, comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various alternative implementations of the text processing aspect or the text retrieval aspect described above.
The text processing method, the text processing apparatus, the electronic device, and the computer-readable storage medium according to the embodiments of the present invention are described in detail, and a specific example is applied to illustrate the principles and embodiments of the present invention, and the description of the embodiments is only used to help understanding the method and the core concept of the present invention; meanwhile, for those skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (18)

1. A method of text processing, comprising:
acquiring a text word sample and a text sample pair, wherein the text word sample comprises a text word labeled with a keyword category, and the text sample pair comprises a text pair labeled with a semantic matching relationship;
segmenting words of text samples in the text sample pairs by adopting a preset text processing model, and extracting characteristics of segmented target text words and text word samples to obtain text word characteristics of the target text words and text word sample characteristics of the text word samples;
performing keyword category identification on the target text word and the text word sample based on the text word characteristics and the text word sample characteristics to obtain a first keyword category of the target text word and a second keyword category of the text word sample;
according to the first keyword category, weighting the text word characteristics to obtain the text characteristics of each text sample in the text sample pair, and calculating the characteristic distance between the text characteristics;
and converging a preset text processing model based on the second keyword category, the labeled keyword category, the characteristic distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieving a target text by adopting the trained text processing model.
2. The method of claim 1, wherein weighting the text word features according to the first keyword category to obtain the text features of each of the text sample pairs comprises:
determining the text weight of the text word characteristic according to the first keyword category;
and weighting the text word features based on the text weight, and fusing the weighted text word features to obtain the text features of each text sample in the text sample pair.
3. The method of claim 2, wherein determining the text weight of the text word feature according to the first keyword category comprises:
identifying the category probability of each keyword category in the first keyword category to obtain a first category probability;
screening out the category probability of at least one preset keyword category from the first category probability to obtain a basic category probability;
and fusing the basic category probability to obtain the text weight of the text word characteristics.
4. The method of claim 2, wherein the text sample pair includes a query text sample and a target text sample, and the fusing the weighted text word features to obtain the text features of each text sample in the text sample pair includes:
fusing the weighted text word features to obtain fused text word features;
extracting query text features corresponding to the query text samples and at least one field text feature corresponding to the target text sample from the fused text features;
and fusing the field text features to obtain target field text features, and taking the target field text features and the query text features as the text features of each text sample in the text sample pair.
5. The text processing method according to claim 4, wherein the fusing the field text features to obtain target field text features comprises:
extracting the associated features of the field text features to obtain the associated features of the field text features;
determining an association weight of the field text features based on the association features, wherein the association weight is used for indicating an association relation between the field text features;
and weighting the field text features according to the association weight, and fusing the weighted field text features to obtain target field text features.
6. The method according to any one of claims 1 to 5, wherein the converging a preset text processing model based on the second keyword category, the labeled keyword category, the feature distance, and the labeled semantic matching relationship to obtain a trained text processing model comprises:
determining keyword loss information of the text word sample based on the second keyword category and the labeled keyword category;
determining text loss information of the text sample pair according to the labeled semantic matching relation and the characteristic distance;
and converging the preset text processing model based on the keyword loss information and the text loss information to obtain a trained text processing model.
7. The method of claim 6, wherein determining keyword loss information for the sample of text words based on the second keyword category and the tagged keyword category comprises:
identifying the category probability of each keyword category in the second keyword categories to obtain second category probabilities;
screening out category probabilities corresponding to the labeled keyword categories from the second category probabilities to obtain target category probabilities;
and fusing the target category probability and the labeled keyword category, and calculating the mean value of the fused keyword category to obtain the keyword loss information of the text word sample.
8. The method according to claim 6, wherein determining text loss information of the text sample pair according to the labeled semantic matching relationship and the feature distance comprises:
determining matching parameters of the text sample pairs according to the labeled semantic matching relationship;
and when the matching parameters are preset matching parameters and the characteristic distance is smaller than a preset distance threshold, fusing the matching parameters and the characteristic distance to obtain text loss information of the text sample pair.
9. The method according to claim 8, wherein the fusing the matching parameters and the feature distances to obtain text loss information of the text sample pairs comprises:
calculating a distance difference value between the characteristic distance and the preset distance threshold;
calculating a parameter difference value between the matching parameter and a preset parameter threshold value, and fusing the distance difference value and the parameter difference value;
and fusing the fused difference, the matching parameters and the characteristic distance to obtain text loss information of the text sample pair.
10. The method according to claim 6, wherein the preset text processing model comprises a feature extraction network and a keyword recognition network, and the converging the preset text processing model based on the keyword loss information and the text loss information to obtain a trained text processing model comprises:
acquiring loss weight, and respectively weighting the keyword loss information and the text loss information based on the loss weight;
fusing the weighted keyword loss information and the weighted text loss information to obtain target loss information;
adopting the weighted keyword loss information to converge the keyword recognition network to obtain a trained keyword recognition network;
and converging the feature extraction network by adopting target loss information to obtain a trained feature extraction network, and taking the trained keyword recognition network and the trained feature extraction network as a trained text processing model.
11. The method of claim 10, wherein the performing keyword category recognition on the target text word and the text word sample based on the text word features and the text word sample features to obtain a first keyword category of the target text word and a second keyword category of the text word sample comprises:
respectively carrying out normalization processing on the text word characteristics and the text word sample characteristics by adopting the keyword recognition network;
mapping the category probability of the target text word belonging to each keyword category according to the normalized text word characteristics to obtain a first keyword category of the target text word;
and mapping the category probability of the text word sample belonging to each keyword category based on the normalized text word sample characteristics to obtain a second keyword category of the text word sample.
12. The method according to any one of claims 1 to 5, wherein the retrieving the target text using the trained text processing model comprises:
acquiring a candidate text set, and performing feature extraction on each candidate text in the candidate text set by adopting the trained text processing model to obtain a candidate text feature set;
constructing index information corresponding to the candidate text feature set according to the candidate text features in the candidate text feature set;
and when a query text is received, screening at least one candidate text in the candidate text set as a target text according to the index information and the query text.
13. The method of claim 12, wherein the filtering out at least one candidate text from the candidate text set as a target text according to the index information and the query text comprises:
extracting the characteristics of the query text by adopting the trained text processing model to obtain the query text characteristics of the query text;
based on the index information, at least one candidate text feature corresponding to the query text feature is retrieved from the candidate text feature set to obtain a target candidate text feature;
and screening out candidate texts corresponding to the target candidate text characteristics from the candidate text set to obtain target texts corresponding to the query texts.
14. The method of any of claims 1 to 5, wherein the obtaining a text word sample and text sample pair comprises:
acquiring a text sample set, and screening at least one text sample and a semantic text sample corresponding to the text sample in the text sample set, wherein the semantic text sample is a text sample having a semantic relationship with the text sample;
performing word segmentation on the text sample by adopting the preset text processing model, and marking a keyword category in the text word after word segmentation to obtain a text word sample;
and according to the semantic relation between the text sample and the semantic text sample, labeling a semantic matching relation in a text pair consisting of the text sample and the semantic text sample to obtain a text sample pair.
15. A text processing apparatus, comprising:
the system comprises an acquisition unit, a semantic matching unit and a semantic matching unit, wherein the acquisition unit is used for acquiring a text word sample and a text sample pair, the text word sample comprises a text word labeled with a keyword category, and the text sample pair comprises a text pair labeled with a semantic matching relation;
a word segmentation unit, configured to perform word segmentation on a text sample in the text sample pair by adopting a preset text processing model, and perform feature extraction on a segmented target text word and the text word sample to obtain a text word feature of the target text word and a text word sample feature of the text word sample;
a recognition unit, configured to perform keyword category recognition on the target text word and the text word sample based on the text word feature and the text word sample feature to obtain a first keyword category of the target text word and a second keyword category of the text word sample;
a weighting unit, configured to weight the text word feature according to the first keyword category to obtain a text feature of each text sample in the text sample pair, and calculate a feature distance between the text features;
and a retrieval unit, configured to converge the preset text processing model based on the second keyword category, the labeled keyword category, the feature distance and the labeled semantic matching relation to obtain a trained text processing model, and retrieve a target text by adopting the trained text processing model.
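The retrieval unit's convergence step combines two supervision signals: the predicted second keyword category against the labeled keyword category, and the feature distance against the labeled semantic matching relation. The patent does not fix the exact loss functions; the sketch below assumes cross-entropy for the category term and a contrastive form for the distance term, purely as one plausible reading:

```python
import torch
import torch.nn.functional as F

def joint_training_loss(word_logits, word_category_labels,
                        text_features_a, text_features_b, match_labels,
                        margin=0.5, alpha=1.0):
    """Illustrative joint loss for converging the preset text processing model.

    word_logits:          (num_words, num_categories) predicted keyword categories.
    word_category_labels: (num_words,) labeled keyword categories.
    text_features_a/b:    (batch, dim) weighted text features of each text in a pair.
    match_labels:         (batch,) 1 for a labeled semantic match, 0 otherwise.
    """
    # Keyword-category term: predicted second keyword category vs. labeled category.
    category_loss = F.cross_entropy(word_logits, word_category_labels)
    # Feature-distance term: pull matching pairs together, push non-matching apart.
    distance = F.pairwise_distance(text_features_a, text_features_b)
    match = match_labels.float()
    distance_loss = (match * distance.pow(2)
                     + (1.0 - match) * F.relu(margin - distance).pow(2)).mean()
    return category_loss + alpha * distance_loss
```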
16. An electronic device comprising a processor and a memory, the memory storing an application program, the processor being configured to run the application program in the memory to perform the steps of the text processing method according to any one of claims 1 to 14.
17. A computer program product comprising a computer program or instructions, wherein the computer program or instructions, when executed by a processor, implement the steps of the text processing method according to any one of claims 1 to 14.
18. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the text processing method according to any one of claims 1 to 14.
CN202111351720.9A 2021-11-16 2021-11-16 Text processing method and device, electronic equipment and computer readable storage medium Pending CN114328800A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111351720.9A CN114328800A (en) 2021-11-16 2021-11-16 Text processing method and device, electronic equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111351720.9A CN114328800A (en) 2021-11-16 2021-11-16 Text processing method and device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN114328800A (en) 2022-04-12

Family

ID=81045585

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111351720.9A Pending CN114328800A (en) 2021-11-16 2021-11-16 Text processing method and device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114328800A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114547087A (en) * 2022-04-27 2022-05-27 湖南正宇软件技术开发有限公司 Method, device, equipment and medium for automatically identifying proposal and generating report
CN114547087B (en) * 2022-04-27 2022-07-26 湖南正宇软件技术开发有限公司 Method, device, equipment and medium for automatically identifying proposal and generating report
CN114579869A (en) * 2022-05-05 2022-06-03 腾讯科技(深圳)有限公司 Model training method and related product
CN114579869B (en) * 2022-05-05 2022-07-22 腾讯科技(深圳)有限公司 Model training method and related product

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN112507715A (en) Method, device, equipment and storage medium for determining incidence relation between entities
CN110968684A (en) Information processing method, device, equipment and storage medium
Bisandu et al. Clustering news articles using efficient similarity measure and N-grams
CN108664599B (en) Intelligent question-answering method and device, intelligent question-answering server and storage medium
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN112559684A (en) Keyword extraction and information retrieval method
CN110633366A (en) Short text classification method, device and storage medium
CN114238573B (en) Text countercheck sample-based information pushing method and device
CN112115232A (en) Data error correction method and device and server
CN112507091A (en) Method, device, equipment and storage medium for retrieving information
CN111832290A (en) Model training method and device for determining text relevancy, electronic equipment and readable storage medium
CN112148881A (en) Method and apparatus for outputting information
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN112506864A (en) File retrieval method and device, electronic equipment and readable storage medium
CN117435685A (en) Document retrieval method, document retrieval device, computer equipment, storage medium and product
Gupta et al. Songs recommendation using context-based semantic similarity between lyrics
CN110688559A (en) Retrieval method and device
CN114201622B (en) Method and device for acquiring event information, electronic equipment and storage medium
CN114238735B (en) Intelligent internet data acquisition method
CN112925912B (en) Text processing method, synonymous text recall method and apparatus
Yafooz et al. Enhancing multi-class web video categorization model using machine and deep learning approaches
CN111767404B (en) Event mining method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination