CN113869060A - Semantic data processing method and search method and device - Google Patents

Semantic data processing method and search method and device

Info

Publication number
CN113869060A
Authority
CN
China
Prior art keywords
training
sequence
data
sequences
search
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111115438.0A
Other languages
Chinese (zh)
Inventor
程鸣权
徐伟
刘欢
李雅楠
王海威
陈坤斌
和为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111115438.0A priority Critical patent/CN113869060A/en
Publication of CN113869060A publication Critical patent/CN113869060A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a semantic data processing method, a search method and a device, relates to the field of artificial intelligence, and further relates to the fields of intelligent search, deep learning, natural language processing and the like. The specific implementation scheme is as follows: the semantic data processing method is divided into two stages. In the first stage, contrastive learning is performed on a large amount of unsupervised corpus data of the target field to improve the sequence representation effect of a pre-training coding model. In the second stage, hard-to-distinguish and easy-to-distinguish semi-supervised training data are constructed by mining historical search data, and a domain text semantic matching model is trained, wherein the text semantic matching model comprises the trained pre-training coding model and a loss function. Therefore, the semantic data processing method can improve the semantic matching effect of the target field and reduce the manual labeling cost of the supervised data.

Description

Semantic data processing method and search method and device
Technical Field
The application relates to the field of artificial intelligence, further relates to the fields of intelligent search, deep learning, natural language processing and the like, and particularly relates to a semantic data processing method, a search method, and a corresponding device.
Background
The in-enterprise search process of a user can be regarded as a matching process between the user's input question (query) and the enterprise's existing articles, and the purpose of the search is to obtain the articles the user wants to find. The overall search process is therefore generally divided into a recall module and a ranking module, and the common recall methods are literal text recall and semantic text recall.
A text semantic matching model is used in the text semantic recall module; the better the model performs, the more relevant the recalled articles are to the query, and the faster the user finds the desired article. Improving the text semantic matching model is therefore very helpful for improving the user's search satisfaction.
Disclosure of Invention
The application provides a semantic data processing method, a search method, an apparatus, a device and a storage medium, so as to improve the text semantic matching effect of a target field.
According to a first aspect of the present application, there is provided a semantic data processing method applied to an intelligent search system, including:
constructing unsupervised training data according to historical search data of a target field, wherein the unsupervised training data comprises a plurality of first training samples;
training a pre-training coding model through the unsupervised training data to obtain a trained pre-training coding model, wherein the pre-training coding model has a data amplification function;
constructing semi-supervised training data according to historical search data of a target field, wherein the semi-supervised training data comprises a plurality of second training samples;
and training a text semantic matching model through the semi-supervised training data to obtain a trained text semantic matching model, wherein the text semantic matching model comprises the trained pre-training coding model and a loss function, and a vector output by the trained pre-training coding model is used as an input of the loss function.
According to a second aspect of the present application, there is provided a search method comprising:
receiving a search request of a terminal, and acquiring a search question in the search request;
inputting the search question into a pre-trained coding model to obtain a first vector; the pre-training coding model is a pre-training coding model in the text semantic matching model according to the first aspect;
obtaining respective second vectors of a plurality of article topics; the second vector is obtained by inputting the article theme into a pre-trained coding model which is trained in advance, wherein the pre-trained coding model is a pre-trained coding model in the text semantic matching model according to the first aspect;
calculating the similarity between the first vector and the second vector, and determining a target article topic with the similarity meeting a preset condition from the plurality of article topics;
and returning the article corresponding to the target article theme to the terminal.
According to a third aspect of the present application, there is provided a semantic data processing apparatus, including:
the system comprises a first construction module, a second construction module and a third construction module, wherein the first construction module is used for constructing unsupervised training data according to historical search data of a target field, and the unsupervised training data comprises a plurality of first training samples;
the first training module is used for training a pre-training coding model through the unsupervised training data to obtain the trained pre-training coding model, wherein the pre-training coding model has a data amplification function;
the second construction module is used for constructing semi-supervised training data according to historical search data of the target field, and the semi-supervised training data comprise a plurality of second training samples;
and the second training module is used for training a text semantic matching model through the semi-supervised training data to obtain a trained text semantic matching model, wherein the text semantic matching model comprises the trained pre-training coding model and a loss function, and a vector output by the trained pre-training coding model is used as an input of the loss function.
According to a fourth aspect of the present application, there is provided a search apparatus comprising:
the receiving module is used for receiving a search request of a terminal and acquiring a search question in the search request;
the coding module is used for inputting the search question into a pre-trained coding model which is trained in advance to obtain a first vector; the pre-training coding model is a pre-training coding model in the text semantic matching model of the first aspect;
the acquisition module is used for acquiring respective second vectors of a plurality of article topics; the second vector is obtained by inputting the article theme into a pre-trained coding model which is trained in advance, wherein the pre-trained coding model is a pre-trained coding model in the text semantic matching model of the first aspect;
the calculation module is used for calculating the similarity between the first vector and the second vector and determining a target article topic of which the similarity meets a preset condition from the plurality of article topics;
a returning module for returning the article corresponding to the target article theme to the terminal.
According to a fifth aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect or to enable the at least one processor to perform the method of the second aspect.
According to a sixth aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of the first aspect or to enable at least one processor to perform the method of the second aspect.
According to a seventh aspect of the present application, there is provided a computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of the first aspect or enables the at least one processor to perform the method of the second aspect.
According to the semantic data processing method, the search method, the apparatus, the device and the storage medium of the present application, the semantic matching effect of the target field can be improved, and the manual labeling cost of the supervised data is reduced.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow diagram illustrating a method for semantic data processing according to an embodiment of the present application;
FIG. 2 is a block diagram of semantic data processing according to an embodiment of the present application;
FIG. 3 is a block diagram of a SimCSE model;
FIG. 4 is a block diagram of a structure of a text semantic matching model according to an embodiment of the present application;
FIG. 5 is a flow chart diagram of a semantic data processing method according to another embodiment of the present application;
FIG. 6 is a schematic flow chart diagram of a search method according to an embodiment of the present application;
FIG. 7 is a block diagram of a semantic data processing apparatus according to an embodiment of the present application;
FIG. 8 is a block diagram of a search apparatus according to an embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing the semantic data processing method according to the embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the terms "first," "second," and the like in this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or otherwise described herein.
At present, the semantic matching model used in the semantic recall module is mainly constructed in two ways: the first is to directly use a pre-training model as the semantic matching model; the second is to train a deep model (including a pre-training model) with manually labeled supervised domain corpora to obtain the semantic matching model.
The problems of the first way, directly using the pre-training model as the semantic matching model, include:
Semantic matching is implemented by representing two sequences as two vectors and measuring their semantic relatedness by calculating the similarity of the two vectors. However, sequence representations obtained directly from a pre-training model exhibit a "collapse" phenomenon (all sequences tend to be encoded into a small region of the space, so that sequence pairs with completely unrelated semantics still receive high similarity scores); therefore, if the pre-training model is directly used as the semantic matching model, many sequence pairs with completely unrelated semantics will be recalled. In addition, the pre-training model is trained on a large amount of general corpora, whose distribution differs greatly from the corpora of the enterprise field, so directly using the pre-training model gives a poor in-domain semantic matching effect.
The problems of the second way, training a deep model with manually labeled supervised domain corpora to obtain the semantic matching model, include:
The supervised domain corpora are expensive to obtain and require costly manual labeling.
In order to improve the effect of the text semantic matching model, domain fine-tuning (fine-tune) needs to be performed with supervised data, but supervised data are costly to acquire; without domain fine-tuning, the effect of the pre-training model cannot meet the requirements. Therefore, to solve the above problems, embodiments of the present application provide a semantic data processing method for in-enterprise search, which can greatly improve the effect of the semantic matching model and greatly reduce the manual labeling cost of supervised data.
The semantic data processing method provided by the embodiments of the present application is applied to an intelligent search system, which may be an enterprise knowledge search system, an enterprise knowledge question-answering system, a knowledge recommendation system, or the like. In the method, contrastive learning is first performed on a large amount of unsupervised corpus data of the target field to obtain sentence vectors better suited to the target field; then, hard-to-distinguish and easy-to-distinguish semi-supervised training samples are mined from in-enterprise search click logs, and a text semantic matching model with stronger generalization and better suited to the enterprise field is obtained by training with an optimized pair-wise loss function. Fig. 1 is a flowchart of a semantic data processing method according to an embodiment of the present application. The semantic data processing method of the embodiment of the present application is applicable to the semantic data processing apparatus of the embodiment of the present application, and the apparatus may be configured on an electronic device. As shown in fig. 1, the semantic data processing method may include the following steps.
S101, establishing unsupervised training data according to historical search data of a target field, wherein the unsupervised training data comprises a plurality of first training samples;
the first stage of the semantic data processing method according to the embodiment of the present application is to improve the sequence representation effect in the field of the pre-training coding model. The samples in the same batch are subjected to data amplification through contrast learning, the same data and the data after the data amplification are regarded as positive samples, and all other samples and the data in the batch are regarded as negative samples.
S102, training a pre-training coding model through the unsupervised training data to obtain a trained pre-training coding model, wherein the pre-training coding model has a data amplification function;
the pre-training coding model is trained through the unsupervised training data obtained in the S101, and sentence expression effects in the target field of the pre-training coding model are improved through the data amplification function of the pre-training coding model. Namely, the obtained unsupervised training data is used for comparison and learning so as to improve the sentence expression effect in the target field of the pre-training coding model.
There are many ways to perform this data amplification. As an example, a sequence A is input, and a closely related sequence A1 is generated through various transformations such as translation, deletion, or dropout.
Dropout can be used as a trick for training deep neural networks. By ignoring a portion of the feature detectors in each training batch (setting a portion of the hidden-layer node values to 0), the overfitting phenomenon can be significantly reduced. This approach reduces the interactions between feature detectors (hidden nodes), i.e., the situation in which some detectors can only work by relying on others.
Put simply, dropout means that during forward propagation, the activation value of a neuron stops working (is set to zero) with a certain probability p, which makes the model generalize better because it no longer depends too heavily on particular local features.
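As an illustration only (not part of the original disclosure), the behaviour described above can be reproduced with a standard dropout layer; the framework and the parameter value below are assumptions:

```python
# A tiny illustration of dropout: in train mode the same input gives different
# outputs because a random subset of activations is zeroed with probability p
# on each forward pass; at inference time dropout is disabled.
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.1)   # each activation is zeroed with probability p = 0.1
x = torch.ones(1, 8)

dropout.train()
print(dropout(x))   # some entries are 0, the rest are rescaled by 1/(1-p)
print(dropout(x))   # a different random mask, hence a different output

dropout.eval()
print(dropout(x))   # in eval mode dropout is a no-op: the output equals the input
```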
As a possible implementation, data amplification is carried out in a dropout manner, and the training loss then reduces the distance between positive samples and enlarges the distance between negative samples, thereby improving the sequence representation effect of the pre-training model.
Optionally, the pre-training coding model includes a SimCSE (Simple Contrastive Learning of Sentence Embeddings, i.e., sentence vectors obtained through contrastive learning) model with a dropout function, and the trained SimCSE model is obtained by training the SimCSE model with the unsupervised training data. The structure of the unsupervised SimCSE model is shown in fig. 3, where sentence_1 represents "sequence 1" in a batch of samples. Since the SimCSE model has a dropout function, the 2 embedding vectors obtained after the same "sequence 1" passes twice through the SimCSE model with dropout are different, which achieves the purpose of data amplification; to distinguish the 2 embedding vectors, they are denoted h1' and h1'' respectively.
The loss function of the unsupervised SimCSE model is as follows:
l_i = -log( exp(sim(h_i, h_i')/τ) / Σ_{j=1..N} exp(sim(h_i, h_j')/τ) )

wherein τ is a temperature hyper-parameter, h_i and h_i' are the two embedding vectors obtained from the same sequence i through two dropout-amplified forward passes, N is the number of sequences in the batch, and sim(h_1, h_2) = h_1·h_2 / (||h_1||·||h_2||) is the cosine similarity between two vectors.
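As an illustrative sketch only, and assuming a HuggingFace-style BERT encoder and tokenizer (which the disclosure does not prescribe), one possible implementation of the unsupervised SimCSE training step and the loss above is:

```python
# Sketch of the unsupervised SimCSE step: the same batch is encoded twice, the
# random dropout masks make the two embeddings of each sequence differ (data
# amplification), and the contrastive loss treats the second view of the same
# sequence as the positive and all other sequences in the batch as negatives.
import torch
import torch.nn.functional as F

def simcse_unsupervised_loss(encoder, tokenizer, sentences, tau=0.05, device="cpu"):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to(device)
    encoder.train()                                   # keep dropout active
    h1 = encoder(**batch).last_hidden_state[:, 0]     # first view, [CLS] vectors (N, d)
    h2 = encoder(**batch).last_hidden_state[:, 0]     # second view of the same sentences
    # Cosine similarity matrix between the two views, scaled by the temperature tau.
    sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / tau  # (N, N)
    # For sequence i the positive lies on the diagonal; every other sequence in
    # the batch acts as a negative, exactly as in the loss given above.
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```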
s103, constructing semi-supervised training data according to historical search data of the target field, wherein the semi-supervised training data comprise a plurality of second training samples;
the second stage of the semantic data processing of the embodiment of the present application is to train the text semantic matching model, and therefore, a plurality of second training samples for training the text semantic matching model need to be acquired. And S104, training a text semantic matching model through the semi-supervised training data to obtain a trained text semantic matching model, wherein the text semantic matching model comprises the trained pre-training coding model and a loss function, and a vector output by the trained pre-training coding model is used as an input of the loss function.
As a possible implementation, the combination of the trained pre-training coding model and the loss function is used as the text semantic matching model. The text semantic matching model is trained with the second training samples, and the parameters of the trained pre-training coding model are continuously adjusted according to the output value of the loss function, finally yielding the trained text semantic matching model.
Optionally, the loss function is a pair-wise loss function.
The loss function of Pair-wise is expressed as follows:
L = max{0, S_i(Q, title-) - S_i(Q, title+) + m}
where S_i(Q, title-) denotes the similarity between the sequence (query) and the negative sample vector, S_i(Q, title+) denotes the similarity between the sequence and the positive sample vector, and m denotes the boundary (margin) value of the loss function.
As shown in fig. 4, the text semantic matching model includes the trained pre-training coding model and a Pair-wise loss (loss function of Pair-wise).
The final output is a similarity score between 0 and 1.
Pair-wise is actually a loss function used within ranking. In a specific application scenario, the prediction results obtained at the recall stage subsequently enter the ranking stage, so choosing a loss function from ranking makes the subsequent ranking work better.
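As a minimal sketch of the pair-wise loss given above (the cosine-similarity scoring and the variable names are illustrative assumptions):

```python
# Pair-wise margin loss L = max(0, S(Q, title-) - S(Q, title+) + m) computed on
# batches of query / positive-title / negative-title vectors produced by the
# trained pre-training coding model.
import torch
import torch.nn.functional as F

def pairwise_margin_loss(q_vec, pos_title_vec, neg_title_vec, margin=0.2):
    s_pos = F.cosine_similarity(q_vec, pos_title_vec, dim=-1)  # S(Q, title+)
    s_neg = F.cosine_similarity(q_vec, neg_title_vec, dim=-1)  # S(Q, title-)
    # The loss vanishes once the positive title outscores the negative title by
    # at least the margin m; otherwise the remaining gap is penalized.
    return torch.clamp(s_neg - s_pos + margin, min=0).mean()
```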
The semantic data processing method of the embodiment of the present application comprises two stages, as shown in fig. 2. In the first stage, contrastive learning is performed with the unsupervised training data to improve the sentence representation effect of the pre-training coding model in the target field. In the second stage, semi-supervised training data are first mined, and the text semantic matching model consisting of the trained pre-training coding model and the loss function is trained with the semi-supervised training data, finally improving the text semantic matching effect of the target field.
It should be noted that the unsupervised training data may be constructed using historical search data, wherein the unsupervised training data may include a plurality of first training samples. As a possible implementation manner, the implementation manner of constructing the unsupervised training data according to the historical search data of the target field may be as follows: determining a plurality of search questions and a plurality of article topics according to historical search data, acquiring a plurality of first sequences from the plurality of search questions, and forming a first training sample by the current first sequence in the plurality of first sequences and the current first sequence or other first sequences; and acquiring a plurality of second sequences from a plurality of article topics, and forming a first training sample by the current second sequence in the plurality of second sequences and the current second sequence or other second sequences.
It should be noted that the sequence mentioned in the present application may be words, phrases, sentences, or the like.
For example, in an actual application scenario, as shown in fig. 2, the search question (query) entered by a user and the title of the article clicked by the user can be extracted as training data from the in-enterprise search click logs; that is, a training sample may include the query input by the user and the title clicked by the user. The obtained query and title data are simply cleaned, meaningless queries and titles are removed, and the remaining data form the unsupervised training data. The query and the title are used as samples separately: a query and itself are used as a positive pair, while the query and other queries form negative pairs. Similarly, a title and itself are taken as a positive pair, while the title and other titles constitute negative pairs.
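A rough sketch of this construction step is shown below; the click-log field names and the cleaning rule are assumptions for illustration, not the actual log schema:

```python
# Build the unsupervised (contrastive-learning) corpus from the search click log:
# queries and clicked titles are extracted, lightly cleaned, and kept as separate
# sequences; each sequence later forms a positive pair with its own
# dropout-amplified copy and negative pairs with the rest of its batch.
def build_unsupervised_corpus(click_log_records, min_len=2):
    sequences = []
    for record in click_log_records:
        query = record.get("query", "").strip()
        title = record.get("clicked_title", "").strip()
        # Simple cleaning: drop empty or meaningless (too short) sequences.
        if len(query) >= min_len:
            sequences.append(query)
        if len(title) >= min_len:
            sequences.append(title)
    return list(dict.fromkeys(sequences))   # deduplicate while preserving order
```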
It should be noted that the semi-supervised training data may be constructed by using the historical search data, wherein the semi-supervised training data may include a plurality of second training samples. As a possible implementation manner, the implementation manner of constructing the semi-supervised training data according to the historical search data of the target field may be as follows: determining a plurality of search questions, a plurality of article topics and click relations between the search questions and the article topics according to historical search data, acquiring a plurality of first sequences from the search questions, acquiring a plurality of second sequences from the article topics, and taking sequence pairs formed by the first sequences and the second sequences as second training samples according to the click relations.
It should be noted that the second training sample includes sequence-positive samples and sequence-negative samples.
As a possible implementation, the method for acquiring the sequence pair positive samples includes:
acquiring a plurality of sequence pairs of which the first sequence and the second sequence have a click relation;
determining a second target sequence pair of the plurality of sequence pairs that satisfies a click frequency range;
and cleaning the second target sequence pair, and taking the cleaned second target sequence pair as a sequence pair positive sample.
As a possible implementation, the method for acquiring the sequence pair negative samples includes:
and taking second training samples except for the sequence pair positive samples in the plurality of second training samples as simple sequence pair negative samples.
For example, in an actual application scenario, the query searched by a user and the title clicked by the user can be extracted as training data from the in-enterprise search click logs; the query and the title form a sequence pair (query, title), which is used as a second training sample. As shown in fig. 4, (query, title+) composed of a query and the title it clicked is used as a sequence pair positive sample, and (query, title-) composed of the query and other titles is used as a sequence pair negative sample. Preferably, a query with a high click frequency and the title it clicked are used as a sequence pair positive sample.
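A possible sketch of this mining step is given below; the field names, the click-frequency threshold, and the negative-sampling rule are illustrative assumptions:

```python
# Build semi-supervised sequence pairs from the click log: frequently clicked
# (query, title) pairs become sequence pair positive samples, and a query
# combined with titles it never clicked yields simple sequence pair negative samples.
import random
from collections import Counter

def build_semi_supervised_pairs(click_log_records, min_click_freq=3, neg_per_query=1):
    pair_counts = Counter((r["query"], r["clicked_title"]) for r in click_log_records)
    all_titles = list({r["clicked_title"] for r in click_log_records})
    positives, negatives = [], []
    for (query, title), freq in pair_counts.items():
        if freq < min_click_freq:                 # keep only frequently clicked pairs
            continue
        positives.append((query, title))          # (query, title+)
        clicked = {t for (q, t) in pair_counts if q == query}
        candidates = [t for t in all_titles if t not in clicked]
        # Randomly sampled unclicked titles serve as (query, title-) negatives.
        negatives.extend((query, t) for t in random.sample(candidates, min(neg_per_query, len(candidates))))
    return positives, negatives
```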
Fig. 5 is a flowchart of a semantic data processing method according to another embodiment of the present application. This embodiment is a further optimization of the embodiment shown in fig. 1. As shown in fig. 5, the semantic data processing method may include the following steps.
S501, according to historical search data of the target field, unsupervised training data is constructed, and the unsupervised training data comprises a plurality of first training samples.
S502, training a pre-training coding model through the unsupervised training data to obtain the trained pre-training coding model, wherein the pre-training coding model has a data amplification function.
It should be noted that the specific implementation processes and principles of steps S501 to S502 are the same as those of steps S101 to S102, and are not described again.
S503, constructing semi-supervised training data according to historical search data of the target field, wherein the semi-supervised training data comprise a plurality of second training samples, and the second training samples comprise sequence pair positive samples, simple sequence pair negative samples, and hard-to-distinguish sequence pair negative samples.
Step S503 is further optimized on the basis of step S103. In some embodiments, optionally, the negative samples include simple sequence pair negative samples and hard-to-distinguish sequence pair negative samples; that is, the second training samples further include hard-to-distinguish sequence pair negative samples.
Optionally, the method for acquiring the hard-to-distinguish sequence pair negative samples includes:
calculating the similarity score of two sequences in each sequence pair through a pre-training model, and screening out a first target sequence pair with the similarity score meeting a preset range;
and performing word segmentation on the two sequences in each first target sequence pair respectively, and taking the first target sequence pairs that contain the same segmented words as hard-to-distinguish sequence pair negative samples.
When screening out the target sequence pairs whose similarity scores meet the preset range, as an example, the preset range may be [0.7, 0.9].
As an example, the ratio of hard-to-distinguish sequence pair negative samples to simple sequence pair negative samples is 1:100, and the ratio of all sequence pair negative samples to sequence pair positive samples is about 1:1.
It should be noted that the pre-training model may be any pre-training model capable of converting a text sequence into a vector, such as BERT or Baidu ERNIE. The pre-training model takes one sequence as input and generates one vector; it is not a semantic matcher in the traditional sense that takes two sequences as input and outputs a score. For example, a query is input to obtain one vector, a title is input to obtain another vector, and a similarity score is then computed from the two vectors. More models may be used here without limitation.
The method mines not only the simple sequence pair negative samples that are easy for the model to distinguish, but also the hard-to-distinguish sequence pair negative samples that are difficult for the model to distinguish, which can effectively improve the generalization effect of the text semantic matching model.
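One way this mining could be sketched is shown below; the encode() and segment() helpers are assumed placeholders (e.g. a BERT-style encoder and any word segmenter), and "the same segmented words" is read here as sharing at least one segmented word:

```python
# Mine hard-to-distinguish sequence pair negative samples: keep (query, title)
# pairs whose pre-training-model similarity falls in a medium band such as
# [0.7, 0.9] and whose word segmentations still overlap, i.e. pairs that look
# literally related but are not true click pairs.
import torch.nn.functional as F

def mine_hard_negatives(pairs, encode, segment, low=0.7, high=0.9):
    hard_negatives = []
    for query, title in pairs:
        score = F.cosine_similarity(encode(query), encode(title), dim=-1).item()
        if not (low <= score <= high):            # outside the preset range
            continue
        # A shared token makes the pair superficially similar, which is exactly
        # what makes it hard for the model to distinguish.
        if set(segment(query)) & set(segment(title)):
            hard_negatives.append((query, title))
    return hard_negatives
```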
S504, training a text semantic matching model through the semi-supervised training data to obtain a trained text semantic matching model, wherein the text semantic matching model comprises the trained pre-training coding model and a loss function, and a vector output by the trained pre-training coding model is used as an input of the loss function; the loss function sets different parameter values for the simple negative samples and the hard-to-distinguish negative samples.
Step S504 is further optimized on the basis of step S104. In some embodiments, the boundary parameter m (margin value) of the pair-wise loss function is set to different values for the simple sequence pair negative samples and the hard-to-distinguish sequence pair negative samples, so that the model pays different amounts of attention to the two kinds of negative samples, further improving the generalization effect of the model.
For example, for a simple sequence pair negative sample, the value of m is set a little smaller, and for a hard-to-distinguish sequence pair negative sample, the value of m is set a little larger. A larger m produces a larger loss value, so the model pays more attention to the hard-to-distinguish negative samples when updating its parameters. The semantic data processing method of the embodiment of the present application comprises two stages, as shown in fig. 2. In the first stage, contrastive learning is performed with the unsupervised training data to improve the sentence representation effect of the pre-training coding model in the target field. In the second stage, semi-supervised training data are first mined, and the text semantic matching model consisting of the trained pre-training coding model and the loss function is trained with the semi-supervised training data, finally improving the text semantic matching effect of the target field. Furthermore, in the second stage, both samples that are easy for the model to distinguish and samples that are difficult for the model to distinguish are mined, which can effectively improve the generalization effect of the semantic matching model. The structure of the text semantic matching model is also optimized: different margin values are set for the hard-to-distinguish and the easy-to-distinguish samples, so that the text semantic matching model pays different amounts of attention to them, further improving the effect of the semantic matching model.
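A small sketch of this margin choice follows; the concrete margin values are illustrative assumptions, not values fixed by the disclosure:

```python
# Pair-wise loss with a per-sample margin m: hard-to-distinguish negatives get a
# larger m, hence a larger loss and more attention during the parameter update,
# while simple negatives get a smaller m.
import torch
import torch.nn.functional as F

def margin_aware_pairwise_loss(q_vec, pos_vec, neg_vec, is_hard, m_easy=0.1, m_hard=0.3):
    s_pos = F.cosine_similarity(q_vec, pos_vec, dim=-1)
    s_neg = F.cosine_similarity(q_vec, neg_vec, dim=-1)
    # is_hard is a boolean tensor marking hard-to-distinguish negatives.
    margin = torch.where(is_hard, torch.full_like(s_pos, m_hard), torch.full_like(s_pos, m_easy))
    return torch.clamp(s_neg - s_pos + margin, min=0).mean()
```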
The semantic data processing method provided by the embodiments of the present application is independent of specific products; it mainly improves the effect of the in-domain semantic matching model, and the constructed in-domain semantic matching model can be applied in many aspects such as enterprise knowledge search, enterprise knowledge question answering, and knowledge recommendation.
By adopting the semantic data processing method of the embodiment, the trained text semantic matching model can be applied to an intelligent search system. According to the search method provided by the embodiment of the application, the method adopts a pre-training coding model in a trained text semantic matching model obtained according to the semantic data processing method, and recalls corresponding search information according to a search request.
According to the semantic data processing method, the trained text semantic matching model is obtained, and a pre-trained coding model in the trained text semantic matching model can be applied to a text semantic recall module in a search process. Fig. 6 is a flow chart of a search method according to another embodiment of the present application. It should be noted that the execution subject of the search method may be an electronic device such as a server. As shown in fig. 6, the search method may include the following steps.
S601, receiving a search request of a terminal, and acquiring a search question in the search request;
s602, inputting the search question into a pre-training coding model which is trained in advance to obtain a first vector; the pre-training coding model is a pre-training coding model in the trained text semantic matching model of the above embodiment.
And coding the search question to obtain the first vector through a pre-training coding model in the trained text semantic matching model of the embodiment.
S603, obtaining respective second vectors of a plurality of article topics; the second vector is obtained by inputting the article theme into a pre-trained coding model which is trained in advance, wherein the pre-trained coding model is a pre-trained coding model in the trained text semantic matching model of the embodiment;
as a possible implementation manner, the second vectors of the multiple article topics are obtained in advance and stored, and when the second vectors are obtained in the search process, the second vectors only need to be obtained from the storage path of the second vectors.
S604, calculating the similarity between the first vector and the second vector, and determining a target article topic with the similarity meeting a preset condition from the plurality of article topics;
as a possible implementation manner, the similarity between the first vector and the second vector is calculated by using a cosine similarity method, and other similarity calculation methods may also be used, which is not limited herein.
And S605, returning the article corresponding to the target article theme to the terminal.
According to the search method, the pre-training coding model in the trained text semantic matching model of the above embodiments is used to encode the search question and the article topics, which improves the matching degree between the search question and the article topics and is of great help in improving the user's search satisfaction.
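A minimal sketch of this recall step is given below; the encode() helper, the pre-computed title vectors, and the top-k selection rule are assumptions for illustration:

```python
# Recall step of the search method: encode the search question into the first
# vector, compare it with the pre-computed second vectors of all article titles
# by cosine similarity, and return the top-scoring article topics.
import torch
import torch.nn.functional as F

def recall_articles(query, encode, title_vectors, titles, top_k=10):
    q_vec = encode(query)                                      # first vector
    sims = F.cosine_similarity(q_vec, title_vectors, dim=-1)   # one score per title
    scores, indices = torch.topk(sims, k=min(top_k, len(titles)))
    # Titles whose similarity satisfies the preset condition (here: top-k) are
    # the target article topics; the corresponding articles are then returned.
    return [(titles[i], scores[j].item()) for j, i in enumerate(indices.tolist())]
```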
Corresponding to the semantic data processing method provided by the above embodiments, the embodiments of the present application further provide a semantic data processing apparatus. Fig. 7 is a schematic structural diagram of a semantic data processing apparatus according to an embodiment of the present application. As shown in fig. 7, the semantic data processing apparatus 700 may include: a first construction module 710, a first training module 720, a second construction module 730, and a second training module 740.
Specifically, the first constructing module 710 is configured to construct unsupervised training data according to historical search data of a target field, where the unsupervised training data includes a plurality of first training samples;
a first training module 720, configured to train a pre-training coding model through the unsupervised training data to obtain a trained pre-training coding model, where the pre-training coding model has a data amplification function;
a second constructing module 730, configured to construct semi-supervised training data according to historical search data of the target field, where the semi-supervised training data includes a plurality of second training samples;
and a second training module 740, configured to train a text semantic matching model through the semi-supervised training data to obtain a trained text semantic matching model, where the text semantic matching model includes the trained pre-trained coding model and a loss function, and a vector output by the trained pre-trained coding model is used as an input of the loss function.
In some embodiments of the present application, the first building module 710 is specifically configured to:
determining a plurality of search questions and a plurality of article topics according to the historical search data, acquiring a plurality of first sequences from the plurality of search questions, and forming a first training sample by the current first sequence in the plurality of first sequences and the current first sequence or other first sequences;
and acquiring a plurality of second sequences from the plurality of article topics, and forming a first training sample by the current second sequence in the plurality of second sequences and the current second sequence or other second sequences.
In some embodiments of the present application, the second building module 730 is specifically configured to:
determining a plurality of search questions, a plurality of article topics and click relations between the search questions and the article topics according to historical search data, acquiring a plurality of first sequences from the search questions, acquiring a plurality of second sequences from the article topics, and taking sequence pairs formed by the first sequences and the second sequences as second training samples according to the click relations.
In some embodiments of the present application, the second training samples include hard-to-distinguish sequence pair negative samples, and when the second construction module 730 acquires the hard-to-distinguish sequence pair negative samples, it is specifically configured to:
calculating the similarity score of two sequences in each sequence pair through a pre-training model, and screening out a first target sequence pair with the similarity score meeting a preset range;
and perform word segmentation on the two sequences in each first target sequence pair respectively, and take the first target sequence pairs that contain the same segmented words as hard-to-distinguish sequence pair negative samples.
In some embodiments of the present application, the second training samples include sequence pair positive samples and sequence pair negative samples, and when the second construction module 730 acquires the sequence pair positive samples, it is specifically configured to:
acquiring a plurality of sequence pairs of which the first sequence and the second sequence have a click relation;
determining a second target sequence pair of the plurality of sequence pairs that satisfies a click frequency range;
clean the second target sequence pair, and take the cleaned second target sequence pair as a sequence pair positive sample; and
when the second constructing module 730 obtains the sequence pair negative sample, it is specifically configured to:
and taking second training samples except for the sequence pair positive samples in the plurality of second training samples as simple sequence pair negative samples.
In some embodiments of the present application, the loss function is a pair-wise loss function, and the second training module 740 is further configured to:
select different values of the boundary parameter m of the pair-wise loss function for the simple sequence pair negative samples and the hard-to-distinguish sequence pair negative samples.
The semantic data processing method comprises two stages: in the first stage, contrastive learning is performed based on the unsupervised data to improve the sequence representation effect of the pre-training model; in the second stage, hard-to-distinguish and easy-to-distinguish semi-supervised training data are mined and constructed based on the in-enterprise search click logs, and the in-domain semantic matching model is trained based on the optimized pair-wise loss function.
With regard to the apparatus in the above embodiment, the specific manner and effect of the operations performed by the respective modules have been described in detail in the embodiment related to the construction method, and will not be described in detail here.
By adopting the semantic data processing method of the embodiment, the trained text semantic matching model can be applied to an intelligent search system. According to the embodiment of the application, a search system is provided, and the search system comprises a search unit, wherein the search unit is used for recalling corresponding search information according to a search request by adopting a pre-training coding model in a trained text semantic matching model obtained according to the semantic data processing method.
Corresponding to the searching method provided by the embodiment, the embodiment of the application also provides a searching device. Fig. 8 is a schematic structural diagram of a search apparatus according to an embodiment of the present application. As shown in fig. 8, the search apparatus 800 may include: a receiving module 801, an encoding module 802, an obtaining module 803, a calculating module 804 and a returning module 805.
Specifically, the receiving module 801 is configured to receive a search request of a terminal, and acquire the search question in the search request;
the encoding module 802 is configured to input the search question into a pre-trained encoding model to obtain a first vector; the pre-training coding model is a pre-training coding model in the trained text semantic matching model of the construction method embodiment;
an obtaining module 803, configured to obtain respective second vectors of multiple article topics; the second vector is obtained by inputting the article theme into a pre-trained coding model which is trained in advance, wherein the pre-trained coding model is a pre-trained coding model in the trained text semantic matching model of the method embodiment;
a calculating module 804, configured to calculate a similarity between the first vector and the second vector, and determine, from the plurality of article topics, a target article topic for which the similarity satisfies a preset condition;
a returning module 805, configured to return the article corresponding to the target article topic to the terminal.
With regard to the searching apparatus in the above embodiment, the specific manner and effect of the operations performed by the respective modules have been described in detail in the embodiment related to the searching method, and will not be described in detail here.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
As shown in fig. 9, the electronic device is a block diagram of an electronic device according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 901, memory 902, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 9 illustrates an example with a single processor 901.
Memory 902 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to execute the semantic data processing method provided by the application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the method of processing semantic data provided by the present application.
The memory 902, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the semantic data processing method in the embodiment of the present application (for example, the first construction module 710, the first training module 720, the second construction module 730, and the second training module 740 shown in fig. 7, or, for example, the receiving module 801, the encoding module 802, the obtaining module 803, the calculating module 804, and the returning module 805 shown in fig. 8). The processor 901 executes various functional applications of the server and data processing, i.e., implements the semantic data processing method in the above method embodiments, by executing the non-transitory software programs, instructions, and modules stored in the memory 902.
The memory 902 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the processing electronics of the semantic data, and the like. Further, the memory 902 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 902 may optionally include memory located remotely from the processor 901, which may be connected to semantic data processing electronics via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the semantic data processing method may further include: an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903 and the output device 904 may be connected by a bus or other means, and fig. 9 illustrates an example of a connection by a bus.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing electronics for semantic data, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 904 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (17)

1. A semantic data processing method is applied to an intelligent search system, and comprises the following steps:
constructing unsupervised training data according to historical search data of a target field, wherein the unsupervised training data comprises a plurality of first training samples;
training a pre-training coding model through the unsupervised training data to obtain a trained pre-training coding model, wherein the pre-training coding model has a data amplification function;
constructing semi-supervised training data according to historical search data of a target field, wherein the semi-supervised training data comprises a plurality of second training samples;
and training a text semantic matching model through the semi-supervised training data to obtain a trained text semantic matching model, wherein the text semantic matching model comprises the trained pre-training coding model and a loss function, and a vector output by the trained pre-training coding model is used as an input of the loss function.
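The following is a minimal Python sketch of the two-stage flow recited in claim 1, added for orientation only; every callable (pretrain_step, encode, loss_fn, update_step) is a placeholder standing in for the pre-training coding model, the text semantic matching model and its loss function, none of which the claim fixes at code level.

from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def train_matching_pipeline(
    unsupervised_samples: List[Tuple[str, str]],
    semi_supervised_samples: List[Tuple[str, str, int]],  # (first sequence, second sequence, label)
    pretrain_step: Callable[[Tuple[str, str]], None],
    encode: Callable[[str], Vector],
    loss_fn: Callable[[Vector, Vector, int], float],
    update_step: Callable[[float], None],
) -> None:
    # Stage 1: adapt the pre-training coding model on the unsupervised pairs
    # (the data amplification happens inside pretrain_step, e.g. by encoding
    # the same sequence twice).
    for sample in unsupervised_samples:
        pretrain_step(sample)
    # Stage 2: train the text semantic matching model on semi-supervised pairs;
    # the loss function takes the encoder's output vectors as its input.
    for left, right, label in semi_supervised_samples:
        loss = loss_fn(encode(left), encode(right), label)
        update_step(loss)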
2. The method of claim 1, wherein the constructing unsupervised training data according to the historical search data of the target field, the unsupervised training data comprising a plurality of first training samples, comprises:
determining a plurality of search questions and a plurality of article topics according to the historical search data, acquiring a plurality of first sequences from the plurality of search questions, and forming a first training sample from a current first sequence of the plurality of first sequences together with the current first sequence itself or with another first sequence;
and acquiring a plurality of second sequences from the plurality of article topics, and forming a first training sample from a current second sequence of the plurality of second sequences together with the current second sequence itself or with another second sequence.
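A minimal Python sketch of the sample construction in claim 2, assuming the search questions and article topics have already been extracted from the historical search data as plain strings; the function name and the self-pairing ratio are illustrative, not taken from the application.

import random
from typing import List, Tuple


def build_first_training_samples(
    search_questions: List[str],
    article_topics: List[str],
    self_pair_ratio: float = 0.5,
) -> List[Tuple[str, str]]:
    # Pair every sequence with itself or with another sequence drawn from the
    # same source, yielding the first training samples of the unsupervised set.
    samples = []
    for sequences in (search_questions, article_topics):
        for i, current in enumerate(sequences):
            if len(sequences) == 1 or random.random() < self_pair_ratio:
                partner = current  # the current sequence paired with itself
            else:
                j = random.choice([k for k in range(len(sequences)) if k != i])
                partner = sequences[j]  # or paired with another sequence
            samples.append((current, partner))
    return samples


if __name__ == "__main__":
    questions = ["how to apply for annual leave", "reimbursement process"]
    topics = ["annual leave policy", "travel expense reimbursement guide"]
    print(build_first_training_samples(questions, topics))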
3. The method of claim 1, wherein the constructing semi-supervised training data according to the historical search data of the target field, the semi-supervised training data comprising a plurality of second training samples, comprises:
determining a plurality of search questions, a plurality of article topics and click relations between the search questions and the article topics according to historical search data, acquiring a plurality of first sequences from the search questions, acquiring a plurality of second sequences from the article topics, and taking sequence pairs formed by the first sequences and the second sequences as second training samples according to the click relations.
4. The method of claim 3, wherein the second training samples comprise sequence pair hard-to-classify negative samples, and the method for obtaining the sequence pair hard-to-classify negative samples comprises:
calculating the similarity score of two sequences in each sequence pair through a pre-training model, and screening out a first target sequence pair with the similarity score meeting a preset range;
and respectively carrying out word segmentation on the two sequences in each first target sequence pair, and taking the first target sequence pairs having the same segmented words as sequence pair hard-to-classify negative samples.
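A hedged Python sketch of the hard-negative mining in claim 4; similarity_fn and tokenize_fn are placeholders for the pre-training model scoring and the word segmentation step, the score range is illustrative, and "the same word segmentation" is read here as the two sequences sharing at least one segmented word.

from typing import Callable, List, Tuple


def mine_hard_negatives(
    sequence_pairs: List[Tuple[str, str]],
    similarity_fn: Callable[[str, str], float],  # similarity score from a pre-training model
    tokenize_fn: Callable[[str], List[str]],     # any word-segmentation function
    score_range: Tuple[float, float] = (0.4, 0.8),
) -> List[Tuple[str, str]]:
    # Keep pairs whose similarity score falls inside the preset range, then keep
    # those whose segmented words overlap as hard-to-classify negative samples.
    low, high = score_range
    hard_negatives = []
    for left, right in sequence_pairs:
        score = similarity_fn(left, right)
        if not (low <= score <= high):
            continue  # outside the preset score range
        if set(tokenize_fn(left)) & set(tokenize_fn(right)):
            hard_negatives.append((left, right))  # lexically close yet still a negative
    return hard_negatives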
5. The method of claim 4, wherein the second training samples comprise sequence pair positive samples and sequence pair negative samples;
the method for obtaining the sequence pair positive samples comprises the following steps:
acquiring a plurality of sequence pairs of which the first sequence and the second sequence have a click relation;
determining a second target sequence pair of the plurality of sequence pairs that satisfies a click frequency range;
cleaning the second target sequence pair, and taking the cleaned second target sequence pair as a sequence pair positive sample;
the method for obtaining the sequence pair negative samples comprises the following steps:
and taking second training samples except for the sequence pair positive samples in the plurality of second training samples as sequence pair negative samples.
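A minimal Python sketch of the positive/negative split in claim 5; the click-frequency bounds and the cleaning step are placeholders, since the claim only requires that positive pairs fall within a click frequency range and be cleaned, and that all remaining second training samples serve as negatives.

from typing import Dict, List, Set, Tuple


def build_second_training_samples(
    click_counts: Dict[Tuple[str, str], int],   # (search question, article topic) -> click frequency
    all_pairs: List[Tuple[str, str]],
    min_clicks: int = 3,
    max_clicks: int = 10_000,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    def clean(pair: Tuple[str, str]) -> Tuple[str, str]:
        question, topic = pair
        return question.strip(), topic.strip()  # placeholder for the cleaning step

    # Positive samples: clicked pairs whose click frequency lies in the preset range.
    positives: Set[Tuple[str, str]] = {
        clean(pair)
        for pair, count in click_counts.items()
        if min_clicks <= count <= max_clicks
    }
    # Negative samples: every other sequence pair among the second training samples.
    negatives = [clean(pair) for pair in all_pairs if clean(pair) not in positives]
    return list(positives), negatives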
6. The method according to claim 5, wherein the loss function is a pair-wise loss function, and the boundary parameter m of the pair-wise loss function is set to different values for the sequence pair negative samples and the sequence pair hard-to-classify negative samples.
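Claim 6 fixes only that the boundary parameter m differs between ordinary and hard-to-classify negative samples; the hinge-style pair-wise form and the concrete margin values in the sketch below are assumptions.

def pairwise_margin_loss(
    pos_score: float,
    neg_score: float,
    is_hard_negative: bool,
    margin_easy: float = 0.3,
    margin_hard: float = 0.1,
) -> float:
    # Hinge-style pair-wise loss in which the boundary parameter m is smaller
    # for hard-to-classify negatives than for ordinary negatives.
    m = margin_hard if is_hard_negative else margin_easy
    return max(0.0, m - (pos_score - neg_score))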
7. A search method, comprising:
receiving a search request from a terminal, and acquiring a search question from the search request;
inputting the search question into a trained pre-training coding model to obtain a first vector, wherein the pre-training coding model is the pre-training coding model in the text semantic matching model according to any one of claims 1 to 6;
obtaining respective second vectors of a plurality of article topics, wherein each second vector is obtained by inputting the corresponding article topic into the trained pre-training coding model, and the pre-training coding model is the pre-training coding model in the text semantic matching model according to any one of claims 1 to 6;
calculating the similarity between the first vector and each second vector, and determining, from the plurality of article topics, a target article topic whose similarity meets a preset condition;
and returning the article corresponding to the target article topic to the terminal.
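A minimal Python sketch of the search flow in claim 7, assuming the second vectors of the article topics are precomputed and cached; encode_fn stands in for the trained pre-training coding model, and cosine similarity with a fixed threshold is only one possible reading of the "preset condition".

from typing import Callable, Dict, List, Sequence


def search(
    query: str,
    topic_vectors: Dict[str, Sequence[float]],    # article topic -> precomputed second vector
    encode_fn: Callable[[str], Sequence[float]],  # trained pre-training coding model
    threshold: float = 0.5,
) -> List[str]:
    def cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(y * y for y in b) ** 0.5
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Encode the search question, score it against every article topic and return
    # the target topics whose similarity satisfies the preset condition.
    first_vector = encode_fn(query)
    scored = [(topic, cosine(first_vector, vector)) for topic, vector in topic_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [topic for topic, score in scored if score >= threshold]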
8. An apparatus for processing semantic data, comprising:
the first construction module is used for constructing unsupervised training data according to historical search data of a target field, and the unsupervised training data comprises a plurality of first training samples;
the first training module is used for training a pre-training coding model through the unsupervised training data to obtain the trained pre-training coding model, wherein the pre-training coding model has a data amplification function;
the second construction module is used for constructing semi-supervised training data according to historical search data of the target field, and the semi-supervised training data comprise a plurality of second training samples;
and the second training module is used for training a text semantic matching model through the semi-supervised training data to obtain a trained text semantic matching model, wherein the text semantic matching model comprises the trained pre-training coding model and a loss function, and a vector output by the trained pre-training coding model is used as an input of the loss function.
9. The apparatus of claim 8, wherein the first construction module is specifically configured to:
determining a plurality of search questions and a plurality of article topics according to the historical search data, acquiring a plurality of first sequences from the plurality of search questions, and forming a first training sample from a current first sequence of the plurality of first sequences together with the current first sequence itself or with another first sequence;
and acquiring a plurality of second sequences from the plurality of article topics, and forming a first training sample from a current second sequence of the plurality of second sequences together with the current second sequence itself or with another second sequence.
10. The apparatus of claim 8, wherein the second construction module is specifically configured to:
determining a plurality of search questions, a plurality of article topics and click relations between the search questions and the article topics according to historical search data, acquiring a plurality of first sequences from the search questions, acquiring a plurality of second sequences from the article topics, and taking sequence pairs formed by the first sequences and the second sequences as second training samples according to the click relations.
11. The apparatus of claim 8, wherein the second training samples comprise sequence pair hard-to-classify negative samples, and the second construction module is specifically configured to, when obtaining the sequence pair hard-to-classify negative samples:
calculating the similarity score of two sequences in each sequence pair through a pre-training model, and screening out a first target sequence pair with the similarity score meeting a preset range;
and respectively carrying out word segmentation on the two sequences in each first target sequence pair, and taking the first target sequence pairs having the same segmented words as sequence pair hard-to-classify negative samples.
12. The apparatus of claim 11, wherein the second training samples comprise sequence pair positive samples and sequence pair negative samples, and the second construction module, when obtaining the sequence pair positive samples, is specifically configured to:
acquiring a plurality of sequence pairs of which the first sequence and the second sequence have a click relation;
determining a second target sequence pair of the plurality of sequence pairs that satisfies a click frequency range;
cleaning the second target sequence pair, and taking the cleaned second target sequence pair as a sequence pair positive sample; and
the second construction module is specifically configured to, when obtaining the sequence pair negative samples:
and taking second training samples except for the sequence pair positive samples in the plurality of second training samples as simple sequence pair negative samples.
13. The apparatus of claim 12, wherein the loss function is a pair-wise loss function, and the second training module is further configured to:
and selecting different values of the boundary parameter m of the pair-wise loss function for the sequence pair negative samples and the sequence pair hard-to-classify negative samples.
14. A search apparatus, comprising:
the receiving module is used for receiving a search request from a terminal and acquiring a search question from the search request;
the coding module is used for inputting the search question into a trained pre-training coding model to obtain a first vector, wherein the pre-training coding model is the pre-training coding model in the text semantic matching model according to any one of claims 1 to 6;
the acquisition module is used for acquiring respective second vectors of a plurality of article topics, wherein each second vector is obtained by inputting the corresponding article topic into the trained pre-training coding model, and the pre-training coding model is the pre-training coding model in the text semantic matching model according to any one of claims 1 to 6;
the calculation module is used for calculating the similarity between the first vector and each second vector, and determining, from the plurality of article topics, a target article topic whose similarity meets a preset condition;
and the returning module is used for returning the article corresponding to the target article topic to the terminal.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor, and the instructions enable the at least one processor to perform the method of any one of claims 1-6 or the method of claim 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-6 or the method of claim 7.
17. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6 or the method of claim 7.
CN202111115438.0A 2021-09-23 2021-09-23 Semantic data processing method and search method and device Pending CN113869060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111115438.0A CN113869060A (en) 2021-09-23 2021-09-23 Semantic data processing method and search method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111115438.0A CN113869060A (en) 2021-09-23 2021-09-23 Semantic data processing method and search method and device

Publications (1)

Publication Number Publication Date
CN113869060A true CN113869060A (en) 2021-12-31

Family

ID=78993399

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111115438.0A Pending CN113869060A (en) 2021-09-23 2021-09-23 Semantic data processing method and search method and device

Country Status (1)

Country Link
CN (1) CN113869060A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115329749A (en) * 2022-10-14 2022-11-11 成都数之联科技股份有限公司 Recall and ordering combined training method and system for semantic retrieval
CN115329749B (en) * 2022-10-14 2023-01-10 成都数之联科技股份有限公司 Recall and ordering combined training method and system for semantic retrieval
CN115545121A (en) * 2022-11-25 2022-12-30 北京红棉小冰科技有限公司 Model training method and device
CN116127020A (en) * 2023-03-03 2023-05-16 北京百度网讯科技有限公司 Method for training generated large language model and searching method based on model
CN116910377A (en) * 2023-09-14 2023-10-20 长威信息科技发展股份有限公司 Grid event classified search recommendation method and system
CN116910377B (en) * 2023-09-14 2023-12-08 长威信息科技发展股份有限公司 Grid event classified search recommendation method and system

Similar Documents

Publication Publication Date Title
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN111125335B (en) Question and answer processing method and device, electronic equipment and storage medium
CN112507715B (en) Method, device, equipment and storage medium for determining association relation between entities
US20220383190A1 (en) Method of training classification model, method of classifying sample, and device
KR102532396B1 (en) Data set processing method, device, electronic equipment and storage medium
CN111079442B (en) Vectorization representation method and device of document and computer equipment
CN113869060A (en) Semantic data processing method and search method and device
CN111198940B (en) FAQ method, question-answer search system, electronic device, and storage medium
CN111832292A (en) Text recognition processing method and device, electronic equipment and storage medium
CN111078865B (en) Text title generation method and device
KR102573637B1 (en) Entity linking method and device, electronic equipment and storage medium
CN111177355B (en) Man-machine conversation interaction method and device based on search data and electronic equipment
CN111274407B (en) Method and device for calculating triplet confidence in knowledge graph
JP2022050379A (en) Semantic retrieval method, apparatus, electronic device, storage medium, and computer program product
CN111797216B (en) Search term rewriting method, apparatus, device and storage medium
CN112560479A (en) Abstract extraction model training method, abstract extraction device and electronic equipment
CN111539209A (en) Method and apparatus for entity classification
CN111859953B (en) Training data mining method and device, electronic equipment and storage medium
CN112507091A (en) Method, device, equipment and storage medium for retrieving information
CN111651578B (en) Man-machine conversation method, device and equipment
CN112163405A (en) Question generation method and device
JP2022091986A (en) Intelligent interactive method, device, electronic apparatus and storage medium
CN111666751A (en) Training text extension method, device, equipment and storage medium
CN114444462B (en) Model training method and man-machine interaction method and device
CN111783861A (en) Data classification method, model training device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination