CN117708352A - Data processing method, device, equipment and storage medium - Google Patents

Data processing method, device, equipment and storage medium

Info

Publication number
CN117708352A
CN117708352A (application CN202311814601.1A)
Authority
CN
China
Prior art keywords
text
training
keywords
model
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311814601.1A
Other languages
Chinese (zh)
Inventor
佘萧寒
邱雪涛
张思璐
王宇
王阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Unionpay Co Ltd
Priority to CN202311814601.1A
Publication of CN117708352A
Legal status: Pending

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a data processing method, apparatus, device, and storage medium, applied to an information retrieval platform. A text to be processed is acquired in response to a retrieval instruction; candidate keywords of the text are then extracted and assembled into a candidate keyword set. For each candidate keyword, a semantic similarity score is determined using a feature model together with a semantic similarity algorithm, where the feature model is obtained by training a pre-trained model. The candidate keyword with the maximum semantic similarity score is selected as the target keyword, and information retrieval is performed with the target keyword to obtain the retrieval result. Because the feature model is obtained by semi-supervised training, extraction of the target keyword is improved, which in turn optimizes the retrieval effect, raises the recall rate and accuracy of the information retrieval platform, and improves the user experience of the platform.

Description

Data processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a data processing method, apparatus, device, and storage medium.
Background
Information retrieval is widely used across business lines as a primary means of querying and acquiring information, in business lines including but not limited to merchant information verification (business-registration lookup, map information retrieval, etc.), office information retrieval, and innovative technology applications. Keyword (query) extraction is an important component of information retrieval, and its quality often determines the accuracy and relevance of the final results an information retrieval system returns.
The prior art offers two main approaches to keyword extraction. The first is supervised, including but not limited to sequence labeling and sequence generation: based on a large-scale labeled corpus and deep learning modeling, a functional mapping from full text to keyword labels (or from full text to keywords) is fitted and then generalized to unlabeled text. The second is unsupervised, including rule-based and deep learning methods: text features are vectorized by rules or a deep neural network, and the relevance between the full text and candidate keywords is scored (for example by conditional probability or cosine similarity) to produce a ranked keyword candidate list.
Both existing approaches have drawbacks. The supervised approach requires a large-scale labeled corpus whose manual annotation is time-consuming and labor-intensive; it adapts poorly across domains, and the extraction model must be retrained after domain migration. The unsupervised approach is less accurate than the supervised one and suffers from hard-to-resolve problems such as a preference for long sentences and the neglect of context information.
Therefore, to improve the information retrieval effect, it is highly desirable to optimize keyword extraction and overcome the defects of existing keyword extraction schemes.
Disclosure of Invention
The application provides a data processing method, apparatus, device, and storage medium that overcome the defects of existing keyword extraction methods in information retrieval, thereby optimizing the retrieval effect and improving the accuracy and recall of the retrieval results.
In a first aspect, the present application provides a data processing method, applied to an information retrieval platform, including:
responding to a retrieval instruction to acquire a text to be processed;
acquiring each candidate keyword of the text to be processed, and generating a candidate keyword set of the text to be processed from the candidate keywords;
for each candidate keyword, determining a corresponding semantic similarity score through a feature model and a semantic similarity algorithm, wherein the feature model is obtained by performing semi-supervised training on a pre-trained model;
and acquiring the maximum value among the semantic similarity scores of the candidate keywords, and determining the candidate keyword corresponding to the maximum value as the target keyword, so that information retrieval is performed according to the target keyword to obtain a retrieval result.
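The four steps above can be sketched end-to-end as follows. This is a minimal illustrative sketch, not the patent's implementation: `extract_candidates` and `similarity_score` are toy stand-ins (whitespace tokenization and a length ratio) for the candidate-set generation and feature-model scoring detailed in the designs below.

```python
# Minimal sketch of the four-step method from the first aspect.
# extract_candidates and similarity_score are illustrative stand-ins,
# not the patent's actual algorithms.

def extract_candidates(text):
    # Toy candidate generator: whitespace tokens (stand-in for the
    # unsupervised extractor plus the random-mask strategy).
    return set(text.split())

def similarity_score(text, keyword):
    # Toy score: fraction of the text covered by the keyword.
    return len(keyword) / max(len(text), 1)

def retrieve_target_keyword(text):
    candidates = extract_candidates(text)           # step: candidate set
    scores = {kw: similarity_score(text, kw)        # step: score each candidate
              for kw in candidates}
    return max(scores, key=scores.get)              # step: arg-max -> target keyword

print(retrieve_target_keyword("china petroleum gas station"))
```

A real system would replace the toy score with the feature-model-based semantic similarity described in the following designs; the arg-max selection of the target keyword is unchanged.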
In one possible design, acquiring each candidate keyword of the text to be processed includes:
acquiring first candidate keywords of the text to be processed using an unsupervised keyword extraction tool, and acquiring second candidate keywords of the text to be processed using a random mask strategy, the candidate keywords comprising both the first and the second candidate keywords.
In one possible design, determining the semantic similarity score corresponding to each candidate keyword through the feature model and the semantic similarity algorithm includes:
acquiring, through the feature model, a plurality of feature vectors corresponding to the current candidate keyword;
acquiring, through the semantic similarity algorithm and according to the plurality of feature vectors, the semantic similarity score corresponding to the current candidate keyword;
wherein the plurality of feature vectors comprise a feature vector of the text to be processed, a feature vector of the current candidate keyword, and a feature vector of the mask text corresponding to the current candidate keyword.
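The patent does not fix a particular semantic similarity algorithm. The sketch below assumes a cosine-based combination of the three feature vectors just named, rewarding a keyword that is close to the full text while its removal (the mask text) moves the text away from the original; both the formula and the function names are illustrative assumptions.

```python
import math

def cosine(u, v):
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_similarity_score(vec_doc, vec_kw, vec_masked):
    # Hypothetical combination: a good keyword is close to the full text
    # (high cosine(doc, kw)) and carries information the mask text lost
    # (low cosine(doc, masked)).
    return cosine(vec_doc, vec_kw) - cosine(vec_doc, vec_masked)

print(round(semantic_similarity_score([1, 0], [1, 0], [0, 1]), 3))
```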
In one possible design, performing semi-supervised training on the pre-trained model to obtain the feature model includes:
acquiring a data set to be trained, the data set comprising a plurality of training texts, the keywords corresponding to each training text, and the field to be masked corresponding to each training text;
training the pre-trained model with the data set to be trained according to a preset loss function and a parameter optimization algorithm, and taking the trained model as the feature model;
wherein the preset loss function is derived from a plurality of semantic similarity scores of each training text, these scores representing the text information carried by the keywords of each training text and the context information of each training text after its keywords are masked.
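The exact form of the preset loss function is not given here. One hedged possibility consistent with the description (scores reflecting both the keyword's information and the masked context) is a hinge-style margin objective; the formula and the margin value are assumptions for illustration only.

```python
def training_loss(score_keyword, score_masked, margin=0.5):
    # Hypothetical hinge-style objective: push the keyword's similarity
    # to the full text above the mask text's similarity by at least
    # `margin`. Zero loss once the keyword is sufficiently separated.
    return max(0.0, margin - (score_keyword - score_masked))

print(training_loss(0.9, 0.2))
```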
In one possible design, acquiring the data set to be trained includes:
acquiring an unlabeled data set comprising the plurality of training texts;
manually labeling the keywords of each training text to obtain manually labeled keywords, and automatically labeling each training text with the unsupervised keyword extraction tool to obtain automatically labeled keywords;
generating a labeled data set from the manually and automatically labeled keywords of each training text, the labeled data set comprising each training text and its corresponding keywords;
for each training text, acquiring the corresponding field to be masked through a random selection tool;
and generating the data set to be trained from the labeled data set and the field to be masked of each training text.
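The data-set construction above (unlabeled texts, merged manual and automatic labels, a randomly selected field to mask) can be sketched as follows. All helper names are illustrative, and `str.split()` stands in for a real word segmentation tool.

```python
import random

def build_training_set(unlabeled_texts, manual_labels, auto_extractor, seed=0):
    # Sketch of the semi-supervised data set described above; auto_extractor
    # stands in for the unsupervised keyword extraction tool, and
    # manual_labels maps a text to its (possibly absent) manual keywords.
    rng = random.Random(seed)
    dataset = []
    for text in unlabeled_texts:
        # Merge manual labels (if any) with automatically labeled keywords.
        keywords = set(manual_labels.get(text, [])) | set(auto_extractor(text))
        # Pick the field to be masked with a random selection tool
        # (word segmentation reduced here to whitespace splitting).
        field_to_mask = rng.choice(text.split())
        dataset.append({"text": text,
                        "keywords": sorted(keywords),
                        "mask_field": field_to_mask})
    return dataset

sample = build_training_set(
    ["china petroleum gas station"],
    {"china petroleum gas station": ["china petroleum"]},
    lambda t: [t.split()[-1]],   # toy extractor: last token
)
print(sample[0]["keywords"])
```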
In one possible design, the unsupervised keyword extraction tool includes any one of a phrase generation model, a keyword screening model, and a phrase extraction model, and the random selection tool includes a word segmentation tool.
In one possible design, the pre-trained model includes any one of a BERT model, a GPT model, a RoBERTa model, and an XLNet model.
In one possible design, acquiring the text to be processed in response to the retrieval instruction includes:
receiving service information sent by a service end, and generating the retrieval instruction from the service information;
and acquiring, according to the retrieval instruction, a feature field of the service information to obtain the text to be processed, the feature field representing the to-do business corresponding to the service information.
In one possible design, the to-do business includes any of merchant information verification, office information retrieval, and database retrieval.
In a second aspect, the present application provides a data processing apparatus applied to an information retrieval platform, comprising:
an acquisition module, configured to acquire a text to be processed in response to a retrieval instruction;
a generation module, configured to acquire each candidate keyword of the text to be processed and generate a candidate keyword set of the text to be processed from the candidate keywords;
a first processing module, configured to determine, for each candidate keyword, the corresponding semantic similarity score through a feature model and a semantic similarity algorithm, the feature model being obtained by performing semi-supervised training on a pre-trained model;
and a second processing module, configured to acquire the maximum value among the semantic similarity scores of the candidate keywords, determine the candidate keyword corresponding to the maximum value as the target keyword, and perform information retrieval according to the target keyword to obtain a retrieval result.
In one possible design, the generation module is specifically configured to:
and acquiring a first candidate keyword of the text to be processed by using an unsupervised keyword extraction tool, and acquiring a second candidate keyword of the text to be processed by using a random mask strategy, wherein each candidate keyword comprises the first candidate keyword and the second candidate keyword.
In one possible design, the first processing module is specifically configured to:
acquiring a plurality of feature vectors corresponding to the current candidate keywords through the feature model;
acquiring semantic similarity scores corresponding to the current candidate keywords through the semantic similarity algorithm according to the plurality of feature vectors;
the feature vectors comprise feature vectors of the text to be processed, feature vectors of the current candidate keywords and feature vectors of mask texts corresponding to the current candidate keywords.
In one possible design, the data processing apparatus further includes:
a sample acquisition module, configured to acquire a data set to be trained, the data set comprising a plurality of training texts, the keywords corresponding to each training text, and the field to be masked corresponding to each training text;
a model training module, configured to train the pre-trained model with the data set to be trained according to a preset loss function and a parameter optimization algorithm, and to take the trained model as the feature model;
wherein the preset loss function is derived from a plurality of semantic similarity scores of each training text, these scores representing the text information carried by the keywords of each training text and the context information of each training text after its keywords are masked.
In one possible design, the sample acquisition module is specifically configured to:
acquire an unlabeled data set comprising the plurality of training texts;
manually label the keywords of each training text to obtain manually labeled keywords, and automatically label each training text with the unsupervised keyword extraction tool to obtain automatically labeled keywords;
generate a labeled data set from the manually and automatically labeled keywords of each training text, the labeled data set comprising each training text and its corresponding keywords;
for each training text, acquire the corresponding field to be masked through a random selection tool;
and generate the data set to be trained from the labeled data set and the field to be masked of each training text.
In one possible design, the unsupervised keyword extraction tool includes any one of a phrase generation model, a keyword screening model, and a phrase extraction model, and the random selection tool includes a word segmentation tool.
In one possible design, the pre-trained model includes any one of a BERT model, a GPT model, a RoBERTa model, and an XLNet model.
In one possible design, the acquisition module is further configured to:
receiving service information sent by a service end, and generating the search instruction according to the service information;
and acquiring a characteristic field of the service information according to the retrieval instruction to obtain the text to be processed according to the characteristic field, wherein the characteristic field is used for representing the service to be handled corresponding to the service information.
In one possible design, the to-do business includes any of merchant information verification, office information retrieval, and database retrieval.
In a third aspect, the present application provides an electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement any one of the possible data processing methods provided in the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium having stored therein computer executable instructions which when executed by a processor are adapted to carry out any one of the possible data processing methods provided in the first aspect.
In a fifth aspect, the present application provides a computer program product comprising computer-executable instructions for implementing any one of the possible data processing methods provided in the first aspect when executed by a processor.
The application provides a data processing method, apparatus, device, and storage medium. A text to be processed is first acquired in response to a retrieval instruction; its candidate keywords are then extracted and assembled into a candidate keyword set. Next, for each candidate keyword, a semantic similarity score is determined through a feature model and a semantic similarity algorithm, the feature model being obtained by training a pre-trained model. The candidate keyword with the maximum semantic similarity score is determined as the target keyword, and information retrieval is finally performed with it to obtain the retrieval result. Because the target keyword is obtained through a feature model trained in a semi-supervised manner, keyword extraction is improved, which optimizes the retrieval effect, raises the recall rate and accuracy of the information retrieval platform, and improves the user experience of the platform.
Drawings
To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and other drawings can be derived from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application;
fig. 2 is a flow chart of a data processing method provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a feature model construction provided in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 5 is a schematic diagram of another data processing apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of methods and apparatus consistent with aspects of the present application as detailed in the accompanying claims.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims of this application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be capable of operation in sequences other than those illustrated or described herein, for example. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the technical solution of the application, the collection, storage, use, processing, transmission, provision, and disclosure of information such as the text to be processed and the training samples all comply with relevant laws and regulations and do not violate public order and good morals.
Keyword (query) extraction is an important component of information retrieval, and its quality often determines the accuracy and relevance of the final results returned by an information retrieval system. The prior art offers two keyword extraction approaches, supervised and unsupervised, each with drawbacks. The supervised approach requires a large-scale labeled corpus whose manual annotation is time-consuming and labor-intensive, adapts poorly across domains, and must retrain the extraction model after domain migration; the unsupervised approach is less accurate than the supervised one and suffers from hard-to-resolve problems such as a preference for long sentences and the neglect of context information. Therefore, to improve the information retrieval effect, the keyword extraction effect needs to be optimized to overcome the defects of existing schemes.
In view of the foregoing problems in the prior art, the present application provides a data processing method, apparatus, device, and storage medium; the method may be applied to an information retrieval platform. The inventive concept is as follows: for keyword extraction during information retrieval, a feature model is obtained by semi-supervised training; the feature model, combined with a semantic similarity algorithm, determines the semantic similarity corresponding to each candidate keyword of the text to be processed, and the candidate keyword corresponding to the maximum similarity is determined as the target keyword, completing keyword extraction. The amount of data to be labeled is modest, because model training mixes a small-scale labeled corpus with a large-scale unlabeled one; the design of the training loss function jointly considers keyword length and context information, alleviating the problems of existing keyword extraction methods. Optimizing the keyword extraction effect optimizes the retrieval effect of the information retrieval platform, raises its recall rate and accuracy, and in turn improves its user experience.
Fig. 1 is a schematic view of an application scenario provided in an embodiment of the present application. As shown in fig. 1, information retrieval is often required in the business lines operated by some organizations; for example, a service end 100 is deployed to execute the business line. The to-do business in the business line may be, for example, any of merchant information verification, office information retrieval, and database retrieval.
Consider information retrieval within such a business. For example, when advancing the merchant-information/map association verification requirements of a business operation center, map information retrieval may miss results because merchant names and address information are overly long and contain redundant information. Suppose the registered merchant name at the business operation center is "China Petroleum & Natural Gas Co., Ltd. Sichuan Chengdu Jintang Sales Branch Wild Zoo Gas Station", while the corresponding map POI (Point of Interest) is named "China Petroleum Gas Station (Jintang Wild Zoo Station)". A full-text search of the map search engine with the merchant name fails to return the corresponding map POI, whereas searching with the keyword "China Petroleum" succeeds. Keyword extraction is thus an important component of information retrieval, and its quality often determines the accuracy and relevance of the retrieval results the information retrieval platform 200 returns to the service end 100. The information retrieval platform 200 executes the data processing method provided by the embodiments of the application to realize information retrieval; by optimizing the keyword extraction effect, the retrieval effect of the information retrieval platform 200 is optimized, its recall rate and accuracy are raised, and its user experience is improved.
It can be understood that the service end 100 may be configured with electronic devices such as a computer, notebook computer, smartphone, or smart wearable device; the embodiments of the application do not limit the device type, and fig. 1 illustrates the service end 100 as a computer. Likewise, the information retrieval platform 200 may be configured with electronic devices such as a server, server cluster, computer, notebook computer, smartphone, or smart wearable device; its device type is not limited either, and fig. 1 illustrates the information retrieval platform 200 as a computer.
It should be noted that the application scenario described above is merely illustrative, and the data processing method provided in the embodiment of the present application includes, but is not limited to, the application scenario listed above.
Fig. 2 is a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 2, the data processing method provided in the embodiment of the present application includes:
s101: and responding to the search instruction to acquire the text to be processed, acquiring each candidate keyword of the text to be processed, and generating a candidate keyword set of the text to be processed according to each candidate keyword.
For example, the information retrieval platform receives service information sent by the service end, so as to generate a retrieval instruction according to the service information, wherein the retrieval instruction is used for instructing the information retrieval platform to retrieve information according to the service information. The information retrieval platform obtains the feature field contained in the service information, for example, the feature field may be an information fragment in the service information, which represents the retrieval condition. And the information retrieval platform further determines the acquired characteristic field as a text to be processed.
In one possible design, the text to be processed may be, for example, text for representing a merchant name, merchant information, office information, and database information, where the specific content of the text to be processed is determined by the specific content of the business to be handled, and the embodiment of the present application is not limited to the specific content of the text to be processed.
After the text to be processed is obtained, candidate keywords of the text to be processed are further obtained, and a candidate keyword set of the text to be processed is formed by taking the candidate keywords as set elements. In other words, the candidate keyword set of the text to be processed is a set composed of each candidate keyword of the text to be processed as a set element.
In one possible design, a possible implementation of acquiring the candidate keywords of the text to be processed is as follows:
First candidate keywords of the text to be processed are obtained with an unsupervised keyword extraction tool. For example, for the text to be processed (doc = "China Petroleum & Natural Gas Co., Ltd. Sichuan Chengdu Jintang Sales Branch Wild Zoo Gas Station"), the tool automatically labels the candidate keywords {"China Petroleum", "Jintang", "Wild Zoo"}. There may be one or more first candidate keywords, depending on the specific content of the text. In some embodiments, the unsupervised keyword extraction tool may include, but is not limited to, any of a phrase generation model (such as the PKE unsupervised keyword extraction toolkit), a keyword screening model (such as EmbedRank), and a phrase extraction model (such as SIFRank).
Meanwhile, second candidate keywords of the text to be processed may be obtained with a random mask strategy. For the same text, the randomly extracted candidate keywords might be {"Co., Ltd.", "Natural Gas", "Sichuan Chengdu", "Gas Station"}. The random mask strategy extracts keywords at random rather than by human selection. There may likewise be one or more second candidate keywords, depending on the specific content of the text.
The first and second candidate keywords together constitute the candidate keywords of the text to be processed, which form the candidate keyword set as its elements. For the example text, the candidate keyword set is {"China Petroleum", "Jintang", "Wild Zoo", "Co., Ltd.", "Natural Gas", "Sichuan Chengdu", "Gas Station"}.
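The two-source candidate generation of S101 can be sketched as follows. The caller-supplied extractor stands in for tools such as PKE, EmbedRank, or SIFRank, and whitespace splitting stands in for a proper (e.g. Chinese) word segmenter; all names here are illustrative.

```python
import random

def candidate_keyword_set(text, unsupervised_extractor, n_random=3, seed=0):
    # Two sources, as in S101: first candidate keywords from an
    # unsupervised extraction tool, second candidate keywords from a
    # random mask strategy over the segmented text.
    first = set(unsupervised_extractor(text))
    tokens = text.split()  # a real system would use a word segmentation tool
    rng = random.Random(seed)
    second = set(rng.sample(tokens, min(n_random, len(tokens))))
    return first | second

cands = candidate_keyword_set(
    "china petroleum natural gas jintang wild zoo gas station",
    lambda t: ["china petroleum", "wild zoo"],   # toy unsupervised extractor
)
print("china petroleum" in cands, "wild zoo" in cands)
```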
S102: and determining the semantic similarity score corresponding to each candidate keyword according to the feature model and the semantic similarity algorithm aiming at each candidate keyword.
For each candidate keyword in the candidate keyword set of the text to be processed, the semantic similarity score of that candidate keyword is determined according to a pre-trained feature model and a semantic similarity algorithm. The feature model may be obtained by performing semi-supervised training on a pre-training model. The feature model is used to calculate a plurality of feature vectors for each candidate keyword: a feature vector of the text to be processed, a feature vector of the candidate keyword itself, and a feature vector of the mask text of the candidate keyword. The mask text refers to the text to be processed after the current candidate keyword has been masked.
For example, the text to be processed is Doc = "China Petroleum and Natural Gas stock limited company Sichuan Chengdu Jintang sales division wild zoo gas station"; the candidate keyword set is query = {"Chinese oil", "Jintang", "wild zoo", "stock limited company", "natural gas", "Sichuan Chengdu", "gas station"}; and the mask texts are the elements of the set Doc-query = {"[MASK] natural gas stock limited company Sichuan Chengdu Jintang sales division wild zoo gas station", "China Petroleum and Natural Gas stock limited company Sichuan Chengdu [MASK] sales division wild zoo gas station", "China Petroleum and Natural Gas stock limited company Sichuan Chengdu Jintang sales division [MASK] gas station", "China Petroleum and Natural Gas [MASK] Sichuan Chengdu Jintang sales division wild zoo gas station", "China Petroleum and [MASK] stock limited company Sichuan Chengdu Jintang sales division wild zoo gas station", "China Petroleum and Natural Gas stock limited company [MASK] Jintang sales division wild zoo gas station", "China Petroleum and Natural Gas stock limited company Sichuan Chengdu Jintang sales division wild zoo [MASK]"}.
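The Doc-query mask texts can be generated mechanically; a sketch (assuming, for simplicity, that each candidate keyword occurs once in the text, and with illustrative names):

```python
def mask_texts(doc: str, candidates: list[str], mask: str = "[MASK]") -> dict:
    # For each candidate keyword, return the text to be processed with
    # that keyword replaced by the mask token.
    return {kw: doc.replace(kw, mask) for kw in candidates}

print(mask_texts("china oil gas station", ["oil", "gas station"]))
```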
In one possible design, a possible implementation of step S102 includes:
and acquiring a plurality of feature vectors corresponding to the current candidate keywords through a feature model according to each candidate keyword.
For example, taking the current candidate keyword query = "Chinese oil" as an example, the text to be processed, the current candidate keyword and the mask text corresponding to the current candidate keyword are taken as the inputs of the feature model, and the corresponding outputs, namely a plurality of feature vectors corresponding to the current candidate keyword, are obtained.
Specifically: inputs = tokenizer(["China Petroleum and Natural Gas stock limited company Sichuan Chengdu Jintang sales division wild zoo gas station", "Chinese oil", "[MASK] natural gas stock limited company Sichuan Chengdu Jintang sales division wild zoo gas station"], return_tensors="pt"); outputs = model(**inputs); features = outputs["pooler_output"]. Here, features[0] is the feature vector H_D of the text to be processed Doc, features[1] is the feature vector H_q of the current candidate keyword "Chinese oil", and features[2] is the feature vector H_(D-q) of the mask text, i.e., the text to be processed after masking the current candidate keyword "Chinese oil".
After obtaining a plurality of feature vectors corresponding to the current candidate keywords, obtaining semantic similarity scores corresponding to the current candidate keywords through a semantic similarity algorithm according to the feature vectors.
In some embodiments, the semantic similarity algorithm may be represented, for example, by equation (1):
Similarity = α·sim(Doc, query) − (1−α)·sim(Doc, Doc-query)   (1)
where α represents a semantic coefficient, such as α = 0.5; sim(Doc, query) represents the similarity between the text to be processed and the current candidate keyword; and sim(Doc, Doc-query) represents the similarity between the text to be processed and the mask text obtained after masking the current candidate keyword. Each similarity is calculated using the corresponding feature vectors. Therefore, substituting the feature vector of the text to be processed, the feature vector of the current candidate keyword, and the feature vector of the mask text corresponding to the current candidate keyword into formula (1) yields the semantic similarity score corresponding to the current candidate keyword "Chinese oil"; that is, formula (1) becomes the following formula (2):
Similarity = 0.5·(H_D·H_q/(|H_D|·|H_q|)) − 0.5·(H_D·H_(D-q)/(|H_D|·|H_(D-q)|))   (2)
therefore, the semantic similarity score corresponding to the current candidate keyword can be obtained through the formula (2). And (3) obtaining the semantic similarity score corresponding to each candidate keyword through a formula (1) based on a plurality of feature vectors corresponding to each candidate keyword.
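A sketch of formulas (1) and (2) over plain Python lists; the cosine form of sim and α = 0.5 follow the text above, while the helper names and toy vectors are ours:

```python
import math

def sim(a, b):
    # cosine similarity: a.b / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similarity_score(h_doc, h_query, h_doc_minus_query, alpha=0.5):
    # Formula (1): alpha * sim(Doc, query) - (1 - alpha) * sim(Doc, Doc-query)
    return alpha * sim(h_doc, h_query) - (1 - alpha) * sim(h_doc, h_doc_minus_query)

# Extreme toy case: a keyword whose vector aligns with the document while
# its masked text is orthogonal to it scores the maximum value alpha:
print(similarity_score([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))  # 0.5
```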
S103: and obtaining a maximum value from the semantic similarity score corresponding to each candidate keyword, and determining the candidate keyword corresponding to the maximum value as the target keyword.
After obtaining the semantic similarity score corresponding to each candidate keyword of the text to be processed, the maximum of these semantic similarity scores is obtained, and the candidate keyword corresponding to the maximum value is determined as the target keyword. For example, if "Chinese oil" corresponds to the highest semantic similarity score, it is the target keyword.
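Step S103 is then a plain argmax over the score table (names and numbers below are illustrative):

```python
def pick_target(scores: dict) -> str:
    # Return the candidate keyword with the maximum semantic similarity score.
    return max(scores, key=scores.get)

scores = {"Chinese oil": 0.41, "Jintang": 0.17, "gas station": 0.22}
print(pick_target(scores))  # Chinese oil
```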
S104: and carrying out information retrieval according to the target keywords to obtain a retrieval result, and returning the retrieval result to the service end.
The information retrieval platform carries out information retrieval according to the target keywords to obtain corresponding retrieval results, and then returns the retrieval results to the service end, so that the service end executes corresponding services according to the retrieval results. The manner in which the information retrieval platform retrieves information according to the extracted target keywords is not limited here; in practice, retrieval may be performed by the search engine corresponding to the information retrieval platform under actual working conditions.
The data processing method provided by the embodiment of the application is applied to an information retrieval platform, firstly, a text to be processed is obtained in response to a retrieval instruction, then, candidate keywords of the text to be processed are obtained, and a candidate keyword set of the text to be processed is generated according to the candidate keywords. And then determining the semantic similarity corresponding to each candidate keyword through a feature model and a semantic similarity algorithm aiming at each candidate keyword, wherein the feature model is obtained through training a pre-training model. And obtaining a semantic similarity maximum value from the semantic similarity corresponding to each candidate keyword, determining the candidate keywords corresponding to the semantic similarity maximum value as target keywords, and finally carrying out information retrieval according to the target keywords to obtain a retrieval result. And acquiring target keywords through the feature model obtained through semi-supervised training to perform information retrieval, wherein the extraction effect of the target keywords is optimized, so that the retrieval effect can be optimized, the recall rate and the accuracy of the information retrieval platform are improved, and the user experience of the information retrieval platform is improved.
As described in the above embodiment, the feature model is obtained by performing semi-supervised training on the pre-training model, and the semi-supervised training has low requirement on the data volume to be marked, so that model training can be implemented by using a mode of mixing a small-scale marked corpus with a large-scale unmarked corpus, and the problems possibly generated by the existing keyword extraction method can be relieved by comprehensively considering the keyword length and the context information in the design of the loss function of model training. In one possible design, a possible implementation of the resulting feature model is shown in fig. 3. Fig. 3 is a schematic flow chart of constructing a feature model according to an embodiment of the present application.
As shown in fig. 3, the embodiment of the present application includes:
S201: And obtaining an unlabeled dataset.
Wherein the unlabeled dataset comprises a plurality of training texts.
A plurality of training texts are collected to form the unlabeled dataset; that is, the training texts are taken as elements to form a set, obtaining the unlabeled dataset. Let the unlabeled dataset be Original_Docs = {Doc_1, Doc_2, …, Doc_n}, where each element is a training text, such as Doc_x = "China Petroleum and Natural Gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station".
S202a: and carrying out manual keyword labeling on each training text to obtain manual labeled keywords.
Step S202a is data preparation by supervised means. For example, for each training text Doc_i ∈ Original_Docs, the keyword query_i1 corresponding to Doc_i is labeled manually, obtaining the manually labeled keyword. Such as: Doc_i = "China Petroleum and Natural Gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station", query_i1 = "Chinese oil" is the manually labeled keyword obtained by manually labeling the training text.
S202b: and carrying out automatic labeling on each training text by using an unsupervised keyword extraction tool to obtain automatic labeling keywords.
Step S202b is data preparation by unsupervised means. For example, for Doc_i ∈ Original_Docs, an unsupervised keyword extraction tool is used to automatically label the keywords query_i2 corresponding to Doc_i, obtaining the automatically labeled keywords. Such as: Doc_i = "China Petroleum and Natural Gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station", query_i2 = {"Chinese oil", "Yanta", "Datang cottonrose garden"} are the automatically labeled keywords corresponding to the training text.
S203: and generating a labeling data set according to the manual labeling keywords and the automatic labeling keywords corresponding to each training text.
For each training text, the manually labeled keywords query_i1 and the automatically labeled keywords query_i2 corresponding to that text are merged into query_i, and the training texts are then combined to generate the annotated dataset Annotated_Docs = {(Doc_1, query_1), (Doc_2, query_2), …, (Doc_n, query_n)}. The annotated dataset comprises each training text and the keywords corresponding to each training text, where the keywords corresponding to each training text include both the manually labeled and the automatically labeled keywords of that text.
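The merge of manual and automatic annotations into Annotated_Docs can be sketched as follows (the helper name and toy labels are ours):

```python
def build_annotated_docs(docs, manual_labels, auto_labels):
    # Annotated_Docs = [(Doc_i, query_i), ...], where query_i is the
    # duplicate-free union of the manual and automatic keyword labels.
    annotated = []
    for doc, manual, auto in zip(docs, manual_labels, auto_labels):
        query = list(dict.fromkeys(manual + auto))  # preserves order
        annotated.append((doc, query))
    return annotated

docs = ["datang garden filling station"]
print(build_annotated_docs(docs, [["datang"]], [["datang", "filling station"]]))
```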
In one possible design, the automatically labeled keywords query_i2 generated by the unsupervised keyword extraction tool may carry, in addition to the keywords themselves, a relevance score score_i2. A threshold may be set on this score: automatically labeled keywords whose score exceeds the threshold are added to query_i, while the other fields may be added to the field to be masked rand_i to participate in training as negative examples.
S204: and acquiring a field to be masked corresponding to each training text by a random selection tool for each training text.
For Doc_i ∈ Annotated_Docs, a partial field rand_i is randomly selected to be hidden via a random selection tool. For example: Doc_i = "China Petroleum and Natural Gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station" has a length of 32; randint is used to generate an integer in [0, 31], and the field at the corresponding position is intercepted as the field to be masked, such as rand_i = "stock limited company".
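The randint-based selection of the field to be masked might look like the following; the span length and seeding are our illustrative choices, not prescribed by the patent:

```python
import random

def pick_mask_field(doc: str, span_len: int = 4, seed=None) -> str:
    # Randomly choose a contiguous span of the training text to hide,
    # mirroring the randint-based selection described above.
    rng = random.Random(seed)
    start = rng.randint(0, max(0, len(doc) - span_len))
    return doc[start:start + span_len]

field = pick_mask_field("china oil stock limited company", span_len=5, seed=7)
print(field)  # some 5-character span of the text
```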
In one possible design, a natural language processing (Natural Language Processing, NLP) toolkit such as jieba, LTP or HanLP may be used to segment Doc_i into words, for example "Chinese / oil and natural gas / stock limited / company / Shaanxi / Xi'an / Yanta / sales / division / Datang cottonrose garden / gas station", with the mask then applied in word units. The random selection tool may include, but is not limited to, the word segmentation tools listed above.
S205: and generating a data set to be trained according to the marking data set and the field to be masked corresponding to each training text.
For example, the annotated dataset and the field to be masked corresponding to each training text are integrated, so that the set of each training text, the keywords corresponding to each training text, and the field to be masked corresponding to each training text is obtained as the data set to be trained. For example, the data set to be trained may be represented as: Corpus = {(Doc_1, query_1, rand_1), (Doc_2, query_2, rand_2), …, (Doc_n, query_n, rand_n)}. The data set to be trained comprises a plurality of training texts, the keywords corresponding to each training text, and the field to be masked corresponding to each training text.
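Assembling Corpus from the annotated dataset and the mask fields is then a simple pairing; the sanity check that rand_i actually comes from Doc_i is our addition:

```python
def build_corpus(annotated_docs, mask_fields):
    # Corpus = [(Doc_i, query_i, rand_i), ...]
    corpus = []
    for (doc, query), rand in zip(annotated_docs, mask_fields):
        assert rand in doc, "field to mask must be a substring of its text"
        corpus.append((doc, query, rand))
    return corpus

annotated = [("china oil stock limited company", ["china oil"])]
print(build_corpus(annotated, ["stock limited"]))
```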
So far, the preparation of training data in model training is realized through steps S201 to S205, namely, a data set to be trained is obtained.
S206: training the pre-training model by utilizing the data set to be trained according to a preset loss function and a parameter optimization algorithm, and determining the pre-training model with the training being finished as a characteristic model.
The pre-training model is trained using the data set to be trained. The pre-training model is used to obtain the feature vector of each element in the data set to be trained. In some embodiments, the pre-training model may include, but is not limited to, any one of a BERT model, a GPT model, a RoBERTa model and an XLNet model.
Taking the BERT model as an example: for each triple (Doc_i, query_i, rand_i) ∈ Corpus, the data Doc_i, query_i and rand_i corresponding to each element of the data set to be trained are input into the BERT model to obtain the corresponding feature vectors, and the obtained feature vectors are substituted into the preset loss function shown in formula (3):
Loss = max{β·[sim(Doc, query) − sim(Doc, rand)] + (1−β)·[sim(Doc, Doc-rand) − sim(Doc, Doc-query)]}   (3)
where the function sim calculates the degree of similarity between the feature vectors corresponding to its two parameters; Doc, query and rand respectively denote the feature vectors of the training text, of the keywords corresponding to the training text, and of the field to be masked corresponding to the training text; Doc-query denotes the feature vector of the training text after its corresponding keywords are masked; Doc-rand denotes the feature vector of the training text after its field to be masked is masked; and β is the weight parameter of the two sim terms.
As shown in formula (3), the preset loss function is constructed from a plurality of semantic similarity scores for each training text. These scores represent both the text information carried by the keywords corresponding to each training text and the context information remaining after the corresponding keywords are masked. Specifically, the premise in designing the preset loss function is that the keyword query corresponding to a training text essentially covers the semantic information carried by the training text Doc, so the semantic similarity between Doc and query should be significantly higher than that between Doc and the field to be masked rand. However, relying on this constraint alone, the final optimization result is easily biased toward long, complex phrases, and the context information in Doc is ignored. The second half of formula (3) is therefore added: by masking the query and rand parts in Doc respectively, the semantic similarities between the masked texts Doc-query and Doc-rand and the original training text Doc are calculated. Because the field to be masked rand is a random mask, the probability that it carries key semantic information is significantly lower than that of the keyword query, so the semantic similarity between Doc and Doc-rand (the text after masking the field to be masked) should be higher than that between Doc and Doc-query (the text after masking the corresponding keywords).
Through its former and latter terms, the preset loss function jointly considers the semantic information carried by the keyword query itself and the context information left after the query is masked, so it can well alleviate the problems that traditional methods may produce, such as a preference for long, complex phrases and the neglect of context information.
Further, the model parameters of the BERT model are iteratively updated by using a parameter optimization algorithm, so that model training is completed when a preset loss function reaches a convergence condition, optimal model parameters of the BERT model are determined, the BERT model containing the optimal model parameters is determined as a feature model, and model training is finished.
In one possible design, the parameter optimization algorithm includes, but is not limited to, any of the other optimization methods such as gradient descent method, gradient ascent method, quasi-newton method, etc., and the specific content of the parameter optimization algorithm is not limited in the embodiments of the present application.
For example, taking the triple (Doc_i, query_i, rand_i) = ("China Petroleum and Natural Gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station", "Chinese oil", "stock limited company") as an example, there are:

Doc_i − query_i = "[MASK] natural gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station";

Doc_i − rand_i = "China Petroleum and Natural Gas [MASK] Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station".
In model training: inputs = tokenizer(["China Petroleum and Natural Gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station", "Chinese oil", "[MASK] natural gas stock limited company Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station", "stock limited company", "China Petroleum and Natural Gas [MASK] Shaanxi Xi'an Yanta sales division Datang cottonrose garden filling station"], return_tensors="pt"); outputs = model(**inputs); features = outputs["pooler_output"]. Then the feature vector of the training text is H_D' = features[0], the feature vector of the keywords corresponding to the training text is H_q' = features[1], the feature vector of the training text after masking its corresponding keywords is H_(D'-q') = features[2], the feature vector of the field to be masked corresponding to the training text is H_r' = features[3], and the feature vector of the training text after masking the field to be masked is H_(D'-r') = features[4]. The feature vectors are fixed-length vectors, such as the following:

H_D' = [-0.0323, 0.1266, 0.0472, 0.1089, 0.0553, 0.0545, 0.0998, -0.0257, 0.1071, 0.1056];

H_q' = [-0.0829, -0.0046, 0.0912, 0.1340, 0.1663, 0.0394, 0.0550, 0.0440, 0.1163, 0.0548];

H_(D'-q') = [-0.0790, -0.0149, 0.0931, 0.1209, 0.1780, 0.0433, 0.0376, 0.0570, 0.1006, 0.0417];

H_r' = [-0.0674, 0.0036, 0.0906, 0.1176, 0.1925, 0.0371, 0.0524, 0.0338, 0.0999, 0.0352];

H_(D'-r') = [-0.0851, -0.0110, 0.1151, 0.1261, 0.1964, 0.0125, 0.0299, 0.0427, 0.0864, 0.0283].
In the training process, each of the feature vectors obtained above is substituted into formula (3) to calculate the model loss. Letting β = 0.5, formula (3) becomes formula (4) as shown below:

Loss = 0.5·(H_D'·H_q'/(|H_D'|·|H_q'|) − H_D'·H_r'/(|H_D'|·|H_r'|)) + 0.5·(H_D'·H_(D'-r')/(|H_D'|·|H_(D'-r')|) − H_D'·H_(D'-q')/(|H_D'|·|H_(D'-q')|))   (4)
where the similarity is calculated by the sim function as sim(a, b) = a·b/(|a|·|b|).
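Under that sim definition, the β = 0.5 loss of formula (4) can be computed directly from the five feature vectors. A sketch over plain Python lists (the helper names and toy vectors are ours):

```python
import math

def sim(a, b):
    # sim(a, b) = a.b / (|a| * |b|), as defined in the text
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def loss_formula_4(h_d, h_q, h_r, h_dq, h_dr):
    # Formula (4) with beta = 0.5:
    # 0.5*(sim(D, q) - sim(D, r)) + 0.5*(sim(D, D-r) - sim(D, D-q))
    return (0.5 * (sim(h_d, h_q) - sim(h_d, h_r))
            + 0.5 * (sim(h_d, h_dr) - sim(h_d, h_dq)))

# Extreme toy case: the keyword and the rand-masked text align with the
# document, while rand and the keyword-masked text are orthogonal to it,
# giving the maximum value 1.0:
print(loss_formula_4([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]))
```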
And (3) iteratively updating model parameters of the pre-training model by using a parameter optimization algorithm based on the model Loss obtained in the formula (4) to obtain optimal model parameters, and completing training of the pre-training model to obtain a feature model.
The data processing method provided by the embodiment of the application obtains the feature model by training the pre-training model. On one hand, the degree of semantic similarity between the training text Doc and its corresponding keyword query, and the degree of semantic dissimilarity between Doc and the masked text Doc-query, are jointly computed; this emphasizes the semantic importance of the keyword query within Doc, so that query can replace Doc in subsequent information retrieval operations without losing too much semantic information. On the other hand, the degree of semantic similarity between Doc and the positive sample query, and the degree of semantic dissimilarity between Doc and the negative sample, namely the field to be masked rand, are jointly computed; following the optimization idea of Max Margin Loss, the semantic distance between query and rand is pushed far enough apart, further emphasizing the semantic importance of query within Doc. Because the semantic similarity is a floating point number in the [0, 1] interval, maximizing the semantic similarity gap between query and rand approximately achieves the effect of maximizing the semantic similarity between Doc and query while maximizing the semantic dissimilarity between Doc and Doc-query. The two aspects complement each other, jointly considering keyword length and context information, so that the problems that may arise with existing keyword extraction methods can be alleviated.
In some embodiments, the technical means of extracting the target keywords of the text to be processed through the trained feature model and the semantic similarity algorithm may, as described in the above embodiments, be integrated as middleware into the information retrieval of a service line, thereby improving the retrieval effect and the use experience; it may also be applied independently in service scenarios of keyword extraction, which is not limited in the embodiments of the present application.
Fig. 4 is a schematic structural diagram of a data processing device according to an embodiment of the present application, where the data processing device may be applied to an information retrieval platform. As shown in fig. 4, a data processing apparatus 400 provided in an embodiment of the present application includes:
the acquiring and generating module 401 is configured to acquire a text to be processed in response to a search instruction, acquire each candidate keyword of the text to be processed, and generate a candidate keyword set of the text to be processed according to each candidate keyword;
the first processing module 402 is configured to determine, for each candidate keyword, a semantic similarity score corresponding to each candidate keyword through a feature model and a semantic similarity algorithm, where the feature model is obtained by performing semi-supervised training on a pre-training model;
the second processing module 403 is configured to obtain a maximum value from the semantic similarity scores corresponding to each candidate keyword, and determine the candidate keyword corresponding to the maximum value as a target keyword, so as to obtain a search result by performing information search according to the target keyword.
In one possible design, the acquiring and generating module 401 is specifically configured to:
and acquiring a first candidate keyword of the text to be processed by using an unsupervised keyword extraction tool, and acquiring a second candidate keyword of the text to be processed by using a random masking strategy, wherein each candidate keyword comprises the first candidate keyword and the second candidate keyword.
In one possible design, the first processing module 402 is specifically configured to:
obtaining a plurality of feature vectors corresponding to the current candidate keywords through a feature model;
acquiring semantic similarity scores corresponding to the current candidate keywords through a semantic similarity algorithm according to the plurality of feature vectors;
the plurality of feature vectors comprise feature vectors of the text to be processed, feature vectors of the current candidate keywords and feature vectors of the mask text corresponding to the current candidate keywords.
Fig. 5 is a schematic structural diagram of another data processing apparatus according to an embodiment of the present application, and as shown in fig. 5, a data processing apparatus 400 according to an embodiment of the present application further includes:
the sample acquiring module 404 is configured to acquire a data set to be trained, where the data set to be trained includes a plurality of training texts, keywords corresponding to each training text, and fields to be masked corresponding to each training text;
The model training module 405 is configured to train the pre-training model according to a preset loss function and a parameter optimization algorithm by using a to-be-trained data set, and determine the pre-training model after training as a feature model;
the method comprises the steps that a preset loss function is obtained according to a plurality of semantic similarity scores of each training text, and the plurality of semantic similarity scores of each training text are used for representing text information carried by keywords corresponding to each training text and context information of each training text after masking the corresponding keywords.
In one possible design, the sample acquisition module 404 is specifically configured to:
acquiring an unlabeled data set, wherein the unlabeled data set comprises a plurality of training texts;
manually marking the keywords of each training text to obtain manually marked keywords, and automatically marking each training text by using an unsupervised keyword extraction tool to obtain automatically marked keywords;
generating a labeling data set according to the manual labeling keywords and the automatic labeling keywords corresponding to each training text, wherein the labeling data set comprises each training text and the keywords corresponding to each training text;
aiming at each training text, acquiring a field to be masked corresponding to each training text through a random selection tool;
And generating a data set to be trained according to the marking data set and the field to be masked corresponding to each training text.
In one possible design, the unsupervised keyword extraction tool includes any one of a phrase generation model, a keyword screening model, and a phrase extraction model, and the random selection tool includes a word segmentation tool.
In one possible design, the pre-training model includes any one of a BERT model, a GPT model, a RoBERTa model and an XLNet model.
In one possible design, the acquisition and generation module 401 is further configured to:
receiving service information sent by a service end, and generating a search instruction according to the service information;
and acquiring a characteristic field of the service information according to the retrieval instruction to obtain a text to be processed according to the characteristic field, wherein the characteristic field is used for representing the service to be handled corresponding to the service information.
In one possible design, the to-do business includes any of merchant information verification, office information retrieval, and database retrieval.
The data processing device provided in the embodiment of the present application may execute corresponding steps of the data processing method in the embodiment of the method, and its implementation principle and technical effects are similar, and are not described herein again.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 6, the electronic device 500 may include: a processor 501, and a memory 502 communicatively coupled to the processor 501.
A memory 502 for storing a program. In particular, the program may include program code including computer-executable instructions.
The memory 502 may comprise high-speed RAM memory, and may further comprise non-volatile memory, such as at least one disk memory.
The processor 501 is configured to execute computer-executable instructions stored in the memory 502 to implement the data processing method described above.
The processor 501 may be a central processing unit (Central Processing Unit, abbreviated as CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
Alternatively, the memory 502 may be separate or integrated with the processor 501. When the memory 502 is a device separate from the processor 501, the electronic device 500 may further include:
a bus 503 for connecting the processor 501 and the memory 502. The bus may be an industry standard architecture (industry standard architecture, abbreviated ISA) bus, a peripheral component interconnect (peripheral component interconnect, PCI) bus, or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be divided into an address bus, a data bus, a control bus, etc.; representing it as a single line does not mean there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 502 and the processor 501 are integrated on a chip, the memory 502 and the processor 501 may complete communication through an internal interface.
The present application also provides a computer-readable storage medium, which may include: a U-disk, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, etc., in which program codes may be stored. Specifically, the computer-readable storage medium has stored therein computer-executable instructions for use in the methods in the above-described embodiments.
The present application also provides a computer program product comprising computer-executable instructions which, when executed by a processor, implement the method of the above embodiments.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the present application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (13)

1. A data processing method, applied to an information retrieval platform, comprising:
responding to a retrieval instruction to obtain a text to be processed, obtaining each candidate keyword of the text to be processed, and generating a candidate keyword set of the text to be processed according to each candidate keyword;
for each candidate keyword, determining a semantic similarity score corresponding to the candidate keyword through a feature model and a semantic similarity algorithm, wherein the feature model is obtained by performing semi-supervised training on a pre-training model;
and obtaining a maximum value from the semantic similarity scores corresponding to the candidate keywords, and determining the candidate keyword corresponding to the maximum value as a target keyword, so as to perform information retrieval according to the target keyword to obtain a retrieval result.
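The selection step of claim 1 can be sketched in Python as follows. This is an illustrative sketch only: `extract_candidates` and `semantic_score` are hypothetical stand-ins for the unsupervised extraction tool and the feature-model scorer recited in the claims.

```python
# A minimal sketch of the claim-1 pipeline: generate the candidate keyword
# set, score each candidate, and keep the candidate with the maximum
# semantic similarity score as the target keyword.
# `extract_candidates` and `semantic_score` are hypothetical stand-ins.

def select_target_keyword(text, extract_candidates, semantic_score):
    """Return the candidate keyword with the highest semantic similarity score."""
    candidates = set(extract_candidates(text))  # the candidate keyword set
    if not candidates:
        return None
    return max(candidates, key=lambda kw: semantic_score(text, kw))
```

In use, `semantic_score` would wrap the feature model and similarity algorithm of claim 3; here any callable returning a number suffices.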
2. The data processing method according to claim 1, wherein the obtaining each candidate keyword of the text to be processed comprises:
obtaining a first candidate keyword of the text to be processed by using an unsupervised keyword extraction tool, and obtaining a second candidate keyword of the text to be processed by using a random mask strategy, wherein each candidate keyword comprises the first candidate keyword and the second candidate keyword.
3. The data processing method according to claim 2, wherein the determining the semantic similarity score corresponding to each candidate keyword through the feature model and the semantic similarity algorithm comprises:
obtaining a plurality of feature vectors corresponding to the current candidate keyword through the feature model;
obtaining the semantic similarity score corresponding to the current candidate keyword through the semantic similarity algorithm according to the plurality of feature vectors;
wherein the plurality of feature vectors comprise a feature vector of the text to be processed, a feature vector of the current candidate keyword, and a feature vector of a mask text corresponding to the current candidate keyword.
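One plausible way to combine the three feature vectors of claim 3 into a single score is sketched below. The equal weighting of the two terms is an assumption made for illustration; the claims do not fix the combination rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def semantic_similarity_score(v_text, v_keyword, v_masked):
    # Intuition: a good keyword is close to the full text (first term), and
    # masking it out should move the text's representation away from the
    # original (second term). Equal weighting is an illustrative assumption.
    return cosine(v_text, v_keyword) + (1.0 - cosine(v_text, v_masked))
```

Here `v_text`, `v_keyword`, and `v_masked` correspond to the three feature vectors produced by the feature model for the text to be processed, the current candidate keyword, and the mask text.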
4. The data processing method according to claim 3, wherein performing semi-supervised training on the pre-training model to obtain the feature model comprises:
acquiring a data set to be trained, wherein the data set to be trained comprises a plurality of training texts, keywords corresponding to each training text and a field to be masked corresponding to each training text;
training the pre-training model with the data set to be trained according to a preset loss function and a parameter optimization algorithm, and determining the pre-training model after training is completed as the feature model;
wherein the preset loss function is obtained according to a plurality of semantic similarity scores of each training text, and the plurality of semantic similarity scores of each training text are used to characterize the text information carried by the keywords corresponding to the training text and the context information of the training text after its corresponding keywords are masked.
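The claims only state that the loss is built from the per-text semantic similarity scores; a hinge-style ranking loss is one common concrete choice and is sketched here purely as an assumption.

```python
def pairwise_margin_loss(pos_scores, neg_scores, margin=0.5):
    """Hinge-style ranking loss over one training batch.

    `pos_scores` are semantic similarity scores of the annotated keywords,
    `neg_scores` those of the randomly masked fields; the loss pushes each
    annotated keyword to outscore its masked counterpart by `margin`.
    The hinge form and margin value are illustrative assumptions.
    """
    return sum(max(0.0, margin - p + n)
               for p, n in zip(pos_scores, neg_scores)) / len(pos_scores)
```

A parameter optimization algorithm (e.g. gradient descent in a real framework) would then minimize this quantity over the data set to be trained.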
5. The data processing method according to claim 4, wherein the acquiring the data set to be trained comprises:
acquiring an unlabeled data set, wherein the unlabeled data set comprises the plurality of training texts;
manually labeling the keywords of each training text to obtain manually labeled keywords, and automatically labeling each training text by using the unsupervised keyword extraction tool to obtain automatically labeled keywords;
generating a labeled data set according to the manually labeled keywords and the automatically labeled keywords corresponding to each training text, wherein the labeled data set comprises each training text and the keywords corresponding to each training text;
for each training text, obtaining the field to be masked corresponding to the training text through a random selection tool;
and generating the data set to be trained according to the labeled data set and the field to be masked corresponding to each training text.
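The assembly of one training record described in claim 5 can be sketched as follows. Whitespace tokenisation stands in for the real word-segmentation tool, and all names are illustrative assumptions.

```python
import random

def build_training_record(text, manual_keywords, auto_extract, rng=None):
    """Assemble one record of the data set to be trained (per claim 5).

    Merges manually labeled and automatically extracted keywords, then picks
    a random field to mask. Whitespace tokenisation stands in for a real
    word-segmentation tool; all names here are illustrative assumptions.
    """
    rng = rng or random.Random()
    # Merge the two keyword sources, dropping duplicates but keeping order.
    keywords = list(dict.fromkeys(list(manual_keywords) + list(auto_extract(text))))
    tokens = text.split()
    field_to_mask = rng.choice(tokens) if tokens else ""
    return {"text": text, "keywords": keywords, "mask_field": field_to_mask}
```

Repeating this over the unlabeled data set yields the labeled texts, their keywords, and a field to mask per text, i.e. the data set to be trained.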
6. The data processing method of claim 5, wherein the unsupervised keyword extraction tool comprises any one of a phrase generation model, a keyword screening model, and a phrase extraction model, and the random selection tool comprises a word segmentation tool.
7. The data processing method according to any one of claims 3 to 6, wherein the pre-training model comprises any one of a BERT model, a GPT model, a RoBERTa model, and an XLNet model.
8. The data processing method according to claim 7, wherein the obtaining the text to be processed in response to the retrieval instruction comprises:
receiving service information sent by a service end, and generating the retrieval instruction according to the service information;
and obtaining a characteristic field of the service information according to the retrieval instruction, so as to obtain the text to be processed according to the characteristic field, wherein the characteristic field is used to characterize the service to be handled corresponding to the service information.
9. The data processing method according to claim 8, wherein the business to be handled includes any one of merchant information verification, office information retrieval, and database retrieval.
10. A data processing apparatus for use with an information retrieval platform, comprising:
an acquisition and generation module, configured to obtain a text to be processed in response to a retrieval instruction, obtain each candidate keyword of the text to be processed, and generate a candidate keyword set of the text to be processed according to each candidate keyword;
a first processing module, configured to determine, for each candidate keyword, a semantic similarity score corresponding to the candidate keyword through a feature model and a semantic similarity algorithm, wherein the feature model is obtained by performing semi-supervised training on a pre-training model;
and a second processing module, configured to obtain a maximum value from the semantic similarity scores corresponding to the candidate keywords, determine the candidate keyword corresponding to the maximum value as a target keyword, and perform information retrieval according to the target keyword to obtain a retrieval result.
11. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored in the memory to implement the data processing method of any one of claims 1 to 9.
12. A computer-readable storage medium, in which computer-executable instructions are stored, which computer-executable instructions, when executed by a processor, are for implementing a data processing method according to any one of claims 1 to 9.
13. A computer program product comprising computer-executable instructions for implementing the data processing method of any one of claims 1 to 9 when executed by a processor.
CN202311814601.1A 2023-12-26 2023-12-26 Data processing method, device, equipment and storage medium Pending CN117708352A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311814601.1A CN117708352A (en) 2023-12-26 2023-12-26 Data processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN117708352A true CN117708352A (en) 2024-03-15

Family

ID=90155218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311814601.1A Pending CN117708352A (en) 2023-12-26 2023-12-26 Data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117708352A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination