CN115858886B - Data processing method, device, equipment and readable storage medium - Google Patents

Publication number
CN115858886B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211606585.2A
Other languages
Chinese (zh)
Other versions
CN115858886A (en
Inventor
张云燕
吴贤
赖炜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202211606585.2A
Publication of CN115858886A
Application granted
Publication of CN115858886B

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method, a device, equipment and a readable storage medium, wherein the method can be applied to various scenes such as artificial intelligence, medical treatment and the like, and comprises the following steps: respectively configuring a labeling service positive sample set and a labeling service negative sample set for M to-be-processed service query samples in an initial service query sample set according to the service category rough prediction information; according to M to-be-processed service query samples, M labeling service positive sample sets and M labeling service negative sample sets, N query sample triples and positive weight parameter sets and negative weight parameter sets respectively corresponding to the N query sample triples are obtained; training the initial service guide model according to the N query sample triples, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set to obtain a target service guide model. By adopting the method and the device, the accuracy of service class prediction can be improved.

Description

Data processing method, device, equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a data processing method, apparatus, device, and readable storage medium.
Background
With advances in machine learning and growing public demand for medical services, intelligent service guidance systems have attracted increasing attention from large enterprises. Such a system recommends a service category to a user according to the service need the user describes, which gives users more timely, efficient, and accurate service, reduces the workload of enterprise staff, and improves the overall operating efficiency of the enterprise.
An intelligent service guidance system is usually implemented with a classification model, and training a classification model requires a large amount of data. As a result, the service categories that such a system can guide users to are usually limited to the common standard service categories that exist in large numbers across most enterprises. In a medical scenario, for example, the departments recommended by the intelligent service guidance system used by a hospital (for example, an intelligent triage system) are usually limited to standard departments such as "internal medicine," "surgery," "orthopedics," "pediatrics," "respiratory medicine," and "digestive medicine," which are conventional departments found in most hospitals. Beyond these common standard service categories, however, different enterprises in different service scenarios also have a large number of special service categories, each of which usually focuses on a specific type of service. Special service categories differ greatly from one enterprise to another and are numerous, and query data related to them makes up only a small proportion of the collected query data. Obtaining training data therefore requires manually labeling massive amounts of query data, which is inefficient and consumes a great deal of time and money; the required amount of training data is hard to obtain, and the resulting model struggles to reach the accuracy required for actual online deployment.
Disclosure of Invention
The embodiment of the application provides a data processing method, a device, equipment and a readable storage medium, which can improve the accuracy of service class prediction.
In one aspect, an embodiment of the present application provides a data processing method, including:
acquiring service category rough prediction information corresponding to an initial service query sample set, and respectively configuring a labeling service positive sample set and a labeling service negative sample set for M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information; M is a positive integer;
combining and pairing the M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets to obtain N query sample triplets, and determining positive weight parameter sets and negative weight parameter sets respectively corresponding to the N query sample triplets according to the N query sample triplets, the M labeling service positive sample sets and the M labeling service negative sample sets; each query sample triplet comprises a service query sample belonging to the M to-be-processed service query samples, a labeling service positive sample belonging to the M labeling service positive sample sets and a labeling service negative sample belonging to the M labeling service negative sample sets; N is a positive integer greater than or equal to M;
Training an initial service guide model according to the N query sample triples, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set to obtain a target service guide model; the target service guide model is used for predicting service class labels corresponding to the service inquiry information; each positive weight parameter is used for controlling the training influence of the similarity between a service inquiry sample and a labeling service positive sample on the initial service guide model; each negative weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service negative sample on the initial service guide model.
An aspect of an embodiment of the present application provides a data processing apparatus, including:
the first acquisition module is used for acquiring the business category rough prediction information corresponding to the initial business inquiry sample set;
the labeling screening module is used for respectively configuring a labeling service positive sample set and a labeling service negative sample set for M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information; M is a positive integer;
the sample processing module is used for combining and pairing the M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets to obtain N query sample triplets, and determining positive weight parameter sets and negative weight parameter sets respectively corresponding to the N query sample triplets according to the N query sample triplets, the M labeling service positive sample sets and the M labeling service negative sample sets; each query sample triplet comprises a service query sample belonging to the M to-be-processed service query samples, a labeling service positive sample belonging to the M labeling service positive sample sets and a labeling service negative sample belonging to the M labeling service negative sample sets; N is a positive integer greater than or equal to M;
The first training module is used for training the initial service guide model according to the N query sample triples, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set to obtain a target service guide model; the target service guide model is used for predicting service class labels corresponding to the service inquiry information; each positive weight parameter is used for controlling the training influence of the similarity between a service inquiry sample and a labeling service positive sample on the initial service guide model; each negative weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service negative sample on the initial service guide model.
In one aspect, a computer device is provided, including: a processor, a memory, a network interface;
the processor is connected to the memory and the network interface, where the network interface is used to provide a data communication network element, the memory is used to store a computer program, and the processor is used to call the computer program to execute the method in the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer readable storage medium having a computer program stored therein, the computer program being adapted to be loaded by a processor and to perform a method according to embodiments of the present application.
In one aspect, the embodiments of the present application provide a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, where the computer instructions are stored in a computer readable storage medium, and where a processor of a computer device reads the computer instructions from the computer readable storage medium, and where the processor executes the computer instructions, so that the computer device performs a method in an embodiment of the present application.
In the embodiment of the present application, service category rough prediction information corresponding to an initial service query sample set may be acquired first, and label screening may be performed on the initial service query sample set according to that information to obtain M to-be-processed service query samples belonging to the initial service query sample set, each of which corresponds to its own labeling service positive sample set and labeling service negative sample set. The M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets are then combined and paired to obtain N query sample triplets, and the positive weight parameter set and negative weight parameter set corresponding to each of the N query sample triplets are determined according to the N query sample triplets, the M labeling service positive sample sets and the M labeling service negative sample sets. Here M is a positive integer, N is a positive integer greater than or equal to M, and each query sample triplet comprises a service query sample belonging to the M to-be-processed service query samples, a labeling service positive sample belonging to the M labeling service positive sample sets and a labeling service negative sample belonging to the M labeling service negative sample sets. Finally, the initial service guide model is trained according to the N query sample triplets, the N positive weight parameters in each positive weight parameter set and the N negative weight parameters in each negative weight parameter set to obtain a target service guide model, which can be used to predict the service category label corresponding to service query information.
With the method provided by the embodiment of the present application, a labeling service positive sample set and a labeling service negative sample set can be configured for each of the M to-be-processed service query samples in the initial service query sample set by means of the service category rough prediction information, which improves the efficiency of acquiring training data and greatly reduces its cost and time. In addition, N query sample triplets for training the initial service guide model can be constructed from the labeling service positive sample sets and labeling service negative sample sets configured for the M to-be-processed service query samples, which expands the training data. Positive weight parameters and negative weight parameters are introduced when the initial service guide model is trained on the N query sample triplets, which reduces the training influence that similar samples in different query sample triplets have on the initial service guide model, so the accuracy of service category prediction can be further improved.
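As a rough illustration of how positive and negative weight parameters can control the training influence of similarities, the following Python sketch computes a weighted contrastive loss for a single query sample. The loss form (an InfoNCE-style ratio), the function name weighted_contrastive_loss, and the example similarity values are all assumptions made for illustration; this section does not state the loss actually used.

```python
import math
from typing import Sequence


def weighted_contrastive_loss(
    pos_sims: Sequence[float],
    neg_sims: Sequence[float],
    pos_weights: Sequence[float],
    neg_weights: Sequence[float],
) -> float:
    """Loss for one query against N candidate positives and N candidate negatives.

    Assumed form (not taken from the patent text):
        -log( sum_j w+_j * exp(s+_j) / (sum_j w+_j * exp(s+_j) + sum_j w-_j * exp(s-_j)) )
    A weight of 0 removes that similarity from the loss entirely.
    """
    pos_term = sum(w * math.exp(s) for w, s in zip(pos_weights, pos_sims))
    neg_term = sum(w * math.exp(s) for w, s in zip(neg_weights, neg_sims))
    if pos_term <= 0.0:
        raise ValueError("at least one positive sample needs a positive weight")
    return -math.log(pos_term / (pos_term + neg_term))


# Hypothetical similarities between one query and the in-batch samples.
full = weighted_contrastive_loss([0.9], [0.1, 0.8], [1.0], [1.0, 1.0])
# Down-weighting the second negative (for example, because it resembles a positive
# of this query) removes its influence on training.
masked = weighted_contrastive_loss([0.9], [0.1, 0.8], [1.0], [1.0, 0.0])
```

In this sketch, setting a negative weight to 0 gives the same loss as if that negative sample had never been in the batch, which is one simple way to keep a near-duplicate sample from pushing the model in the wrong direction.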
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic view of a model training scenario provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 5 is a training schematic diagram of a semi-supervised contrastive learning sickness-specific matching model provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 7a is a schematic diagram of a process flow of a standardized module for a department of disease provided in an embodiment of the present application;
FIG. 7b is a schematic diagram of a treatment process of a diagnosis-specific module according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
Artificial Intelligence (AI) is the theory, methods, technology, and application systems that use a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines can perceive, reason, and make decisions.
Artificial intelligence is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to keep improving its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied throughout the various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Deep Learning (DL) learns the inherent patterns and representation hierarchies of sample data, and the information obtained during this learning greatly helps in interpreting data such as text, images, and sound. Its ultimate goal is to give machines the ability to analyze and learn as people do, so that they can recognize text, images, and sound. Deep learning is a complex class of machine learning algorithms whose results in speech and image recognition far exceed those of earlier related techniques.
The scheme provided by the embodiment of the application relates to artificial intelligence natural language processing technology, machine learning, deep learning and other technologies, and is specifically described by the following embodiment.
Specifically, referring to fig. 1, fig. 1 is a schematic diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 100 and a terminal device cluster. The terminal device cluster may include one or more terminal devices; the number of terminal devices in the cluster is not limited here. As shown in fig. 1, the terminal devices may specifically include a terminal device 10a, a terminal device 10b, a terminal device 10c, …, and a terminal device 10n. Each of the terminal devices 10a, 10b, 10c, …, 10n may be connected to the server 100 directly or indirectly through wired or wireless communication, so that each terminal device can exchange data with the server 100 over the network connection.
Wherein each terminal device in the terminal device cluster may include: smart phones, tablet computers, notebook computers, desktop computers, intelligent voice interaction devices, intelligent home appliances (e.g., smart televisions), wearable devices, vehicle terminals, aircraft and other intelligent terminals with data processing functions. It should be understood that each terminal device in the terminal device cluster shown in fig. 1 may be installed with an application client having a multimedia data processing function, and when the application client runs in each terminal device, data interaction may be performed between each terminal device and the server 100 shown in fig. 1. The application client may specifically include: vehicle clients, smart home clients, entertainment clients (e.g., game clients), multimedia clients (e.g., video clients), social clients, and information-based clients (e.g., news clients), etc. The application client in the embodiment of the present application may be integrated in a certain client (for example, a social client), and the application client may also be an independent client (for example, a news client), which is not limited by the type of the application client in the embodiment of the present application.
For ease of understanding, the embodiment of the present application may select one terminal device from the plurality of terminal devices shown in fig. 1 as the target terminal device. For example, the embodiment of the present application may use the terminal device 10a shown in fig. 1 as a target terminal device, and an application client having a multimedia data processing function may be installed in the target terminal device. At this time, the target terminal device may implement data interaction between the application client and the server 100.
The server 100 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
It is to be appreciated that embodiments of the present application may be applied to a variety of scenarios including, but not limited to, cloud technology, artificial intelligence, intelligent transportation, and assisted driving. For example, a blockchain node system may perform consensus on driving behavior data, road track data, and the like sent by a vehicle-mounted terminal, and store the consensus on the vehicle-mounted terminal after the consensus passes.
It will be appreciated that in the specific embodiments of the present application, related data such as business query samples are referred to, and when the following embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with related laws and regulations and standards of related countries and regions.
It is to be understood that the network architecture described above is applicable to the field of service category guidance. Applicable service scenarios include, but are not limited to, medical scenarios, government scenarios, and educational scenarios; they are not listed exhaustively here. Service categories generally result from classifying services according to how they differ within a given service scenario, and services of the same category are usually handed to dedicated service personnel, which improves the efficiency of completing them. Furthermore, different enterprises or organizations in the same service scenario may provide different service categories, and they may also use different names for the same service category. For example, enterprise B and enterprise C in service scenario A may both handle services of service category D, while enterprise C may also handle services of service category E; in addition, both enterprise B and enterprise C may handle services of service category F, but enterprise B may call this category "business class name 1" while enterprise C calls it "business class name 2". Therefore, when a non-business person wants to handle a certain service, that person often has to spend a lot of time and effort determining the corresponding service category, both because of a lack of service-related knowledge or experience and because different enterprises may name the same service category differently.
Therefore, when a non-business person needs to transact a certain business, the real business category information corresponding to the target business query information can be determined according to the target business query information of the non-business person aiming at the business, and corresponding business category guiding information can be generated.
For example, in a medical scenario: hospitals are usually divided into a plurality of departments, different departments are responsible for treating different types of diseases, and the departments of different hospitals often differ greatly. When a patient feels unwell, the patient, lacking experience and knowledge about the disease, usually does not know which department to register with, and at this time, the patient's own description of the symptoms (such as "bad sleep, what department to do with sleep examination.
For another example, in an educational scenario: training institutions usually offer a plurality of training programs, different training programs train different skills, and the training programs offered by different training institutions often differ greatly. When a learner wants to improve a certain skill or broaden a certain type of knowledge, the learner may not know which training program to enroll in because of unfamiliarity with the programs. At this time, the learner may upload a learning requirement (such as "improving my spoken English") to the online platform of the training institution. The online platform may recognize the learning requirement and determine the corresponding training program; assuming that the training program is a 30-day spoken language class, the online platform may then generate training program recommendation information to recommend that program to the learner.
It can be understood that the embodiment of the present application provides a data processing method for training a target service guide model. With the trained target service guide model, the target service category label corresponding to target service query information in a given service scenario can be predicted automatically and quickly, the real service category information corresponding to the target service query information can be determined from that label, and the corresponding service category guidance information can then be generated. The target service query information is the query text input by the object, such as the symptom description or learning requirement above. The target service category label may be a number or other character information that serves as an identifier, and is used to mark real service category information; for example, if the target service category label is the value 1, the corresponding real service category information is "sleep clinic". The service category guidance information guides the object to select the corresponding real service category, for example: "The department you can register with is: sleep clinic." It will be appreciated that the target service guide model can only predict the different service categories within one service scenario, so when the target service guide model is trained as described below, the query samples used in training should all come from the same service scenario.
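The step from a predicted label to guidance text can be sketched in a few lines of Python. Only the pair 1 → "sleep clinic" comes from the text above; the other table entry, the fallback message, and the names LABEL_TO_CATEGORY and build_guidance are hypothetical.

```python
# Hypothetical label table; only the entry 1 -> "sleep clinic" comes from the text.
LABEL_TO_CATEGORY = {
    0: "weight management clinic",
    1: "sleep clinic",
}


def build_guidance(label: int) -> str:
    """Turn a predicted service category label into guidance text for the object."""
    category = LABEL_TO_CATEGORY.get(label)
    if category is None:
        # Fallback wording is an assumption; the text does not cover unknown labels.
        return "No matching department was found."
    return f"The department you can register with is: {category}."


message = build_guidance(1)
```

Keeping the table separate from the model reflects the point made above that the label is only an identifier: each enterprise could, in principle, map the same label to its own name for the category.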
Specifically, the embodiment of the present application may acquire an initial service query sample set for model training. The initial service query sample set may include L unlabeled service query samples, where L is a positive integer and an unlabeled service query sample is a service query sample that has not been labeled with its corresponding service category. Service category rough prediction information corresponding to the initial service query sample set is then obtained. This information may include the sample service category rough prediction information corresponding to each of the L unlabeled service query samples; each piece is obtained by performing service category rough prediction on the corresponding unlabeled service query sample, and represents the sample service categories that the sample may correspond to together with their sample service category probabilities. Next, a labeling service positive sample set and a labeling service negative sample set may be configured for each of M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information, where M is a positive integer and the M to-be-processed service query samples belong to the L unlabeled service query samples. A labeling service positive sample set includes labeling service positive samples, each of which represents a service category that may match the to-be-processed service query sample; a labeling service negative sample set includes labeling service negative samples, each of which represents a service category that does not match the to-be-processed service query sample.
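A minimal Python sketch of this configuration step follows. The paragraph above does not say how positive and negative samples are chosen from the rough prediction information, so the probability cut-off POSITIVE_THRESHOLD, the rule of dropping queries that lack a positive or a negative, and all names and data here are assumptions for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical cut-off; the text above does not state a selection rule.
POSITIVE_THRESHOLD = 0.5

CoarsePrediction = Dict[str, float]  # candidate category -> rough probability


def configure_sample_sets(
    coarse_info: Dict[str, CoarsePrediction],
    threshold: float = POSITIVE_THRESHOLD,
) -> Dict[str, Tuple[List[str], List[str]]]:
    """Map each kept query to (positive categories, negative categories).

    Queries with no positive or no negative candidate are dropped, which is
    one way the M to-be-processed samples can be fewer than the L inputs.
    """
    configured = {}
    for query, prediction in coarse_info.items():
        positives = sorted(c for c, p in prediction.items() if p >= threshold)
        negatives = sorted(c for c, p in prediction.items() if p < threshold)
        if positives and negatives:
            configured[query] = (positives, negatives)
    return configured


# L = 2 unlabeled queries with made-up rough predictions.
coarse_info = {
    "query A": {"category 1": 0.8, "category 2": 0.6, "category 3": 0.1},
    "query B": {"category 1": 0.3, "category 2": 0.2, "category 3": 0.2},
}
configured = configure_sample_sets(coarse_info)  # M = 1: only "query A" is kept
```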
Then, data preprocessing is performed on the M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets to obtain N query sample triplets, together with the positive weight parameter set and negative weight parameter set corresponding to each of the N query sample triplets. Each query sample triplet comprises a service query sample belonging to the M to-be-processed service query samples, a labeling service positive sample belonging to the M labeling service positive sample sets and a labeling service negative sample belonging to the M labeling service negative sample sets, and N is a positive integer greater than or equal to M. For ease of understanding, assume that the M to-be-processed service query samples include sample 1, that the labeling service positive sample set corresponding to sample 1 includes positive sample 1 and positive sample 2, and that the labeling service negative sample set corresponding to sample 1 includes negative sample 1; the query sample triplets that can be obtained are then [sample 1, positive sample 1, negative sample 1] and [sample 1, positive sample 2, negative sample 1]. Finally, the initial service guide model is trained according to the N query sample triplets, the N positive weight parameters in each positive weight parameter set and the N negative weight parameters in each negative weight parameter set to obtain the target service guide model. The target service guide model is used for predicting the service category label corresponding to service query information; each positive weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service positive sample on the initial service guide model; each negative weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service negative sample on the initial service guide model.
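The sample 1 example above can be reproduced with a short Python sketch. Pairing every positive sample with every negative sample of the same query (a Cartesian product) is an assumption that happens to match the example; the function name build_triplets and the input layout are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple


def build_triplets(
    configured: Dict[str, Tuple[List[str], List[str]]],
) -> List[Tuple[str, str, str]]:
    """Pair each query with every (positive, negative) combination from its own sets."""
    triplets = []
    for query, (positives, negatives) in configured.items():
        for positive, negative in product(positives, negatives):
            triplets.append((query, positive, negative))
    return triplets


# M = 1 to-be-processed sample with two positives and one negative, as in the text.
configured = {
    "sample 1": (["positive sample 1", "positive sample 2"], ["negative sample 1"]),
}
triplets = build_triplets(configured)
# triplets now holds N = 2 entries, so N >= M holds for this input.
```

Because every kept query contributes at least one triplet whenever both of its sets are non-empty, this pairing always yields N greater than or equal to M.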
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic view of a model training scenario provided in an embodiment of the present application. The server 20a shown in fig. 2 may be the server 200 in the embodiment corresponding to fig. 1, and the terminal device 20b shown in fig. 2 may be the target terminal device in the embodiment corresponding to fig. 1. For ease of understanding, the user corresponding to the target terminal device may be referred to as an object, where the terminal device 20b may be used to initiate the model training request, and the user corresponding to the terminal device 20b may be the object 20c.
As shown in fig. 2, the object 20c may initiate a model training request for a target service scenario to the server 20a through the terminal device 20b. For ease of understanding, a medical scenario is taken here as an example of the target service scenario. After receiving the model training request for the medical scenario, the server 20a may acquire an initial service query sample set 21 related to the medical scenario, where the initial service query sample set 21 may include an unlabeled service query sample 21a, an unlabeled service query sample 21b, …, and an unlabeled service query sample 21L. For example, the unlabeled service query sample 21a may be "early child development, obesity", and the unlabeled service query sample 21b may be "what to do about lumbar muscle strain". Then, the server 20a may perform service category rough prediction on each unlabeled service query sample to determine the sample service category rough prediction information corresponding to it, where each piece of sample service category rough prediction information may include the sample service categories and the sample service category probabilities corresponding to the unlabeled service query sample. As shown in fig. 2, the unlabeled service query sample 21a corresponds to sample service category rough prediction information 22a, which may be "weight management clinic, 0.56; height management promotion clinic, 0.38; growth development clinic, 0.37". Likewise, the unlabeled service query sample 21b corresponds to sample service category rough prediction information 22b, …, and the unlabeled service query sample 21L corresponds to sample service category rough prediction information 22L.
Then, the server 20a may screen the L unlabeled service query samples according to the service category rough prediction information (i.e., all the sample service category rough prediction information), that is, obtain M unlabeled service query samples from the L unlabeled service query samples as to-be-processed service query samples. For example, the server 20a may obtain a to-be-processed service query sample 23a, a to-be-processed service query sample 23b, …, and a to-be-processed service query sample 23M. Screening refers to removing unlabeled service query samples whose sample service category rough prediction information is inaccurate, or whose text content has little relevance to the service categories to be identified. After obtaining the M to-be-processed service query samples, the server 20a may configure a labeling service positive sample set and a labeling service negative sample set for each to-be-processed service query sample according to the sample service category rough prediction information corresponding to that sample. As shown in fig. 2, through the positive and negative sample configuration performed by the server 20a, the to-be-processed service query sample 23a corresponds to the labeling service positive sample set 241a and the labeling service negative sample set 242a, the to-be-processed service query sample 23b corresponds to the labeling service positive sample set 241b and the labeling service negative sample set 242b, …, and the to-be-processed service query sample 23M corresponds to the labeling service positive sample set 241M and the labeling service negative sample set 242M.
Configuration here refers to adjusting and classifying the sample service category rough prediction information corresponding to a to-be-processed service query sample, either according to a preset service category adjustment rule or by manual labeling. For example, the to-be-processed service query sample 23a may be the unlabeled service query sample 21a described above, in which case it corresponds to the sample service category rough prediction information 22a. It may then be determined that the labeling service positive sample set 241a corresponding to the to-be-processed service query sample 23a is {weight management clinic, growth development clinic}, and that the corresponding labeling service negative sample set 242a is {height management promotion clinic}. In other words, the service category corresponding to the to-be-processed service query sample 23a may be "weight management clinic" or "growth development clinic", so the server 20a may use each of them as a labeling service positive sample corresponding to the to-be-processed service query sample 23a; the service category corresponding to the to-be-processed service query sample 23a cannot be "height management promotion clinic", so the server 20a may use "height management promotion clinic" as a labeling service negative sample corresponding to the to-be-processed service query sample 23a.
Then, the server 20a may perform data preprocessing on the M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets to obtain N query sample triples. As shown in fig. 2, the N query sample triples may include a query sample triplet 25a, a query sample triplet 25b, …, and a query sample triplet 25N. Each query sample triplet includes one service query sample belonging to the M to-be-processed service query samples, one labeling service positive sample belonging to the M labeling service positive sample sets, and one labeling service negative sample belonging to the M labeling service negative sample sets, so N is a positive integer greater than or equal to M. For example, given that the labeling service positive sample set 241a corresponding to the to-be-processed service query sample 23a is {weight management clinic, growth development clinic} and the labeling service negative sample set 242a is {height management promotion clinic}, a query sample triplet 25a and a query sample triplet 25b may be constructed, where the query sample triplet 25a may be {to-be-processed service query sample 23a, weight management clinic, height management promotion clinic} and the query sample triplet 25b may be {to-be-processed service query sample 23a, growth development clinic, height management promotion clinic}. Meanwhile, during data preprocessing, the server 20a may determine the positive weight parameter set and the negative weight parameter set corresponding to each of the N query sample triples, and then train the initial service guide model according to the N query sample triples, the N positive weight parameters in each positive weight parameter set and the N negative weight parameters in each negative weight parameter set, to obtain the target service guide model. 
Each positive weight parameter is used for controlling the training influence, on the initial service guide model, of the similarity between a service query sample and a labeling service positive sample; each negative weight parameter is used for controlling the training influence, on the initial service guide model, of the similarity between a service query sample and a labeling service negative sample. As shown in fig. 2, the server 20a may ultimately obtain the target service guide model 26.
It can be understood that, after the server 20a obtains the target service guide model 26 for the target service scenario, the terminal device 20b (or another terminal device having a network connection with the server 20a) may, upon obtaining service query information in the target service scenario, send the service query information to the server 20a. The server 20a may predict the service category label corresponding to the service query information through the target service guide model 26, determine the service category information corresponding to the service query information according to the service category label, and return the service category information to the terminal device that sent the service query information. Taking the terminal device 20b as an example, after receiving the service category information it may generate corresponding service guide information to inform the object 20c of the service category corresponding to the service query information that the object entered.
Alternatively, it will be appreciated that the server 20a may also send the target service guide model 26 directly to a corresponding terminal device, e.g., the terminal device 20b. The terminal device 20b may store and run the target service guide model 26 to predict service categories for service query information locally, which avoids sending a request to the server 20a for each prediction.
Therefore, the labeling service positive sample set and the labeling service negative sample set can be quickly configured for the M to-be-processed service query samples in the initial service query sample set through the service category rough prediction information, which improves the efficiency of training data acquisition and greatly reduces its cost and time. In addition, N query sample triples for training the initial service guide model can be constructed based on the labeling service positive sample sets and labeling service negative sample sets respectively configured for the M to-be-processed service query samples, and positive weight parameters and negative weight parameters are introduced when the initial service guide model is trained based on the N query sample triples. This expands the training data while reducing the training influence of similar samples in different query sample triples on the initial service guide model, so that the accuracy of service category prediction can be further improved.
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the present application. The method may be performed by a server, or may be performed by a terminal device, or may be performed by a server and a terminal device together, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For ease of understanding, embodiments of the present application will be described in terms of this method being performed by a server. The data processing method may include the following steps S101 to S103:
Step S101, obtaining service category rough prediction information corresponding to an initial service query sample set, and respectively configuring a labeling service positive sample set and a labeling service negative sample set for M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information; M is a positive integer.
Specifically, the initial service query sample set may include L unlabeled service query samples, such as the unlabeled service query sample 21a shown in fig. 2 above; L is a positive integer greater than or equal to M. An unlabeled service query sample is a service query sample that has not been labeled, so its corresponding service category is unknown. Service categories may be divided into standard service categories and feature service categories. Standard service categories are recognized or conventional service categories, that is, service categories that most enterprises or organizations use by default. For example, in a medical scenario, departments such as "internal medicine", "surgery", "orthopedics", "pediatrics", "respiratory medicine" and "gastroenterology" are standardized departments: most hospitals have them, name them consistently, and make them responsible for approximately the same types and ranges of diseases. Feature service categories are unusual or highly specialized service categories. For example, in a medical scenario, a department that is not among the standardized departments may be called a feature department; it is usually a specialized department that a hospital sets up to treat a specific disease, such as an "obesity and metabolism clinic", a "diabetes specialty clinic" or a "diabetic retinopathy clinic". Feature departments vary widely from hospital to hospital and are of many kinds. It should be noted that, unless otherwise specified, the service category to be predicted in the following description generally refers to a feature service category, that is, it excludes the standard service categories. 
In addition, the division between the standard service class and the feature service class may be determined according to practical situations, which is not limited herein.
Specifically, a feasible implementation process for obtaining the business category rough prediction information corresponding to the initial business query sample set may be: obtaining a target service standardization model; and respectively carrying out service category rough prediction processing on the L unlabeled service query samples through a target service standardization model to obtain service category rough prediction information respectively corresponding to the L unlabeled service query samples, and taking the L sample service category rough prediction information as service category rough prediction information corresponding to the initial service query sample set.
The target service standardization model is obtained by training an initial service standardization model based on H service category coarse clustering sets, where H is a positive integer and each service category coarse clustering set includes one or more service category samples. It can be understood that the service categories corresponding to the service category samples in the same service category coarse clustering set are highly similar; in other words, the services they cover overlap heavily. For example, in a medical scenario, the service category coarse clustering set 1 may be {wound-making clinic, nursing clinic, ostomy clinic, wound-making nursing clinic, ostomy nursing clinic}. It can be understood that the departments in the service category coarse clustering set 1 are responsible for treating approximately the same range of diseases. Because different hospitals name the departments that treat a certain type of disease differently, one service category coarse clustering set often contains multiple service category samples that point to approximately the same service category. Further, it can be appreciated that the service category corresponding to each service category coarse clustering set is typically a feature service category.
Specifically, the service category rough prediction information may include the sample service category rough prediction information corresponding to each of the L unlabeled service query samples, for example, the sample service category rough prediction information 22a shown in fig. 2. In this case, one feasible implementation process of respectively configuring the labeling service positive sample set and the labeling service negative sample set for the M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information may be: traversing the sample service category rough prediction information respectively corresponding to the L unlabeled service query samples, and sequentially acquiring the sample service category rough prediction information corresponding to the i-th unlabeled service query sample as the i-th sample service category rough prediction information, where i is a positive integer less than or equal to L; if the i-th sample service category rough prediction information contains a sample service category probability greater than or equal to a service category probability threshold, adding the i-th unlabeled service query sample to a first prediction result sample set; if the i-th sample service category rough prediction information does not contain a sample service category probability greater than or equal to the service category probability threshold, adding the i-th unlabeled service query sample to a second prediction result sample set; when all L pieces of sample service category rough prediction information have been traversed, obtaining M unlabeled service query samples from the first prediction result sample set as to-be-processed service query samples, obtaining A unlabeled service query samples from the second prediction result sample set as A difficult negative samples, and adding the A difficult negative samples to a difficult negative sample set, where A is a positive integer and the proportional relation between A and M meets a preset proportional condition; and respectively configuring a labeling service positive sample set and a labeling service negative sample set for the M to-be-processed service query samples according to the difficult negative sample set and the sample service category rough prediction information respectively corresponding to the M to-be-processed service query samples.
To facilitate understanding of the above process, take the i-th unlabeled service query sample to be the unlabeled service query sample 21a shown in fig. 2, so that the i-th sample service category rough prediction information is the sample service category rough prediction information 22a shown in fig. 2, namely "weight management clinic, 0.56; height management promotion clinic, 0.38; growth development clinic, 0.37". The sample service category rough prediction information 22a thus includes three sample service categories and their corresponding sample service category probabilities. It can be understood that the i-th sample service category rough prediction information containing a sample service category probability greater than or equal to the service category probability threshold means that at least one sample service category probability in it is greater than or equal to the threshold. Assuming that the service category probability threshold is 0.5, only the sample service category probability corresponding to the weight management clinic is greater than 0.5 in the sample service category rough prediction information 22a, but this is sufficient for the server 20a to add the unlabeled service query sample 21a to the first prediction result sample set. If the service category probability threshold is instead 0.6, no probability in the sample service category rough prediction information 22a reaches the threshold, and the server 20a adds the unlabeled service query sample 21a to the second prediction result sample set.
It can be understood that after the traversal of the coarse prediction information of the L sample service categories, the unlabeled service query samples included in the obtained first prediction result sample set can be regarded as data with a prediction result, and the subsequent model training is performed by adopting the unlabeled service query samples included in the first prediction result sample set, so that the training effect can be greatly improved. In addition, the unlabeled business query sample included in the second prediction result sample set can be regarded as data without a prediction result, and can be used as a negative sample corresponding to training data although the unlabeled business query sample can not be used as training data, so that the model training effect is improved. Therefore, after the first prediction result sample set and the second prediction result sample set are obtained, M non-labeling service query samples can be obtained from the first prediction result sample set and used as service query samples to be processed, A non-labeling service query samples are obtained from the second prediction result sample set and used as A difficult negative samples and added to the difficult negative sample set. It will be appreciated that the ratio between M and A satisfies the preset ratio condition, and that M should be equal to 4*A assuming that the preset ratio condition is 8:2.
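For ease of understanding, the threshold-based screening and the proportional selection described above may be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function names `split_by_threshold` and `pick_samples`, the representation of rough prediction information as (category, probability) pairs, and the use of random selection are all introduced here for illustration.

```python
import random
from typing import Dict, List, Tuple

# Assumed representation: query text -> list of (sample category, probability).
CoarsePredictions = Dict[str, List[Tuple[str, float]]]


def split_by_threshold(
    predictions: CoarsePredictions, threshold: float
) -> Tuple[List[str], List[str]]:
    """Split queries into the first and second prediction result sample sets."""
    first, second = [], []
    for query, pairs in predictions.items():
        # A query goes to the first set if at least one probability
        # reaches the service category probability threshold.
        if any(prob >= threshold for _, prob in pairs):
            first.append(query)
        else:
            second.append(query)
    return first, second


def pick_samples(
    first: List[str],
    second: List[str],
    m: int,
    ratio: Tuple[int, int] = (8, 2),
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Pick M to-be-processed samples and A difficult negatives with M:A = ratio."""
    a = m * ratio[1] // ratio[0]  # with 8:2 this gives M = 4 * A
    rng = random.Random(seed)     # random choice is an assumption of this sketch
    pending = rng.sample(first, m)
    hard_negatives = rng.sample(second, a)
    return pending, hard_negatives
```

With the rough prediction information 22a and a threshold of 0.5, `split_by_threshold` places the query sample 21a in the first set; with a threshold of 0.6 it places it in the second set, matching the example above.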
Specifically, the M to-be-processed service query samples include a to-be-processed service query sample M_j, where j is a positive integer less than or equal to M. The sample service category rough prediction information corresponding to the to-be-processed service query sample M_j includes B sample service category rough prediction information pairs, where B is a positive integer and one sample service category rough prediction information pair includes one sample service category and one sample service category probability. One feasible implementation process of respectively configuring the labeling service positive sample set and the labeling service negative sample set for the M to-be-processed service query samples according to the difficult negative sample set and the sample service category rough prediction information respectively corresponding to the M to-be-processed service query samples may be: creating an initial positive sample set and an initial negative sample set corresponding to the to-be-processed service query sample M_j, both of which are empty sets; traversing the B sample service category rough prediction information pairs corresponding to the to-be-processed service query sample M_j, and sequentially acquiring the k-th sample service category and the k-th sample service category probability, where k is a positive integer less than or equal to B; if the k-th sample service category probability is greater than or equal to the service category probability threshold, performing sample matching on the k-th sample service category according to a positive and negative sample matching rule to obtain a sample matching result; if the sample matching result of the k-th sample service category is a positive sample result, taking the k-th sample service category as a labeling service positive sample corresponding to the to-be-processed service query sample M_j, and adding this labeling service positive sample to the initial positive sample set; if the sample matching result of the k-th sample service category is a negative sample result, taking the k-th sample service category as a labeling service negative sample corresponding to the to-be-processed service query sample M_j, and adding this labeling service negative sample to the initial negative sample set; if the B sample service category rough prediction information pairs have all been traversed and the initial negative sample set is still an empty set, obtaining a difficult negative sample from the difficult negative sample set as a labeling service negative sample corresponding to the to-be-processed service query sample M_j, and adding this labeling service negative sample to the initial negative sample set; determining the initial positive sample set to which the labeling service positive samples corresponding to the to-be-processed service query sample M_j have been added as the labeling service positive sample set corresponding to the to-be-processed service query sample M_j, and determining the initial negative sample set to which the labeling service negative samples corresponding to the to-be-processed service query sample M_j have been added as the labeling service negative sample set corresponding to the to-be-processed service query sample M_j. 
The positive and negative sample matching rule may be a matching rule set based on manual experience or knowledge, and is used for verifying the correctness of a sample service category. If the sample matching result of the k-th sample service category is a positive sample result, the prediction result is considered correct: the service category corresponding to the to-be-processed service query sample M_j may be the k-th sample service category, so the k-th sample service category may be used as a labeling service positive sample corresponding to the to-be-processed service query sample M_j. If the sample matching result of the k-th sample service category is a negative sample result, the prediction result is considered to deviate: the service category corresponding to the to-be-processed service query sample M_j cannot be the k-th sample service category, so the k-th sample service category may be used as a labeling service negative sample corresponding to the to-be-processed service query sample M_j. Optionally, to ensure the accuracy of positive and negative sample matching, manual labeling may be adopted, that is, a labeling person manually checks the accuracy of the sample service categories corresponding to the to-be-processed service query sample M_j.
To facilitate understanding of the above process, take the to-be-processed service query sample M_j to be the to-be-processed service query sample 23a shown in fig. 2, and take the sample service category rough prediction information corresponding to M_j to be the sample service category rough prediction information 22a shown in fig. 2. The sample service category rough prediction information 22a is "weight management clinic, 0.56; height management promotion clinic, 0.38; growth development clinic, 0.37", where "weight management clinic, 0.56" is one sample service category rough prediction information pair, so the sample service category rough prediction information 22a includes three such pairs. The server 20a may first create an initial positive sample set (1) and an initial negative sample set (2) corresponding to the to-be-processed service query sample 23a. Assuming that the service category probability threshold is 0.3, the server 20a may sequentially traverse the three sample service category rough prediction information pairs in the sample service category rough prediction information 22a, and first obtain "weight management clinic, 0.56". Because 0.56 is greater than 0.3, the server 20a may perform sample matching on this sample service category according to the positive and negative sample matching rule to obtain a sample matching result. "Weight management clinic" clearly matches the to-be-processed service query sample 23a, so the server 20a takes "weight management clinic" as a labeling service positive sample and adds it to the initial positive sample set (1). The server 20a then continues to obtain the remaining sample service category rough prediction information pairs and performs the corresponding judgment and matching. 
After the traversal is finished, the server 20a may obtain the initial positive sample set (1) as {weight management clinic, growth development clinic} and the initial negative sample set (2) as {height management promotion clinic}. Because neither set is empty, the server 20a may use the initial positive sample set (1) as the labeling service positive sample set corresponding to the to-be-processed service query sample 23a, and use the initial negative sample set (2) as the labeling service negative sample set corresponding to the to-be-processed service query sample 23a.
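The configuration of the positive and negative sample sets for one to-be-processed service query sample, including the fallback to a difficult negative sample, may be sketched as follows. This is a hedged illustration only: `configure_sample_sets` and the callable `is_match` are hypothetical names, `is_match` stands in for the positive and negative sample matching rule (or for manual labeling), and taking the first difficult negative sample is an assumption, since the description does not specify which one is chosen.

```python
from typing import Callable, List, Tuple


def configure_sample_sets(
    query: str,
    pairs: List[Tuple[str, float]],
    threshold: float,
    is_match: Callable[[str, str], bool],
    hard_negatives: List[str],
) -> Tuple[List[str], List[str]]:
    """Build the labeling positive and negative sample sets for one query."""
    positives: List[str] = []  # initial positive sample set (empty)
    negatives: List[str] = []  # initial negative sample set (empty)
    for category, prob in pairs:
        if prob < threshold:
            continue  # categories below the threshold are not matched
        if is_match(query, category):
            positives.append(category)
        else:
            negatives.append(category)
    # Fallback: if no negative was found, borrow a difficult negative sample.
    # Picking the first one is an assumption of this sketch.
    if not negatives and hard_negatives:
        negatives.append(hard_negatives[0])
    return positives, negatives


# Example mirroring sample 23a with a threshold of 0.3; the lambda is a
# stand-in for the matching rule, hard-coded for illustration.
pairs_22a = [
    ("weight management clinic", 0.56),
    ("height management promotion clinic", 0.38),
    ("growth development clinic", 0.37),
]
positives, negatives = configure_sample_sets(
    "early child development, obesity",
    pairs_22a,
    0.3,
    lambda q, c: c != "height management promotion clinic",
    hard_negatives=[],
)
```

Under these assumptions, `positives` holds the weight management clinic and the growth development clinic, and `negatives` holds the height management promotion clinic, as in the example above.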
Specifically, in a medical scenario, obtaining service category rough prediction information corresponding to an initial service query sample set, and respectively configuring a labeling service positive sample set and a labeling service negative sample set for M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information, where one possible embodiment may be: and acquiring the special disease department coarse prediction information (i.e. the business category coarse prediction information) corresponding to the disease query sample set (i.e. the initial business query sample set), and respectively configuring a special disease department positive sample set (i.e. the labeling business positive sample set) and a special disease department negative sample set (i.e. the labeling business negative sample set) for M to-be-processed disease query samples (i.e. to-be-processed business query samples) in the disease query sample set according to the special disease department coarse prediction information.
Step S102, performing combination pairing processing on the M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets to obtain N query sample triples, and determining positive weight parameter sets and negative weight parameter sets respectively corresponding to the N query sample triples according to the N query sample triples, the M labeling service positive sample sets and the M labeling service negative sample sets; N is a positive integer greater than or equal to M.
Specifically, each query sample triplet includes a service query sample belonging to M service query samples to be processed, a labeling service positive sample belonging to M labeling service positive sample sets, and a labeling service negative sample belonging to M labeling service negative sample sets.
Specifically, one feasible implementation process of performing combination pairing processing on the M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets to obtain the N query sample triples may be: pairing each of the M to-be-processed service query samples with each labeling service positive sample in its corresponding labeling service positive sample set to obtain N query sample pairs, and allocating one labeling service negative sample to each query sample pair from the labeling service negative sample set corresponding to the service query sample contained in that query sample pair, to obtain the N query sample triples. For ease of understanding, assume that the M to-be-processed service query samples include a to-be-processed service query sample R1, that the labeling service positive sample set corresponding to R1 is {labeling service positive sample R2, labeling service positive sample R3}, and that the labeling service negative sample set corresponding to R1 is {labeling service negative sample R4, labeling service negative sample R5}. Pairing R1 with its labeling service positive sample set yields 2 query sample pairs, namely [R1, R2] and [R1, R3]; allocating one labeling service negative sample to each of the 2 query sample pairs then yields 2 query sample triples, namely [R1, R2, R4] and [R1, R3, R5].
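The combination pairing processing may be sketched as follows. The description does not state how a labeling service negative sample is chosen for each query sample pair, so the round-robin allocation used here is an assumption, chosen only because it reproduces the R1 example; `build_triples` is a hypothetical name, and every query is assumed to have a non-empty negative sample set.

```python
from itertools import cycle
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (query sample, positive sample, negative sample)


def build_triples(
    positive_sets: Dict[str, List[str]],
    negative_sets: Dict[str, List[str]],
) -> List[Triple]:
    """Pair each query with each of its positives, then attach one negative."""
    triples: List[Triple] = []
    for query, positives in positive_sets.items():
        # Round-robin over the query's own negatives (assumed policy);
        # the negative sample set must be non-empty.
        negatives = cycle(negative_sets[query])
        for positive in positives:
            triples.append((query, positive, next(negatives)))
    return triples


triples = build_triples(
    {"R1": ["R2", "R3"]},
    {"R1": ["R4", "R5"]},
)
```

Because one triple is produced per (query, positive) pair and every query has at least one positive sample, the number of triples N is never smaller than the number of queries M.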
Specifically, one feasible implementation process of determining the positive weight parameter sets and the negative weight parameter sets respectively corresponding to the N query sample triples according to the N query sample triples, the M labeling service positive sample sets and the M labeling service negative sample sets may be: traversing the N query sample triples, and sequentially acquiring the service query sample in the h-th query sample triplet as a target service query sample, where h is a positive integer less than or equal to N; taking the labeling service positive sample set corresponding to the target service query sample as a target labeling service positive sample set, and taking the labeling service negative sample set corresponding to the target service query sample as a target labeling service negative sample set; generating a positive weight parameter set corresponding to the target service query sample according to the similarity relation between the target labeling service positive sample set and the labeling service positive samples respectively included in the N query sample triples; and generating a negative weight parameter set corresponding to the target service query sample according to the similarity relation between the target labeling service negative sample set and the labeling service negative samples respectively included in the N query sample triples.
Specifically, according to the similarity relationship between the target labeling service positive sample set and the labeling service positive samples respectively included in the N query sample triples, one possible implementation process of generating the positive weight parameter set corresponding to the target service query sample may be: creating an initial positive weight parameter set; the initial positive weight parameter set is an empty set; the number of the labeling service positive samples contained in the target labeling service positive sample set is used as the number of first positive samples; traversing N query sample ternary groups to obtain marking service positive samples respectively included in the query sample ternary groups, and sequentially obtaining g marking service positive samples; g is a positive integer less than or equal to N; the number of marking service positive samples which are different from the g marking service positive samples in the target marking service positive sample set is used as the number of second positive samples; according to the first positive sample number and the second positive sample number, determining a weight parameter used for representing a similarity relation between the target service query sample and the g-th marked service positive sample; adding the weight parameter between the target service query sample and the g labeling service positive sample to an initial positive weight parameter set; when the N labeling service positive samples are traversed, an initial positive weight parameter set containing weight parameters between the target service query sample and the N labeling service positive samples is used as a positive weight parameter set corresponding to the target service query sample.
According to the first positive sample number and the second positive sample number, the similarity between the target service query sample and the g-th marked service positive sample can be determined, and then the weight parameter between the target service query sample and the g-th marked service positive sample can be determined according to the similarity.
Specifically, the similarity between the target service query sample and the g-th labeled service positive sample may be expressed as the following formula (1):
where $|\mathrm{label}_h|$ denotes the number of class labels of the service query sample in the h-th query sample triplet, that is, the number of class labels of the target service query sample. The class labels are the labeling service positive samples contained in the target labeling service positive sample set corresponding to the target service query sample, so this number is the first positive sample number. $|\mathrm{label}_h - \{g^{+}\}|$ denotes the number of class labels of the target service query sample that remain after class label $g^{+}$ is removed, where class label $g^{+}$ corresponds to the g-th labeling service positive sample; $|\mathrm{label}_h - \{g^{+}\}|$ is the second positive sample number described above. $\delta$ is a constant added to prevent the denominator from being 0. That is, the similarity between the target service query sample and the g-th labeling service positive sample can be determined according to the direct proportional relation between the second positive sample number and the first positive sample number.
Further, the weight parameter can be expressed as the following formula (2):
where $\mu(h, g^{+})$ is the weight parameter between the target service query sample and the g-th labeling service positive sample. That is, the weight parameter that represents the similarity relationship between the target service query sample and the g-th labeling service positive sample is obtained from the similarity between the two and a preset exponential inverse proportion relation between similarity and weight parameter. The exponential inverse proportion relation means that the larger the similarity between the target service query sample and the g-th labeling service positive sample, the smaller the weight parameter between them.
Specifically, according to the similarity relationship between the target labeling service negative sample set and the labeling service negative samples respectively included in the N query sample triples, a feasible implementation process of generating the negative weight parameter set corresponding to the target service query sample may refer to the above-mentioned positive weight parameter set generation process, which is not described herein again.
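The counting steps of the weight-set generation can be sketched as below. Because formulas (1) and (2) are not reproduced in this text, the sketch deliberately leaves the mapping from the two counts to a weight as a caller-supplied function; `weight_fn`, `positive_sample_counts` and `build_weight_set` are hypothetical names introduced only for illustration. The same routine could be reused for negative samples, as the text suggests.

```python
def positive_sample_counts(target_positives, triple_positives):
    """Return (first_count, second_count) for each g-th positive sample."""
    target = set(target_positives)
    first_count = len(target)  # first positive sample number
    counts = []
    for g_positive in triple_positives:  # positive sample of the g-th triple
        # Second positive sample number: members of the target set that
        # differ from the g-th labeling service positive sample.
        second_count = len(target - {g_positive})
        counts.append((first_count, second_count))
    return counts


def build_weight_set(target_positives, triple_positives, weight_fn):
    """Fill an initially empty weight set, one weight per traversed sample.

    `weight_fn(first_count, second_count)` is a placeholder for formulas
    (1) and (2), which are not available here.
    """
    weights = []  # the initial (empty) positive weight parameter set
    for first, second in positive_sample_counts(target_positives, triple_positives):
        weights.append(weight_fn(first, second))
    return weights


# Target query R1 has positives {R2, R3}; the N = 3 triples contribute the
# positives R2, R3 and R7 (R7 belongs to some other, hypothetical query).
print(positive_sample_counts(["R2", "R3"], ["R2", "R3", "R7"]))
```

The second count drops below the first only when the g-th positive sample is itself one of the target query's positives, which is the situation the weights are meant to detect.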
Specifically, in a medical scenario, performing combined pairing processing on M to-be-processed service query samples, M labeling service positive sample sets and M labeling service negative sample sets to obtain N query sample triples, and determining, according to the N query sample triples, the M labeling service positive sample sets and the M labeling service negative sample sets, one possible embodiment of the positive weight parameter set and the negative weight parameter set corresponding to the N query sample triples respectively may be: and carrying out combined pairing treatment on M to-be-treated disease inquiry samples, M special disease department positive sample sets and M special disease department negative sample sets to obtain N disease inquiry sample triplets (i.e. inquiry sample triplets), and determining positive weight parameter sets and negative weight parameter sets respectively corresponding to the N disease inquiry sample triplets according to the N disease inquiry sample triplets, the M special disease department positive sample sets and the M special disease department negative sample sets.
The combination pairing treatment is performed on the M disease query samples to be treated, the M disease department positive sample sets and the M disease department negative sample sets to obtain one feasible implementation process of the N disease query sample triplets, which can be as follows: and respectively matching the M disease query samples with each disease-specific department positive sample in the corresponding disease-specific department positive sample sets to obtain N query sample pairs, and then distributing a disease-specific department negative sample for the query sample pairs according to the disease-specific department negative sample set corresponding to the disease query sample (i.e. the service query sample) contained in each query sample pair to obtain N disease query sample triplets.
One feasible implementation of determining, according to the N disease query sample triplets, the M special disease department positive sample sets and the M special disease department negative sample sets, the positive weight parameter set and the negative weight parameter set corresponding to each of the N disease query sample triplets may be: traverse the N disease query sample triplets and sequentially take the disease query sample in the h-th disease query sample triplet as the target disease query sample, where h is a positive integer less than or equal to N; take the special disease department positive sample set corresponding to the target disease query sample as the target special disease department positive sample set, and take the special disease department negative sample set corresponding to the target disease query sample as the target special disease department negative sample set; generate the positive weight parameter set corresponding to the target disease query sample according to the similarity relationship between the target special disease department positive sample set and the special disease department positive samples respectively included in the N disease query sample triplets; and generate the negative weight parameter set corresponding to the target disease query sample according to the similarity relationship between the target special disease department negative sample set and the special disease department negative samples respectively included in the N disease query sample triplets.
One feasible implementation of generating the positive weight parameter set corresponding to the target disease query sample, according to the similarity relationship between the target special disease department positive sample set and the special disease department positive samples respectively included in the N disease query sample triplets, may be: create an initial positive weight parameter set, which is an empty set; take the number of special disease department positive samples contained in the target special disease department positive sample set as the first positive sample number; traverse the special disease department positive samples respectively included in the N disease query sample triplets and sequentially obtain the g-th special disease department positive sample, where g is a positive integer less than or equal to N; take the number of special disease department positive samples in the target special disease department positive sample set that differ from the g-th special disease department positive sample as the second positive sample number; determine, according to the first positive sample number and the second positive sample number, the weight parameter that represents the similarity relationship between the target disease query sample and the g-th special disease department positive sample; add this weight parameter to the initial positive weight parameter set; and, once all N special disease department positive samples have been traversed, take the initial positive weight parameter set, which now contains the weight parameters between the target disease query sample and the N special disease department positive samples, as the positive weight parameter set corresponding to the target disease query sample.
Step S103, training the initial service guide model according to the N query sample triples, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set to obtain a target service guide model.
Specifically, the target service guide model is used for predicting a service class label corresponding to the service query information; each positive weight parameter is used for controlling the training influence of the similarity between a service inquiry sample and a labeling service positive sample on the initial service guide model; each negative weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service negative sample on the initial service guide model.
Specifically, one feasible implementation of training the initial service guide model according to the N query sample triples, the N positive weight parameters in each positive weight parameter set and the N negative weight parameters in each negative weight parameter set to obtain the target service guide model may be: perform feature coding on the N query sample triples through the initial service guide model to obtain the query sample vector triplets respectively corresponding to the N query sample triples, where one query sample vector triplet comprises a service query sample vector, a labeling service positive sample vector and a labeling service negative sample vector; traverse the N query sample triples and sequentially obtain the f-th query sample triplet, taking the service query sample in the f-th query sample triplet as the f-th service query sample and the labeling service positive sample in the f-th query sample triplet as the f-th labeling service positive sample; determine the loss function value corresponding to the f-th service query sample according to the N query sample vector triplets and the positive weight parameter set and negative weight parameter set corresponding to the f-th query sample triplet; once the N query sample triples have been traversed, adjust the model parameters of the initial service guide model according to the loss function values respectively corresponding to the N service query samples; and, if the adjusted initial service guide model satisfies the model convergence condition, take the adjusted initial service guide model as the target service guide model. The initial service guide model may be built on a BERT (Bidirectional Encoder Representations from Transformers, a pre-trained language model) pre-trained model.
Specifically, one feasible implementation of determining the loss function value corresponding to the f-th service query sample, according to the N query sample vector triplets and the positive weight parameter set and negative weight parameter set corresponding to the f-th query sample triplet, may be: obtain, from the N query sample vector triplets, the query sample vector triplet corresponding to the f-th query sample triplet as the target query sample vector triplet; determine a first similarity between the f-th service query sample and the f-th labeling service positive sample according to the service query sample vector and the labeling service positive sample vector included in the target query sample vector triplet; determine a second similarity between the f-th service query sample and each labeling service positive sample in the N query sample triples according to the service query sample vector in the target query sample vector triplet and each labeling service positive sample vector in the N query sample vector triplets; determine a third similarity between the f-th service query sample and each labeling service negative sample in the N query sample triples according to the service query sample vector in the target query sample vector triplet and each labeling service negative sample vector in the N query sample vector triplets; perform weight adjustment on the N second similarities according to the N positive weight parameters in the positive weight parameter set corresponding to the f-th query sample triplet to obtain N first weight similarities; perform weight adjustment on the N third similarities according to the N negative weight parameters in the negative weight parameter set corresponding to the f-th query sample triplet to obtain N second weight similarities; and determine the loss function value corresponding to the f-th service query sample according to the first similarity, the N first weight similarities and the N second weight similarities.
Specifically, the loss function corresponding to the f-th service query sample may be expressed as the following formula (3):
where the query sample vector triplet corresponding to the f-th query sample triplet, that is, the target query sample vector triplet, may be expressed as $(h_f, h_f^{+}, h_f^{-})$. Here $h_f$ is the service query sample vector corresponding to the service query sample in the f-th query sample triplet, $h_f^{+}$ is the labeling service positive sample vector corresponding to the labeling service positive sample in the f-th query sample triplet, and $h_f^{-}$ is the labeling service negative sample vector corresponding to the labeling service negative sample in the f-th query sample triplet. Similarly, $h_g^{+}$ is the labeling service positive sample vector corresponding to the labeling service positive sample in the g-th query sample triplet, and $h_g^{-}$ is the labeling service negative sample vector corresponding to the labeling service negative sample in the g-th query sample triplet. As given by formulas (1) and (2) above, $\mu(f, g^{+})$ is the weight parameter between the service query sample in the f-th query sample triplet and the labeling service positive sample in the g-th query sample triplet, that is, the g-th positive weight parameter in the positive weight parameter set corresponding to the f-th query sample triplet; $\mu(f, g^{-})$ is the weight parameter between the service query sample in the f-th query sample triplet and the labeling service negative sample in the g-th query sample triplet, that is, the g-th negative weight parameter in the negative weight parameter set corresponding to the f-th query sample triplet.
Specifically, the loss function value corresponding to the f-th service query sample can be determined through the loss function of formula (3). The server can determine the loss function values corresponding to all service query samples in the same way, and then adjust the model parameters of the initial service guide model according to these loss function values. It can be understood that different query sample triples may contain the same labeling service positive sample, in which case the corresponding vector representations should be very close; treating such samples directly as negative samples would harm the model. Positive and negative weight parameters are therefore introduced to measure the similarity between samples: when two samples are similar they are given a smaller weight, which reduces their influence and can improve the training effect of the model.
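To make the role of the weights concrete, the following sketch computes a weighted contrastive loss for one query sample. It is an illustration under explicit assumptions, not formula (3) itself, which is not reproduced in this text: the InfoNCE-style form, the use of cosine similarity, and the `temperature` parameter are all assumed, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors (assumed measure)."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def weighted_contrastive_loss(f, vec_triples, pos_weights, neg_weights,
                              temperature=0.05):
    """Loss for the f-th query, given N (query, positive, negative) vectors.

    pos_weights[g] and neg_weights[g] play the role of the g-th positive and
    negative weight parameters of the f-th triple.
    """
    h_f, h_f_pos, _ = vec_triples[f]
    # First similarity: the query against its own positive sample.
    numerator = math.exp(cosine(h_f, h_f_pos) / temperature)
    denominator = 0.0
    for g, (_, h_g_pos, h_g_neg) in enumerate(vec_triples):
        # Weighted second similarity (query vs. g-th positive).
        denominator += pos_weights[g] * math.exp(cosine(h_f, h_g_pos) / temperature)
        # Weighted third similarity (query vs. g-th negative).
        denominator += neg_weights[g] * math.exp(cosine(h_f, h_g_neg) / temperature)
    return -math.log(numerator / denominator)


vectors = [([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])]
print(weighted_contrastive_loss(0, vectors, [1.0], [1.0], temperature=1.0))
print(weighted_contrastive_loss(0, vectors, [1.0], [0.5], temperature=1.0))
```

In this form, lowering the weight of a term shrinks its contribution to the denominator, so a sample judged similar to the query's own labels pushes the model less strongly than an unrelated one.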
Specifically, in a medical scenario, training an initial service guide model according to N query sample triples, N positive weight parameters in each positive weight parameter set, and N negative weight parameters in each negative weight parameter set, to obtain a feasible embodiment of a target service guide model may be: training the initial business guiding model according to the N disease inquiry sample triplets, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set to obtain a diagnosis guiding model of the department of the specific disease.
The training of the initial business guiding model according to the N disease inquiry sample triplets, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set, and obtaining a feasible implementation process of the special disease department guiding model can be as follows: performing feature coding processing on the N disease inquiry sample triplets through an initial service guide model to obtain inquiry sample vector triplets respectively corresponding to the N disease inquiry sample triplets; one query sample vector triplet includes a condition query sample vector, a department specific positive sample vector, and a department specific negative sample vector; traversing the N disease inquiry sample triplets, sequentially acquiring an f disease inquiry sample triplet, taking a disease inquiry sample in the f disease inquiry sample triplet as an f disease inquiry sample, and taking a special disease department positive sample in the f disease inquiry sample triplet as an f special disease department positive sample; determining a loss function value corresponding to the f disease inquiry sample according to the N disease inquiry sample vector triplets, the positive weight parameter set and the negative weight parameter set corresponding to the f disease inquiry sample triplets; when the N disease inquiry sample triplets are traversed, according to the loss function values respectively corresponding to the N disease inquiry samples, carrying out model parameter adjustment on the initial service guide model; and if the adjusted initial business guiding model meets the model convergence condition, taking the adjusted initial business guiding model as a diagnosis guiding model of the department of specific diseases.
One feasible implementation of determining the loss function value corresponding to the f-th disease query sample, according to the N disease query sample vector triplets and the positive weight parameter set and negative weight parameter set corresponding to the f-th disease query sample triplet, may be: obtain, from the N disease query sample vector triplets, the disease query sample vector triplet corresponding to the f-th disease query sample triplet as the target disease query sample vector triplet; determine a first similarity between the f-th disease query sample and the f-th special disease department positive sample according to the disease query sample vector and the special disease department positive sample vector included in the target disease query sample vector triplet; determine a second similarity between the f-th disease query sample and each special disease department positive sample in the N disease query sample triplets according to the disease query sample vector in the target disease query sample vector triplet and each special disease department positive sample vector in the N disease query sample vector triplets; determine a third similarity between the f-th disease query sample and each special disease department negative sample in the N disease query sample triplets according to the disease query sample vector in the target disease query sample vector triplet and each special disease department negative sample vector in the N disease query sample vector triplets; perform weight adjustment on the N second similarities according to the N positive weight parameters in the positive weight parameter set corresponding to the f-th disease query sample triplet to obtain N first weight similarities; perform weight adjustment on the N third similarities according to the N negative weight parameters in the negative weight parameter set corresponding to the f-th disease query sample triplet to obtain N second weight similarities; and determine the loss function value corresponding to the f-th disease query sample according to the first similarity, the N first weight similarities and the N second weight similarities.
By adopting the method provided by the embodiment of the application, the initial service query sample set can be screened through the service category probability threshold value contained in the service category rough prediction information, part of non-marked service query samples which do not exceed the service category probability threshold value are filtered, and then the rest of to-be-processed service query samples are configured with the marked service positive sample set and the marked service negative sample set, so that the training data acquisition efficiency is greatly improved, and the manual marking cost and the training data acquisition time are reduced; in addition, based on the labeling service positive sample set and the labeling service negative sample set respectively configured by the M to-be-processed service query samples, N query sample triplets for training the initial service guide model can be constructed, and positive weight parameters and negative weight parameters are introduced when the initial service guide model is trained based on the N query sample triplets, so that training data is expanded, training influence of similar samples in different query sample triplets on the initial service guide model is reduced, and accuracy of service class prediction can be further improved.
Further, referring to fig. 4, fig. 4 is a flow chart of a data processing method according to an embodiment of the present application. The method is used for obtaining the target service standardization model mentioned in the embodiment corresponding to fig. 3, and the method may be executed by a server, or may be executed by a terminal device, or may be executed by a server and a terminal device together, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For ease of understanding, embodiments of the present application will be described in terms of this method being performed by a server. The data processing method may include the following steps S201 to S204:
Step S201, at least two initial traffic class samples are obtained.
Specifically, the server may obtain at least two initial service class samples from a local database or from network data associated with the first service scenario. An initial service class sample is the name or description text of a service class. The at least two initial service class samples should correspond to the same service scenario, namely the first service scenario, and the service class corresponding to each initial service class sample covers services within a specific range. For example, in a government affairs scenario, the at least two initial service class samples obtained by collecting the service class data of 1000 online government affairs systems may be {endowment insurance information management system, medical insurance service area, ……, retirement endowment area}, where the endowment insurance information management system corresponds to services such as endowment insurance handling and the transfer in and out of endowment relationships for users; the medical insurance service area corresponds to services such as medical insurance handling and reimbursement for users; and the retirement endowment area corresponds to services such as retirement, endowment insurance, and the transfer in and out of endowment relationships for users.
Specifically, in a medical scenario, one possible embodiment of obtaining at least two initial business class samples may be: at least two real department samples are obtained.
Step S202, screening coarse clustering processing is carried out on the at least two initial business category samples, and H business category coarse clustering sets are obtained.
Specifically, the at least two initial traffic class samples may include initial traffic class samples corresponding to standard traffic classes and nonsensical traffic classes, and may further include repeated initial traffic class samples, so that the at least two initial traffic class samples need to be filtered to filter out the repeated initial traffic class samples and the initial traffic class samples corresponding to the standard traffic classes and nonsensical traffic classes. The standard service class refers to a recognized or conventional service class, that is, a service class which is used by most enterprises or organizations by default, the service coverage of the standard service class is usually very large, and services corresponding to one standard service class can be further divided into a plurality of specific types of services, so that an initial service class sample corresponding to the standard service class can be filtered. The nonsensical service class refers to a service class with little service relevance to the first service scenario, that is, a service corresponding to the nonsensical service class is not generally a main service in the first service scenario. 
For example, in a medical scenario, the at least two initial service class samples include {endocrinology, employee canteen, ……, thyroidism}. It can be understood that "thyroidism" is a distinctive service class that mainly treats thyroid disease, whereas "endocrinology" belongs to the standard three-level department system, whose range of responsible diseases is generally accepted, and may contain several distinctive service classes such as "thyroidism". "Endocrinology" is therefore a standard service class, and the initial service class sample "endocrinology" should be filtered out. In addition, "employee canteen" may also be collected when service class samples are gathered from big data related to the medical scenario, but patients seeking treatment are not concerned with the hospital canteen, so "employee canteen" is a nonsensical service class and also needs to be filtered out.
Specifically, assume that Z service class samples are obtained after the at least two initial service class samples are filtered. The services corresponding to some of these service class samples usually overlap heavily, that is, some service class samples are highly similar. For example, for the "endowment insurance information management system" and the "retirement endowment area" in step S201, the services corresponding to both service classes include endowment insurance handling and the transfer in and out of endowment relationships. Service class samples with high similarity are therefore assigned to the same service class coarse clustering set.
Specifically, one feasible implementation of performing screening and coarse clustering on the at least two initial service class samples to obtain the H service class coarse clustering sets may be: perform de-duplication on the at least two initial service class samples to obtain X de-duplicated service class samples and the frequency corresponding to each of them, where the frequency of a de-duplicated service class sample is the number of initial service class samples among the at least two initial service class samples that are identical to it, and X is a positive integer; obtain, from the X de-duplicated service class samples, those whose frequency is greater than or equal to a frequency threshold, giving Y high-frequency service class samples, where Y is a positive integer less than or equal to X; filter and update the Y high-frequency service class samples according to the standard service class dictionary, the stop word dictionary and the auxiliary service class dictionary to obtain Z service class samples, where Z is a positive integer less than or equal to Y; and perform coarse clustering on the Z service class samples to obtain the H service class coarse clustering sets. De-duplication means recording the number of identical initial service class samples as the frequency and then retaining a single copy of that sample. For example, if the at least two initial service class samples are {xx1, xx2, xx1, xx1}, de-duplication yields {['xx1', 3], ['xx2', 1]}, where ['xx1', 3] means that the frequency of "xx1" is 3. The frequency threshold is configured in advance to filter out rare service class samples and can be set according to the actual situation; for example, the frequency threshold may be 3.
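The de-duplication and frequency filtering can be sketched with Python's standard `collections.Counter`. The function name `high_frequency_samples` is hypothetical, and the default threshold of 3 simply follows the example value in the text.

```python
from collections import Counter


def high_frequency_samples(initial_samples, freq_threshold=3):
    """De-duplicate samples, then keep those at or above the threshold."""
    counts = Counter(initial_samples)  # de-duplicated sample -> frequency
    # most_common() lists entries in descending order of frequency.
    return [(name, freq) for name, freq in counts.most_common()
            if freq >= freq_threshold]


samples = ["xx1", "xx2", "xx1", "xx1"]
print(Counter(samples).most_common())   # [('xx1', 3), ('xx2', 1)]
print(high_frequency_samples(samples))  # [('xx1', 3)]
```

Here X = 2 de-duplicated samples are found and Y = 1 of them survives the threshold.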
Specifically, filtering and updating the Y high-frequency service class samples according to the standard service class dictionary, the stop word dictionary and the auxiliary service class dictionary to obtain a feasible implementation process of the Z service class samples, which may be: traversing Y high-frequency service class samples, and sequentially acquiring an e-th high-frequency service class sample; e is a positive integer less than or equal to Y; matching the e-th high-frequency business class sample with an auxiliary business class dictionary; if the e-th high-frequency business class sample fails to be matched with the auxiliary business class dictionary, matching the e-th high-frequency business class sample with the stop word dictionary; if the e-th high-frequency business class sample is successfully matched with the stop word dictionary, taking the stop word matched with the e-th high-frequency business class sample in the stop word dictionary as a target stop word, and removing the target stop word from the e-th high-frequency business class sample to obtain an e-th transition high-frequency business class sample; if the e-th high-frequency business class sample fails to match with the stop word dictionary, the e-th high-frequency business class sample is used as an e-th transition high-frequency business class sample; matching the e-th transition high-frequency business class sample with a standard business class dictionary; if the e-th transition high-frequency business class sample fails to match with the standard business class dictionary, the e-th transition high-frequency business class sample is used as a business class sample; when Y high-frequency service class samples have been traversed, Z service class samples are obtained. 
The standard business class dictionary, the stop word dictionary and the auxiliary business class dictionary are dictionaries constructed in advance. The standard business class dictionary includes the standard business classes and their synonymous business classes; for example, [cardiovascular department, intracardiac department] is a group of synonymous departments (synonymous business classes) in a medical scenario. The stop word dictionary includes words that need to be filtered out, usually meaningless words such as stop words and numbers; for example, in a medical scenario, the stop word dictionary may include ['minor principal', 'doctor', 'caddy', 'multidisciplinary', 'joint', 'comprehensive']. The auxiliary business class dictionary includes keywords of meaningless business classes, i.e. business classes that do not need guidance in the first business scenario; for example, in a medical scenario, the auxiliary business class dictionary may include ['canteen', 'materials', 'community', 'social health', 'manpower', 'group commission', 'logistics', 'congress', 'laundry', 'equipment', 'store'].
In short, to screen the at least two initial business class samples into the Z business class samples, the server only needs to sort the initial business class samples in descending order of frequency and retain those whose frequency >= the frequency threshold (for example, 3), obtaining the Y high-frequency business class samples. The server then traverses all the high-frequency business class samples: a sample that hits the auxiliary business class dictionary is removed; stop words are removed from each remaining sample to obtain a transition high-frequency business class sample; a transition high-frequency business class sample that hits the standard business class dictionary already belongs to the standard business classes and is also removed. The transition high-frequency business class samples that remain are the Z business class samples on which coarse clustering is to be performed.
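The three-dictionary filtering can be sketched as below. This is only an illustration under stated assumptions: the application does not say how "matching" is performed, so the sketch assumes substring matching for the auxiliary and stop word dictionaries and exact matching for the standard dictionary. It also drops samples that become empty after stop word removal. The function name and the English sample strings are hypothetical.

```python
def filter_high_frequency_samples(samples, auxiliary_keywords, stop_words,
                                  standard_classes):
    """Return the samples that survive the three dictionary checks."""
    kept = []
    for sample in samples:
        # 1. Remove samples that hit the auxiliary dictionary
        #    (assumption: a hit means the keyword occurs as a substring).
        if any(keyword in sample for keyword in auxiliary_keywords):
            continue
        # 2. Strip every matched stop word to get the transition sample.
        transition = sample
        for word in stop_words:
            transition = transition.replace(word, "")
        transition = transition.strip()
        # 3. Remove transition samples that are already standard classes
        #    (assumption: exact match); empty strings are dropped as well.
        if not transition or transition in standard_classes:
            continue
        kept.append(transition)
    return kept


kept = filter_high_frequency_samples(
    ["logistics office", "thyroid clinic joint", "cardiology", "sleep clinic"],
    auxiliary_keywords=["logistics"],
    stop_words=["joint"],
    standard_classes={"cardiology"},
)
```

With these hypothetical inputs, "logistics office" is removed by the auxiliary dictionary, "cardiology" by the standard dictionary, and "thyroid clinic joint" loses its stop word before being kept.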
Specifically, one feasible implementation of performing coarse clustering on the Z business class samples to obtain the H business class coarse clustering sets may be: obtaining the B business class samples with the highest frequencies from the Z business class samples as B seed business class samples, and adding the B seed business class samples to an initial seed business class sample set; taking the business class samples among the Z business class samples other than the B seed business class samples as iterative business class samples; performing iterative coarse clustering on the initial seed business class sample set according to the iterative business class samples to obtain a coarse clustering seed business class sample set, where the coarse clustering seed business class sample set includes H seed business class samples and each seed business class sample is associated with similar business class samples; and constructing a coarse clustering set from each seed business class sample in the coarse clustering seed business class sample set and the similar business class samples associated with it, to obtain the H business class coarse clustering sets.
Specifically, one feasible implementation of performing iterative coarse clustering on the initial seed business class sample set according to the iterative business class samples to obtain the coarse clustering seed business class sample set may be: sequentially acquiring the d-th iterative business class sample from the iterative business class samples, where d is a positive integer less than or equal to Z-B; calculating the distance similarity between the d-th iterative business class sample and each seed business class sample in the iterative seed business class sample set of round d-1, where, when d is 1, the iterative seed business class sample set of round d-1 is the initial seed business class sample set; if the maximum of these distance similarities is smaller than the target similarity threshold, adding the d-th iterative business class sample to the iterative seed business class sample set of round d-1 to obtain the iterative seed business class sample set of round d; if the maximum of these distance similarities is greater than or equal to the target similarity threshold, determining the d-th iterative business class sample as a similar business class sample associated with the seed business class sample corresponding to the maximum distance similarity, and taking the iterative seed business class sample set of round d-1 as the iterative seed business class sample set of round d; in either case, continuing to acquire the (d+1)-th iterative business class sample and performing coarse clustering on the iterative seed business class sample set of round d, until the iterative seed business class sample set of round Z-B is obtained and taken as the coarse clustering seed business class sample set.
The distance similarity may be an edit distance similarity or a Jaccard (Jaccard coefficient) distance similarity, may be another similarity, or may be a weighted average of two or more similarities. The greater the distance similarity, the more similar the two samples. In the above process, if the maximum distance similarity between the d-th iterative business class sample and the seed business class samples is greater than or equal to the target similarity threshold, the d-th iterative business class sample is taken as a similar business class sample associated with the corresponding seed business class sample; otherwise, the d-th iterative business class sample is taken as a new seed business class sample.
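The seed-based iterative coarse clustering can be sketched as follows. This is a minimal illustration, not the definitive procedure: the function names are hypothetical, character-level Jaccard similarity is used as the default distance similarity, and the input is assumed to be a de-duplicated list already sorted by descending frequency with at least one seed (B >= 1).

```python
def char_jaccard(a, b):
    """Character-level Jaccard similarity between two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def coarse_cluster(samples_by_freq, num_seeds, threshold,
                   similarity=char_jaccard):
    """Map each seed sample to the list of similar samples attached to it."""
    # The B most frequent samples form the initial seed set.
    clusters = {seed: [] for seed in samples_by_freq[:num_seeds]}
    # The remaining Z-B samples are processed one per round.
    for sample in samples_by_freq[num_seeds:]:
        best_seed = max(clusters, key=lambda seed: similarity(sample, seed))
        if similarity(sample, best_seed) >= threshold:
            # Similar enough: attach to the closest seed.
            clusters[best_seed].append(sample)
        else:
            # Not similar to any seed: the sample becomes a new seed.
            clusters[sample] = []
    return clusters


clusters = coarse_cluster(["ABCD", "WXYZ", "ABCE", "MNOP"],
                          num_seeds=2, threshold=0.5)
```

In this hypothetical run, "ABCE" joins the seed "ABCD" (similarity 3/5), while "MNOP" matches no seed and becomes a third seed, so H is 3.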
The seed business class samples in the iterative seed business class sample set of round d-1 include a target seed business class sample. One feasible implementation of calculating the distance similarity between the d-th iterative business class sample and each seed business class sample in the iterative seed business class sample set of round d-1 may be: determining the number of coincident characters of the d-th iterative business class sample and the target seed business class sample, and the number of union characters corresponding to the d-th iterative business class sample and the target seed business class sample, and dividing the number of coincident characters by the number of union characters to obtain the distance similarity between the d-th iterative business class sample and the target seed business class sample. The number of coincident characters is the number of characters that appear in both the d-th iterative business class sample and the target seed business class sample; the number of union characters is the number of characters that remain after the characters contained in the two samples are merged and de-duplicated. For example, if the d-th iterative business class sample is "ABCD" and the target seed business class sample is "ABEF", the shared characters are "A" and "B", so the number of coincident characters is 2; after merging and de-duplication, the remaining characters are "A", "B", "C", "D", "E" and "F", so the number of union characters is 6.
When the distance similarity corresponding to the d-th iterative business class sample is the Jaccard (Jaccard coefficient) distance similarity, the corresponding Jaccard distance formula is as follows:

$$J(S, O) = \frac{|S \cap O|}{|S \cup O|}$$

where S may be the above-mentioned d-th iterative business class sample, and O may be any seed business class sample in the iterative seed business class sample set of round d-1; $S \cap O$ denotes the characters that coincide in the two samples, and $S \cup O$ denotes the union of the characters of the two samples.
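As a rough illustration of the similarity options mentioned above, the sketch below computes the character-level Jaccard similarity, an edit distance similarity, and a weighted average of the two. The function names are hypothetical. Normalizing the edit distance by the longer string length and the 0.5/0.5 weights are assumptions made here for illustration only; the application does not specify them.

```python
def jaccard_similarity(s, o):
    """|S ∩ O| / |S ∪ O| over the characters of the two samples."""
    set_s, set_o = set(s), set(o)
    union = set_s | set_o
    return len(set_s & set_o) / len(union) if union else 0.0


def edit_distance(a, b):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        cur = [i]
        for j, char_b in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (char_a != char_b)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def combined_similarity(a, b, w_jaccard=0.5, w_edit=0.5):
    """Weighted average of the two similarities (weights are illustrative)."""
    return w_jaccard * jaccard_similarity(a, b) + w_edit * edit_similarity(a, b)


j = jaccard_similarity("ABCD", "ABEF")   # 2 shared characters out of 6
c = combined_similarity("ABCD", "ABEF")
```

For two four-character samples that share two characters, the Jaccard similarity is 2/6, which would then be compared against the target similarity threshold.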
Specifically, in a medical scenario, one feasible embodiment of performing coarse clustering on the at least two initial business class samples to obtain the H business class coarse clustering sets may be: screening and coarse clustering at least two real department samples to obtain H special disease department coarse clustering sets.
One feasible implementation of screening and coarse clustering the at least two real department samples to obtain the H special disease department coarse clustering sets may be: performing de-duplication on the at least two real department samples to obtain X de-duplicated real department samples and the frequency corresponding to each of them, where the frequency corresponding to a de-duplicated real department sample is the number of real department samples, among the at least two real department samples, that are identical to that de-duplicated real department sample, and X is a positive integer; obtaining, from the X de-duplicated real department samples, those whose frequency is greater than or equal to the frequency threshold, to obtain Y high-frequency real department samples, where Y is a positive integer less than or equal to X; filtering and updating the Y high-frequency real department samples according to a standard department dictionary, a stop word dictionary and an auxiliary department dictionary to obtain Z special disease department samples, where Z is a positive integer less than or equal to Y; and performing coarse clustering on the Z special disease department samples to obtain the H special disease department coarse clustering sets.
One feasible implementation of filtering and updating the Y high-frequency real department samples according to the standard department dictionary, the stop word dictionary and the auxiliary department dictionary to obtain the Z special disease department samples may be: traversing the Y high-frequency real department samples and sequentially acquiring the e-th high-frequency real department sample, where e is a positive integer less than or equal to Y; matching the e-th high-frequency real department sample against the auxiliary department dictionary; if the e-th high-frequency real department sample fails to match the auxiliary department dictionary, matching it against the stop word dictionary; if the e-th high-frequency real department sample matches the stop word dictionary, taking the stop word in the stop word dictionary that matches the e-th high-frequency real department sample as the target stop word, and removing the target stop word from the e-th high-frequency real department sample to obtain the e-th transition high-frequency real department sample; if the e-th high-frequency real department sample fails to match the stop word dictionary, taking the e-th high-frequency real department sample itself as the e-th transition high-frequency real department sample; matching the e-th transition high-frequency real department sample against the standard department dictionary; if the e-th transition high-frequency real department sample fails to match the standard department dictionary, taking it as a special disease department sample; and, when all Y high-frequency real department samples have been traversed, obtaining the Z special disease department samples.
One feasible implementation of performing coarse clustering on the Z special disease department samples to obtain the H special disease department coarse clustering sets may be: obtaining the B special disease department samples with the highest frequencies from the Z special disease department samples as B seed special disease department samples, and adding the B seed special disease department samples to an initial seed special disease department sample set; taking the special disease department samples among the Z special disease department samples other than the B seed special disease department samples as iterative special disease department samples; performing iterative coarse clustering on the initial seed special disease department sample set according to the iterative special disease department samples to obtain a coarse clustering seed special disease department sample set, where the coarse clustering seed special disease department sample set includes H seed special disease department samples and each seed special disease department sample is associated with similar special disease department samples; and constructing a coarse clustering set from each seed special disease department sample in the coarse clustering seed special disease department sample set and the similar special disease department samples associated with it, to obtain the H special disease department coarse clustering sets.
Specifically, one feasible implementation of performing iterative coarse clustering on the initial seed special disease department sample set according to the iterative special disease department samples to obtain the coarse clustering seed special disease department sample set may be: sequentially acquiring the d-th iterative special disease department sample from the iterative special disease department samples, where d is a positive integer less than or equal to Z-B; calculating the distance similarity between the d-th iterative special disease department sample and each seed special disease department sample in the iterative seed special disease department sample set of round d-1, where, when d is 1, the iterative seed special disease department sample set of round d-1 is the initial seed special disease department sample set; if the maximum of these distance similarities is smaller than the target similarity threshold, adding the d-th iterative special disease department sample to the iterative seed special disease department sample set of round d-1 to obtain the iterative seed special disease department sample set of round d; if the maximum of these distance similarities is greater than or equal to the target similarity threshold, determining the d-th iterative special disease department sample as a similar special disease department sample associated with the seed special disease department sample corresponding to the maximum distance similarity, and taking the iterative seed special disease department sample set of round d-1 as the iterative seed special disease department sample set of round d; in either case, continuing to acquire the (d+1)-th iterative special disease department sample and performing coarse clustering on the iterative seed special disease department sample set of round d, until the iterative seed special disease department sample set of round Z-B is obtained and taken as the coarse clustering seed special disease department sample set.
Step S203, constructing P business class sample sets according to the H business class coarse clustering sets; P is a positive integer.
Specifically, one business class sample set includes H business class sample pairs. The business class samples in the business class sample pairs belonging to the same business class sample set come from different business class coarse clustering sets, and each business class sample pair includes a business class sample and the business class positive sample corresponding to that business class sample. The business class positive sample is the business class sample with the highest frequency in the business class coarse clustering set to which the business class sample belongs.
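One possible way to build such sample sets is sketched below. The text does not say how the business class sample of each pair is chosen from its coarse clustering set, so drawing it at random (which may occasionally pick the positive sample itself) is purely an assumption of this sketch. The function name, parameters and example department names are hypothetical.

```python
import random


def build_sample_sets(coarse_clusters, frequency, num_sets, seed=0):
    """Build P sample sets, each holding one (sample, positive) pair per cluster.

    coarse_clusters: list of H lists of business class samples.
    frequency: dict mapping each sample to its frequency.
    """
    rng = random.Random(seed)
    sample_sets = []
    for _ in range(num_sets):
        pairs = []
        for cluster in coarse_clusters:
            # Positive sample: the most frequent sample of the cluster.
            positive = max(cluster, key=lambda s: frequency[s])
            # Assumption: the paired sample is drawn at random from the cluster.
            sample = rng.choice(cluster)
            pairs.append((sample, positive))
        sample_sets.append(pairs)
    return sample_sets


coarse_clusters = [["thyroid clinic", "thyroid outpatient"],
                   ["sleep clinic", "sleep disorder clinic"]]
frequency = {"thyroid clinic": 9, "thyroid outpatient": 4,
             "sleep clinic": 7, "sleep disorder clinic": 3}
sample_sets = build_sample_sets(coarse_clusters, frequency, num_sets=3)
```

Because each pair in a set comes from a different coarse clustering set, the other pairs of the same set can later serve as in-batch negatives.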
Specifically, in a medical scenario, one feasible embodiment of constructing the P business class sample sets according to the H business class coarse clustering sets may be: constructing P special disease department sample sets according to the H special disease department coarse clustering sets, where P is a positive integer. One special disease department sample set includes H special disease department sample pairs; the special disease department samples in the special disease department sample pairs belonging to the same special disease department sample set come from different special disease department coarse clustering sets, and each special disease department sample pair includes a special disease department sample and the standard special disease department sample corresponding to it. A standard special disease department sample is the special disease department sample with the highest frequency in a special disease department coarse clustering set.
Step S204, performing iterative training on the initial business standardization model according to the P business class sample sets to obtain a target business standardization model.
Specifically, one feasible implementation of iteratively training the initial business standardization model according to the P business class sample sets to obtain the target business standardization model may be: sequentially acquiring the q-th business class sample set among the P business class sample sets, where q is a positive integer less than or equal to P; taking the H business class sample pairs in the q-th business class sample set as H to-be-processed business class sample pairs; performing feature encoding on the H to-be-processed business class sample pairs through the iterative business standardization model of round q-1 to obtain H business class vector pairs, where one business class vector pair includes a business class sample vector and a business class positive sample vector, and, when q = 1, the iterative business standardization model of round q-1 is the initial business standardization model; then determining, according to the business class sample vector and the business class positive sample vector in each business class vector pair and the negative sample relation between different business class vector pairs, the loss function values respectively corresponding to the business class samples included in the H to-be-processed business class sample pairs; training the iterative business standardization model of round q-1 according to the H loss function values to obtain the iterative business standardization model of round q; if q is equal to P, taking the iterative business standardization model of round q as the target business standardization model; if q is smaller than P, continuing to acquire the (q+1)-th business class sample set and training the iterative business standardization model of round q through the (q+1)-th business class sample set. The initial business standardization model may be constructed using a BERT pre-training model.
Specifically, one feasible implementation of performing feature encoding on the H to-be-processed business class sample pairs through the iterative business standardization model of round q-1 to obtain the H business class vector pairs may be: obtaining, through the iterative business standardization model of round q-1, the sequence representations corresponding to the business class sample and the business class positive sample in each to-be-processed business class sample pair, and applying average pooling in turn to these sequence representations, to obtain the business class sample vector corresponding to each business class sample and the business class positive sample vector corresponding to each business class positive sample in the H to-be-processed business class sample pairs. For ease of understanding, take the i-th business class sample in the H to-be-processed business class sample pairs as an example. The i-th business class sample may be expressed as $x_i = [x_1, x_2, \ldots, x_n]$, where $x_n$ denotes the n-th character contained in the i-th business class sample. Through the iterative business standardization model of round q-1 (i.e., the BERT pre-training model after q-1 rounds of training), the sequence representation $[e_1, e_2, \ldots, e_n]$ of the i-th business class sample can be obtained, where $e_n$ is the representation vector corresponding to the n-th character contained in the i-th business class sample. Average pooling is then applied to each dimension of $[e_1, e_2, \ldots, e_n]$ to obtain the sentence vector representation $h_i$ of the i-th business class sample, i.e., the i-th business class sample vector. Alternatively, the sentence vector of the business class text may be obtained in other ways, such as max pooling or extracting the representation of the sentence-initial special token [CLS].
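The pooling step can be illustrated with plain Python lists standing in for the encoder output. This is a sketch only: the function names are hypothetical, and the toy two-dimensional vectors merely replace the per-character representation vectors that a BERT-style model would produce.

```python
def mean_pool(token_vectors):
    """Average each dimension over the n character vectors [e_1, ..., e_n]."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def max_pool(token_vectors):
    """Alternative: take the maximum of each dimension over the n vectors."""
    return [max(column) for column in zip(*token_vectors)]


# Toy sequence representation: n = 3 characters, 2 dimensions each.
sequence_representation = [[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]]
h_mean = mean_pool(sequence_representation)
h_max = max_pool(sequence_representation)
```

Either pooled vector has the same dimensionality as a single character vector, so samples of different lengths map to sentence vectors of the same size.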
Specifically, the H to-be-processed business class sample pairs may form one batch, the size of which is H. Each business class sample has a corresponding business class positive sample, and the other business class samples can be regarded as its negative samples. Training the iterative business standardization model of round q-1 on one batch therefore shortens the distance between a sample vector and its positive sample vector and enlarges the distance between a sample vector and its negative sample vectors. Specifically, the loss function of the i-th business class sample may be expressed as the following formula:
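$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i, h_i^{+})/\tau}}{\sum_{j=1}^{N} e^{\mathrm{sim}(h_i, h_j)/\tau}}$$

where $\mathrm{sim}(h_i, h_i^{+})$ is the similarity function between the i-th business class sample and its corresponding business class positive sample, generally the cosine similarity $\mathrm{sim}(h_i, h_i^{+}) = \frac{h_i^{\top} h_i^{+}}{\lVert h_i \rVert \cdot \lVert h_i^{+} \rVert}$; $\tau$ is the temperature parameter; $\mathrm{sim}(h_i, h_j)$ represents the similarity between the i-th business class sample and its negative samples; and N is the batch size.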
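A minimal pure-Python sketch of such an in-batch contrastive loss is given below. Several points are assumptions rather than statements of the application: the function names are hypothetical, the negatives of the i-th sample are taken to be the vectors of the other samples in the batch, the positive term is included in the denominator, and the default temperature of 0.05 is illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, tau=0.05):
    """-log( exp(s_pos/tau) / (exp(s_pos/tau) + sum_k exp(s_neg_k/tau)) )."""
    pos_logit = cosine(anchor, positive) / tau
    logits = [pos_logit] + [cosine(anchor, neg) / tau for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - pos_logit


def batch_losses(sample_vecs, positive_vecs, tau=0.05):
    """One loss value per sample; other samples in the batch act as negatives."""
    return [
        contrastive_loss(h, positive_vecs[i],
                         sample_vecs[:i] + sample_vecs[i + 1:], tau)
        for i, h in enumerate(sample_vecs)
    ]


losses = batch_losses([[1.0, 0.0], [0.0, 1.0]],
                      [[1.0, 0.0], [0.0, 1.0]], tau=1.0)
```

A lower temperature sharpens the distribution, so a sample that is already close to its positive and far from its negatives yields a loss near zero.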
Specifically, in a medical scenario, one feasible embodiment of iteratively training the initial business standardization model according to the P business class sample sets to obtain the target business standardization model may be: iteratively training the initial business standardization model according to the P special disease department sample sets to obtain a special disease department standardization model.
One feasible implementation of iteratively training the initial business standardization model according to the P special disease department sample sets to obtain the special disease department standardization model may be: sequentially acquiring the q-th special disease department sample set among the P special disease department sample sets, where q is a positive integer less than or equal to P; taking the H special disease department sample pairs in the q-th special disease department sample set as H to-be-processed special disease department sample pairs; performing feature encoding on the H to-be-processed special disease department sample pairs through the iterative special disease department standardization model of round q-1 to obtain H special disease department vector pairs, where one special disease department vector pair includes a special disease department sample vector and a standard special disease department vector, and, when q = 1, the iterative special disease department standardization model of round q-1 is the initial business standardization model; then determining, according to the special disease department sample vector and the standard special disease department vector in each special disease department vector pair and the negative sample relation between different special disease department vector pairs, the loss function values respectively corresponding to the special disease department samples included in the H to-be-processed special disease department sample pairs; training the iterative special disease department standardization model of round q-1 according to the H loss function values to obtain the iterative special disease department standardization model of round q; if q is equal to P, taking the iterative special disease department standardization model of round q as the target business standardization model; if q is smaller than P, continuing to acquire the (q+1)-th special disease department sample set and training the iterative special disease department standardization model of round q through the (q+1)-th special disease department sample set.
With the method provided by this embodiment of the application, screening and coarse clustering can be performed on the acquired at least two initial business class samples, and repeated and meaningless initial business class samples can be filtered out. This raises the concentration of effective samples among the business class samples, improves the training effect of the model, and makes the finally obtained target business standardization model highly accurate.
Further, to better understand how the data processing method provided in the embodiments corresponding to fig. 3 and fig. 4 is applied in an actual business scenario, a medical scenario is taken as an example. Referring to fig. 5, fig. 5 is a training schematic diagram of a semi-supervised contrastive learning special disease matching model according to an embodiment of the present application. It will be appreciated that the model training process shown in fig. 5 may be implemented by a server, which may be the server 20a in the embodiment corresponding to fig. 2 above. As shown in fig. 5, the server may train a special disease department standardization model 53 using the special disease department coarse clustering corpus 51 obtained through special disease discovery. The implementation of special disease discovery is the implementation of step S201 to step S202 in the embodiment corresponding to fig. 4, in which the initial business class samples are the real departments in the medical scenario; the special disease department coarse clustering corpus 51 is the H business class coarse clustering sets described in the embodiment corresponding to fig. 4; for the training process of the special disease department standardization model 53, reference may be made to the implementation of step S203 to step S204 in the embodiment corresponding to fig. 4; and the special disease department standardization model 53 is the target business standardization model described in the embodiment corresponding to fig. 4.
Then, the server can run prediction on the unlabeled query text 52 through the special disease department standardization model to obtain the pseudo-label query text 54, screen data with a prediction result and data without a prediction result from the pseudo-label query text 54 at a ratio of 8:2, perform verification labeling on the screened data to obtain the labeled query text 55, and finally train the special disease department guidance model 56 using the labeled query text 55. The unlabeled query text 52 is the L unlabeled service query samples in the initial service query sample set in the embodiment corresponding to fig. 3; the pseudo-label query text 54 is the unlabeled service query samples together with their corresponding service class prediction information, where the samples in the first prediction result sample set are samples with a prediction result and the samples in the second prediction result sample set are samples without a prediction result; the labeled query text 55 is the to-be-processed service query samples described in the embodiment corresponding to fig. 3 above; the loss function for training the special disease department guidance model 56 may be designed based on a variable-weight contrastive learning method, and for the specific implementation reference may be made to the process of step S101 to step S103 in the embodiment corresponding to fig. 3.
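The 8:2 screening step can be sketched as follows. This illustration reads the ratio as "with a prediction result : without a prediction result". The function name, the tuple layout and the use of `None` to mark a missing prediction are all hypothetical choices made here, not details from the application.

```python
import random


def select_for_annotation(pseudo_labeled, total, ratio_with=0.8, seed=0):
    """Pick samples for verification labeling at roughly ratio_with : rest.

    pseudo_labeled: list of (query_text, predicted_label_or_None) tuples.
    """
    rng = random.Random(seed)
    with_pred = [item for item in pseudo_labeled if item[1] is not None]
    without_pred = [item for item in pseudo_labeled if item[1] is None]
    # Cap each quota by what is actually available.
    n_with = min(len(with_pred), round(total * ratio_with))
    n_without = min(len(without_pred), total - n_with)
    return rng.sample(with_pred, n_with) + rng.sample(without_pred, n_without)


pseudo_labeled = ([(f"query {k}", "label A") for k in range(20)]
                  + [(f"query {k}", None) for k in range(20, 40)])
selected = select_for_annotation(pseudo_labeled, total=10)
```

Keeping some samples without a prediction result in the annotation pool lets annotators catch queries that the standardization model missed.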
When special disease department guidance is performed with the semi-supervised contrastive learning special disease matching model shown in this embodiment of the application, special disease department discovery is first performed using a coarse clustering method, and the special disease department standardization model is built from the unlabeled data obtained by the coarse clustering. The special disease department standardization model is then run on large-scale online chief complaint data entered by users (i.e., query text) to obtain samples that may contain special disease departments (i.e., pseudo-label query text); the results are checked and labeled to obtain labeled query text; and the special disease department guidance model is then built based on variable-weight contrastive learning. This approach can effectively improve the text matching effect of the model and the classification accuracy of the special disease guidance service. Designing the loss function with a variable-weight contrastive learning method also handles multi-label classification better (in a medical scenario, the same chief complaint may belong to several special disease departments). In addition, real department names and large-scale online data are used effectively, which mitigates the high labeling cost caused by the low concentration of special disease department chief complaints in large-scale data.
Further, referring to fig. 6, fig. 6 is a flow chart of a data processing method according to an embodiment of the present application. The method may be performed by a server, or may be performed by a terminal device, or may be performed by a server and a terminal device together, where the server may be the server 20a in the embodiment corresponding to fig. 2, and the terminal device may be the terminal device 20b in the embodiment corresponding to fig. 2. For ease of understanding, embodiments of the present application will be described in terms of this method being performed by a server. The data processing method may include the following steps S301 to S305:
Step S301, obtaining target service query information.
Specifically, the target service query information refers to query information input by an object. For example, in a medical scenario, the object may input "lumbar muscle strain" on a hospital's online platform to find the department to register with; "lumbar muscle strain" is then the target service query information.
Step S302, carrying out service guidance prediction processing on the target service inquiry information through the target service guidance model to obtain a target service class label corresponding to the target service inquiry information.
Specifically, the target business class label may be a number or other character information with identification meaning, and a target business class label may be mapped with a business class.
Step S303, obtaining the real business category information matched with the target business inquiry information.
Specifically, the real business category information refers to the non-standardized business category information currently available for the object to select. For example, if the departments of a hospital include a wound-making clinic, a sports injury clinic, a sleep clinic, and so on, then "wound-making clinic" is one piece of real business category information. Usually more than one piece of real business category information is available for the object to select, so the real business category information matching the target business query information may first be screened out from it.
Optionally, the business category corresponding to the target business category label predicted by the target business guidance model may be a special business category. Each special business category corresponds to one specific type of service and has a narrow coverage, and several special business categories generally belong to the same standard business category. For example, in a medical scenario, the endocrinology department includes special-disease departments such as the obesity and metabolism outpatient service, the thyroid outpatient service, and the diabetes outpatient service; here the endocrinology department is the standard business category, and the three outpatient services are special business categories belonging to it. In this case, a feasible way to obtain the real business category information matching the target business query information is: determine the standard business category matching the target business query information as the target standard business category, and obtain, from all the real business category information, the real business category information belonging to the target standard business category as the real business category information matching the target business query information.
Step S304, carrying out business category rough prediction processing on the real business category information through the target business standardization model to obtain a real business category label corresponding to the real business category information.
Specifically, different enterprises or organizations may use different real business category information for the same business category available for selection by the object. For example, in a medical scenario, the department that treats lumbar muscle injury is called "lumbago clinic" in one hospital and "muscle injury" in another. The obtained real business category information therefore needs to undergo business category rough prediction processing to determine its real business category label. It will be appreciated that a real business category label may be mapped to a standardized business category, and that all real business category information sharing the same real business category label can be represented by that standardized business category.
Step S305, generating real business guiding information according to the real business category information corresponding to the real business category label identical to the target business category label, and displaying the real business guiding information.
Specifically, the real business class information whose real business class label is the same as the target business class label is the business class the object should select, and the real business guiding information can guide the object accordingly. For example, object A submits a query on the online platform of hospital A, and the standardized business category corresponding to the predicted target business category label is "sports injury outpatient"; in hospital A, however, the corresponding outpatient service is actually named "lumbago outpatient", so the real business guiding information may be "you should select the lumbago outpatient".
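Steps S301 to S305 can be illustrated with a minimal sketch. The function name `guide_query`, the two callables standing in for the target service guidance model and the target service standardization model, and the toy label ids are all assumptions made for illustration; they are not taken from the embodiment.

```python
from typing import Callable, List


def guide_query(
    query: str,
    real_categories: List[str],
    predict_target_label: Callable[[str], int],
    predict_real_label: Callable[[str], int],
) -> List[str]:
    """Return guidance messages for real categories whose label equals the query label."""
    # Step S302: predict the target business class label for the query.
    target_label = predict_target_label(query)
    # Steps S303-S304: label every candidate real category with the standardization model.
    matched = [
        name for name in real_categories
        if predict_real_label(name) == target_label
    ]
    # Step S305: build the guidance text from the matching real categories.
    return [f"You should select the {name}" for name in matched]


# Toy lookup tables standing in for the two trained models (hypothetical label ids).
QUERY_LABELS = {"lumbar muscle strain": 7}
CATEGORY_LABELS = {"lumbago outpatient": 7, "sleep clinic": 3}

messages = guide_query(
    "lumbar muscle strain",
    ["lumbago outpatient", "sleep clinic"],
    lambda q: QUERY_LABELS.get(q, -1),
    lambda c: CATEGORY_LABELS.get(c, -2),
)
print(messages)
```

In this sketch the real category is only compared by label, so a hospital-specific name such as "lumbago outpatient" is returned even though the model itself only knows the standardized category.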
To better understand how the data processing method applies to an actual service scenario, its application in a hospital scenario is described below as an example. In a medical scenario, a business category may correspond to a medical department of a hospital. The real departments of different hospitals share a large number of common general departments, such as "internal medicine", "surgery", "orthopedics", "paediatrics", "respiratory medicine", and "gastroenterology", which can serve as standard departments (corresponding to standard business categories). At the same time, different hospitals have a large number of special departments (corresponding to special business categories), such as "obesity and metabolism outpatient service", "diabetes special department outpatient service", and "diabetes retinopathy outpatient service"; these departments usually treat one specific type of disease and can be called special-disease departments or specialty departments. To better help patients seek treatment, a hospital can provide a triage function through its online platform, with a focus on supporting the hospital's special-disease and specialty departments. That is, when a patient enters a query text describing a personal condition (i.e. the target business query information) through the online platform running on a terminal device, the terminal device (such as any terminal device in the terminal device cluster shown in fig. 1 above) can acquire the query text and send it to the background server of the online platform (such as the server 100 shown in fig. 1 above), and the background server can determine the real department corresponding to the query text through a triage module.
The triage module may comprise a special-disease department standardization module and a special-disease guide diagnosis module, where the special-disease department standardization module contains a special-disease department standardization model and the special-disease guide diagnosis module contains a special-disease department guide diagnosis model. The special-disease department standardization model is the target service standardization model described above: when the at least two initial service class samples in the embodiment corresponding to fig. 4 are samples associated with a medical scenario, the target service standardization model finally obtained by the data processing method described in steps S201 to S204 can be used as the special-disease department standardization model. Likewise, the special-disease department guide diagnosis model is the target service guide model described above: when the initial service query sample set in the embodiment corresponding to fig. 3 is a sample set associated with a medical scenario, the target service guide model finally obtained by the data processing method described in steps S101 to S103 can be used as the special-disease department guide diagnosis model.
Further, referring to fig. 7a, fig. 7a is a schematic diagram of the processing flow of the special-disease department standardization module provided in an embodiment of the present application. The processing flow may be executed by the background server or by the terminal device; for ease of understanding, the following description takes execution by the background server as an example.
As shown in fig. 7a, the whole flow includes the following steps S401 to S410:
step S401, the background server acquires a real department.
Specifically, before the target hospital applies the triage module for the first time to match the query text entered by a patient with a real department, it is necessary to determine the department type of each real department contained in the target hospital, and whether that real department corresponds to a standard special-disease department (i.e. a standardized business class) or to a standard department.
In step S402, the background server matches the real department with the rules of specific diseases.
Specifically, the rule of specific diseases may include some constraint rules, for example, a blacklist, and some forbidden words or constraint words may be included in the blacklist, and if the matching between the real department and the blacklist is successful, no further steps need to be performed. The setting of the rules for specific diseases may be determined according to practical situations, and the application is not limited herein. The limitation condition of the department of special diseases can be processed more flexibly through the rule of special diseases.
Specifically, if the real department does not meet the constraint rule, step S403 may be continuously performed.
Step S403, the background server predicts the standard department of specific diseases corresponding to the real department through the standardized model of the department of specific diseases.
Specifically, the background server may predict the real department (i.e., the business class rough prediction process described in the above step S304) through the special department standardization model, so as to obtain the department tag corresponding to the real department (i.e., the real business class tag described in the above step S304) and the hit probability.
In step S404, the background server determines whether to hit the specific disease according to the hit probability.
Specifically, if the hit probability is greater than or equal to the hit threshold, the background server may determine that the hit specific disease is successful, then take the standardized specific disease department corresponding to the department label (corresponding to the standardized business class described in step S304 above) as the standard specific disease department corresponding to the real department, and then execute step S405.
Specifically, if the hit probability is smaller than the hit threshold, the background server may determine that the hit has failed, and execute step S406.
In step S405, the background server may determine that the department type corresponding to the real department=1.
Specifically, when department type=1, the background server writes the real department, the standard special-disease department corresponding to the real department, and the department type (zb_type) corresponding to the real department into the department database, and then executes step S410.
Step S406, the background server carries out entity identification on the real departments, and if the entity result contains diseases or symptoms, the step S407 is executed; otherwise, step S408 is performed.
Step S407, the background server calculates the literal similarity of the real departments and all the standard departments, determines whether the maximum literal similarity is greater than a similarity threshold, and if so, executes step S408; if not, step S409 is performed.
Specifically, the maximum literal similarity refers to the maximum literal similarity in all the literal similarities, and the standard department corresponding to the maximum literal similarity is the similar standard department corresponding to the real department. The similarity threshold may take on a value according to the actual situation, for example, the similarity threshold may be 0.7.
In step S408, the background server may determine that the department type corresponding to the real department=0.
Specifically, department type=0 indicates that the real department does not hit a special-disease department and either contains no entity such as a disease or symptom, or is similar to a standard department. If the real department has a corresponding similar standard department, the real department, its similar standard department, and its department type are written into the department database; if the real department has no corresponding similar standard department, only the real department and its department type are written into the department database.
In step S409, the background server determines that the department type corresponding to the real department=2.
Specifically, department type=2 indicates that the real department is a suspected special-disease department: it belongs to the special-disease departments but is not in the special-disease department system of the existing model. The background server then writes the real department and its department type into the department database. A model manager can subsequently obtain all real departments with department type=2 from the department database and use them to optimize the special-disease department standardization model, so that the model can predict more special-disease departments.
In step S410, the background server outputs the department type corresponding to the real department.
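The decision flow of steps S401 to S410 can be sketched as follows. This is only an illustration under stated assumptions: `HIT_THRESHOLD`, the use of `difflib.SequenceMatcher` as the literal similarity measure, and the callables `standardize` and `has_disease_entity` (standing in for the standardization model and the entity recognizer) are hypothetical; only the similarity threshold 0.7 and the department types 0, 1, and 2 come from the text.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Optional, Tuple

HIT_THRESHOLD = 0.5         # assumed value; the text does not fix the hit threshold
SIMILARITY_THRESHOLD = 0.7  # example value mentioned in the text


def literal_similarity(a: str, b: str) -> float:
    # Stand-in for the unspecified literal similarity measure.
    return SequenceMatcher(None, a, b).ratio()


def classify_department(
    real_dept: str,
    blacklist: List[str],
    standardize: Callable[[str], Tuple[str, float]],
    has_disease_entity: Callable[[str], bool],
    standard_depts: List[str],
) -> Optional[Tuple[int, Optional[str]]]:
    """Return (department type, mapped department) or None if a rule blocks the department."""
    # S402: a blacklist match stops the flow.
    if any(word in real_dept for word in blacklist):
        return None
    # S403-S405: a confident model prediction gives type 1.
    label, prob = standardize(real_dept)
    if prob >= HIT_THRESHOLD:
        return 1, label
    # S406: no disease or symptom entity gives type 0.
    if not has_disease_entity(real_dept):
        return 0, None
    # S407: compare with every standard department.
    scored = [(literal_similarity(real_dept, d), d) for d in standard_depts]
    best_score, best = max(scored, default=(0.0, None))
    if best_score > SIMILARITY_THRESHOLD:
        return 0, best   # S408: similar to a standard department
    return 2, None       # S409: suspected special-disease department


def toy_model(dept: str) -> Tuple[str, float]:
    return ("diabetes outpatient", 0.9) if "diabetes" in dept else ("", 0.1)


def toy_entity(dept: str) -> bool:
    return "thyroid" in dept


hit = classify_department("diabetes specialist clinic", ["VIP"], toy_model, toy_entity, ["surgery"])
blocked = classify_department("VIP clinic", ["VIP"], toy_model, toy_entity, ["surgery"])
plain = classify_department("general clinic", ["VIP"], toy_model, toy_entity, ["surgery"])
similar = classify_department("thyroid surgery", ["VIP"], toy_model, toy_entity, ["surgery", "thyroid surgery"])
suspect = classify_department("thyroid nodule clinic", ["VIP"], toy_model, toy_entity, ["surgery", "pediatrics"])
```

The returned pair mirrors what the text writes to the department database: the department type, plus the standard special-disease department for type 1 or the similar standard department for type 0 when one exists.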
After the special-disease department standardization module has performed steps S401 to S410 on all the real departments in the target hospital, the background server can triage the query text entered by a patient through the special-disease guide diagnosis module. For ease of understanding, please refer to fig. 7b, which is a schematic diagram of the processing flow of the special-disease guide diagnosis module according to an embodiment of the present application. The processing flow may be executed by the background server or by the terminal device; for ease of understanding, the following description takes execution by the background server as an example. As shown in fig. 7b, the whole flow includes the following steps S501-S505:
In step S501, the background server obtains the query text.
Specifically, the patient can input a complaint query, i.e. a query question, through an online platform corresponding to the hospital running in the terminal device, and then the terminal device can generate a query text and send the query text to the background server.
Specifically, the query text may include information such as age and sex of the patient, in addition to the complaint query entered by the patient.
Step S502, the background server predicts the standard department with special diseases corresponding to the query text through the department with special diseases guide and diagnosis model.
Specifically, the background server may run prediction on the query text through the special-disease department guide diagnosis model (i.e. the service guidance prediction processing described in step S302 above) to obtain the predicted department label corresponding to the query text (i.e. the target service class label described in step S302 above) and the hit probability. When the hit probability is greater than the hit threshold, step S503 is executed; when the hit probability is smaller than the hit threshold, the background server determines that the query text has no corresponding special-disease department.
In step S503, the background server matches the query text with the specific disease rule.
In particular, the special-disease rules may include restriction rules such as age restriction rules, gender restriction rules, and blacklists. In some special cases, for example when the patient's age or sex makes the special-disease department corresponding to the predicted department label unsuitable, the recommendation can be restricted by the special-disease rules, so that the recommendation of special-disease departments is more accurate and flexible. If the query text is successfully matched with the special-disease rules, the background server determines that the query text has no corresponding special-disease department; if the query text is not matched with the special-disease rules, the background server can determine the special-disease department corresponding to the query text according to the predicted department label. This department is denoted below as special-disease department (1).
In step S504, the background server obtains the real department matched with special-disease department (1) from the department database.
Specifically, through steps S401-S410 above, the background server has already determined the special-disease department situation of all real departments in the current hospital, so it can match directly against special-disease department (1) and determine the real department (1) that matches it.
In step S505, the background server outputs the real department (1).
Specifically, the background server may return the real department (1) to the patient's terminal device.
Optionally, for the case that the background server determines that the query text does not correspond to a department with a specific disease, the background server may match the query text with a standard department, and determine the standard department corresponding to the query text.
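Steps S501 to S505, together with the optional fallback to a standard department, can be sketched as below. The names `triage`, `guide_model`, `rule_blocks`, and `match_standard_dept`, the value of `HIT_THRESHOLD`, the dictionary layout of the department database, and the toy age rule are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Tuple

HIT_THRESHOLD = 0.5  # assumed value; the text leaves the hit threshold open


def triage(
    query: Dict[str, object],
    guide_model: Callable[[str], Tuple[str, float]],
    rule_blocks: Callable[[Dict[str, object], str], bool],
    dept_db: Dict[str, Tuple[str, int]],
    match_standard_dept: Callable[[Dict[str, object]], List[str]],
) -> List[str]:
    """Return the real departments recommended for a patient query."""
    # S502: predict the special-disease department label and its hit probability.
    label, prob = guide_model(str(query["text"]))
    # S503: restriction rules (age, sex, blacklist) may veto the prediction.
    if prob > HIT_THRESHOLD and not rule_blocks(query, label):
        # S504-S505: look up type-1 real departments mapped to the predicted label.
        return [
            real for real, (std, zb_type) in dept_db.items()
            if zb_type == 1 and std == label
        ]
    # Optional fallback: match the query against standard departments instead.
    return match_standard_dept(query)


# Toy department database: real department -> (mapped department, zb_type).
dept_db = {
    "sugar clinic": ("diabetes outpatient", 1),
    "general medicine": ("internal medicine", 0),
}


def toy_guide_model(text: str) -> Tuple[str, float]:
    return ("diabetes outpatient", 0.9) if "sugar" in text else ("", 0.0)


def toy_rule(query: Dict[str, object], label: str) -> bool:
    # Hypothetical age restriction: this department does not take children.
    return label == "diabetes outpatient" and int(query["age"]) < 14


def toy_fallback(query: Dict[str, object]) -> List[str]:
    return ["internal medicine"]


adult = triage({"text": "high blood sugar", "age": 40, "sex": "F"},
               toy_guide_model, toy_rule, dept_db, toy_fallback)
child = triage({"text": "high blood sugar", "age": 8, "sex": "M"},
               toy_guide_model, toy_rule, dept_db, toy_fallback)
```

Because the department database was filled once by the standardization flow, the lookup at query time is a plain filter and needs no further model call.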
By adopting the method provided by the embodiment of the application, the patient complaint can be intelligently understood, the inquiry requirements of the patient and the clinic resources of the hospital are optimally matched, the recommendation of the real department is performed according to the actual department composition of the hospital, the service flow is optimized, and the medical guiding service efficiency is improved.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. The data processing apparatus may be a computer program (including program code) running on a computer device, for example the data processing apparatus is an application software; the device can be used for executing corresponding steps in the data processing method provided by the embodiment of the application. As shown in fig. 8, the data processing apparatus 1 may include: a first acquisition module 11, a label screening module 12, a sample processing module 13, and a first training module 14.
A first obtaining module 11, configured to obtain service class rough prediction information corresponding to an initial service query sample set;
the labeling screening module 12 is configured to configure a labeling service positive sample set and a labeling service negative sample set for M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information; m is a positive integer;
the sample processing module 13 is configured to perform data preprocessing on the M to-be-processed service query samples, the M labeling service positive sample sets, and the M labeling service negative sample sets, so as to obtain N query sample triples, and a positive weight parameter set and a negative weight parameter set corresponding to each of the N query sample triples; each query sample triplet comprises a service query sample belonging to the M to-be-processed service query samples, a labeling service positive sample belonging to the M labeling service positive sample sets, and a labeling service negative sample belonging to the M labeling service negative sample sets; N is a positive integer greater than or equal to M;
a first training module 14, configured to train the initial service guiding model according to the N query sample triples, the N positive weight parameters in each positive weight parameter set, and the N negative weight parameters in each negative weight parameter set, to obtain a target service guiding model; the target service guide model is used for predicting service class labels corresponding to the service inquiry information; each positive weight parameter is used for controlling the training influence of the similarity between a service inquiry sample and a labeling service positive sample on the initial service guide model; each negative weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service negative sample on the initial service guide model.
The specific implementation manners of the first obtaining module 11, the labeling screening module 12, the sample processing module 13, and the first training module 14 may be referred to the description of step S101 to step S103 in the embodiment corresponding to fig. 3, and will not be repeated here.
The initial service query sample set comprises L unlabeled service query samples; l is a positive integer greater than or equal to M; the business category rough prediction information comprises sample business category rough prediction information corresponding to L unlabeled business inquiry samples respectively;
the annotation screening module 12 comprises: a first screening unit 121, a second screening unit 122, and a generating unit 123.
A first filtering unit 121, configured to traverse sample service class rough prediction information corresponding to each of the L unlabeled service query samples, and sequentially obtain sample service class rough prediction information corresponding to the i-th unlabeled service query sample, as i-th sample service class rough prediction information; i is a positive integer less than or equal to L;
the first filtering unit 121 is further configured to add the i-th unlabeled service query sample to the first prediction result sample set if the i-th sample service class rough prediction information includes a sample service class probability greater than or equal to the service class probability threshold;
The first filtering unit 121 is further configured to add the i-th unlabeled service query sample to the second prediction result sample set if the i-th sample service class rough prediction information does not include a sample service class probability greater than or equal to the service class probability threshold;
the second screening unit 122 is configured to, when the L pieces of sample service class rough prediction information have been traversed, obtain M unlabeled service query samples from the first prediction result sample set as the to-be-processed service query samples, obtain A unlabeled service query samples from the second prediction result sample set as A difficult negative samples, and add the A difficult negative samples to the difficult negative sample set; A is a positive integer, and the proportional relation between A and M meets the preset proportional condition;
the generating unit 123 is configured to configure a labeling service positive sample set and a labeling service negative sample set for the M to-be-processed service query samples according to the difficult negative sample set and the sample service class rough prediction information corresponding to the M to-be-processed service query samples, respectively.
The specific implementation manner of the first filtering unit 121, the second filtering unit 122, and the generating unit 123 may be referred to the description of step S101 in the embodiment corresponding to fig. 3, and will not be described herein.
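The work of the first and second screening units can be sketched as follows. The function names, the dictionary layout of the rough prediction information, and the interpretation of the preset proportional condition as a fixed `ratio` of A to M are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple


def split_by_confidence(
    coarse: Dict[str, Dict[str, float]], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split unlabeled query samples into the first and second prediction result sets."""
    first, second = [], []
    for sample, probs in coarse.items():
        # A sample is confident if any class probability reaches the threshold.
        if any(p >= threshold for p in probs.values()):
            first.append(sample)
        else:
            second.append(sample)
    return first, second


def select_samples(
    first: List[str], second: List[str], m: int, ratio: float, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Pick M to-be-processed samples and A difficult negative samples."""
    rng = random.Random(seed)
    pending = rng.sample(first, min(m, len(first)))
    # Assumed proportional condition: A is about ratio * M, capped by availability.
    a = min(len(second), max(1, round(ratio * len(pending))))
    hard_negatives = rng.sample(second, a)
    return pending, hard_negatives


coarse_info = {
    "q1": {"diabetes outpatient": 0.92, "thyroid outpatient": 0.05},
    "q2": {"diabetes outpatient": 0.40, "thyroid outpatient": 0.35},
    "q3": {"sleep clinic": 0.81},
}
first_set, second_set = split_by_confidence(coarse_info, threshold=0.8)
pending, hard_negatives = select_samples(first_set, second_set, m=2, ratio=0.5)
```

Samples without any confident class end up in the second set, which is why they can later serve as difficult negatives for samples that have no negative class of their own.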
Wherein the M to-be-processed service query samples include a to-be-processed service query sample M_j, where j is a positive integer less than or equal to M; the sample business category rough prediction information corresponding to the to-be-processed service query sample M_j includes B sample business category rough prediction information pairs; one sample business category rough prediction information pair includes one sample business category and one business category probability; B is a positive integer;
the generating unit 123 includes: a first set creation subunit 1231, a set update subunit 1232, and a first set determination subunit 1233.
A first set creation subunit 1231, configured to create an initial positive sample set and an initial negative sample set corresponding to the to-be-processed service query sample M_j; the initial positive sample set and the initial negative sample set are empty sets;
a set update subunit 1232, configured to traverse the B sample business category rough prediction information pairs corresponding to the to-be-processed service query sample M_j, and sequentially acquire the k-th sample business category and the k-th sample business category probability; k is a positive integer less than or equal to B;
the set update subunit 1232 is further configured to, if the k-th sample business category probability is greater than or equal to the business category probability threshold, perform sample matching on the k-th sample business category according to the positive and negative sample matching rule, to obtain a sample matching result;
the set update subunit 1232 is further configured to, if the sample matching result of the k-th sample business category is a positive sample result, take the k-th sample business category as a labeling service positive sample corresponding to the to-be-processed service query sample M_j, and add this labeling service positive sample to the initial positive sample set;
the set update subunit 1232 is further configured to, if the sample matching result of the k-th sample business category is a negative sample result, take the k-th sample business category as a labeling service negative sample corresponding to the to-be-processed service query sample M_j, and add this labeling service negative sample to the initial negative sample set;
the set update subunit 1232 is further configured to, if the B sample business category rough prediction information pairs have been traversed and the initial negative sample set is still an empty set, obtain a difficult negative sample from the difficult negative sample set as a labeling service negative sample corresponding to the to-be-processed service query sample M_j, and add this labeling service negative sample to the initial negative sample set;
a first set determination subunit 1233, configured to determine the initial positive sample set to which the labeling service positive samples corresponding to the to-be-processed service query sample M_j have been added as the labeling service positive sample set corresponding to M_j, and to determine the initial negative sample set to which the labeling service negative samples corresponding to M_j have been added as the labeling service negative sample set corresponding to M_j.
For the specific implementation of the first set creation subunit 1231, the set update subunit 1232, and the first set determination subunit 1233, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which is not repeated here.
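The construction of the positive and negative sample sets for one to-be-processed sample M_j can be sketched as follows. The positive and negative sample matching rule is not described in this passage, so it is passed in as a hypothetical callable `match_rule`; the function name and the toy data are likewise illustrative assumptions.

```python
import random
from typing import Callable, List, Tuple


def build_label_sets(
    pairs: List[Tuple[str, float]],
    threshold: float,
    match_rule: Callable[[str], str],
    hard_negatives: List[str],
    rng: random.Random,
) -> Tuple[List[str], List[str]]:
    """Build the labeled positive and negative sample sets for one query sample."""
    positives: List[str] = []   # initial positive sample set (empty)
    negatives: List[str] = []   # initial negative sample set (empty)
    for category, prob in pairs:
        if prob < threshold:
            continue  # only confident categories are matched against the rule
        result = match_rule(category)
        if result == "positive":
            positives.append(category)
        elif result == "negative":
            negatives.append(category)
    # Fallback: borrow one difficult negative if no negative was found.
    if not negatives and hard_negatives:
        negatives.append(rng.choice(hard_negatives))
    return positives, negatives


def toy_rule(category: str) -> str:
    # Hypothetical rule: only the diabetes outpatient service counts as positive.
    return "positive" if category == "diabetes outpatient" else "negative"


rng = random.Random(0)
pos_a, neg_a = build_label_sets(
    [("diabetes outpatient", 0.9), ("thyroid outpatient", 0.85), ("sleep clinic", 0.2)],
    0.8, toy_rule, ["my knee hurts"], rng,
)
pos_b, neg_b = build_label_sets(
    [("diabetes outpatient", 0.9)], 0.8, toy_rule, ["my knee hurts"], rng,
)
```

The second call shows the fallback: with no confident negative category, the negative set is filled from the difficult negative sample set so that a triplet can still be formed.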
Wherein the sample processing module 13 comprises: pairing unit 131, first traversing unit 132, positive parameter generating unit 133, and negative parameter generating unit 134.
The pairing unit 131 is configured to perform a combined pairing process on the M to-be-processed service query samples, the M labeling service positive sample sets, and the M labeling service negative sample sets, to obtain N query sample triples;
a first traversing unit 132, configured to traverse the N query sample triples, and sequentially obtain the service query sample in the h-th query sample triplet as the target service query sample; h is a positive integer less than or equal to N;
The first traversing unit 132 is further configured to use a labeling service positive sample set corresponding to the target service query sample as a target labeling service positive sample set, and a labeling service negative sample set corresponding to the target service query sample as a target labeling service negative sample set;
a positive parameter generating unit 133, configured to generate a positive weight parameter set corresponding to the target service query sample according to a similarity relationship between the target labeling service positive sample set and the labeling service positive samples respectively included in the N query sample triples;
the negative parameter generating unit 134 is configured to generate a negative weight parameter set corresponding to the target service query sample according to a similarity relationship between the target labeling service negative sample set and labeling service negative samples respectively included in the N query sample triples.
The specific implementation manner of the pairing unit 131, the first traversing unit 132, the positive parameter generating unit 133, and the negative parameter generating unit 134 may be referred to the description of step S102 in the embodiment corresponding to fig. 3, and will not be described herein.
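The combined pairing performed by the pairing unit 131 can be sketched as below. The exact pairing scheme is not given in this passage; the sketch assumes that each query sample is combined with every (positive, negative) pair from its own two sets, which yields N greater than or equal to M whenever both sets are non-empty. The function name is hypothetical.

```python
from itertools import product
from typing import List, Tuple

Triplet = Tuple[str, str, str]


def pair_triplets(
    samples: List[str],
    positive_sets: List[List[str]],
    negative_sets: List[List[str]],
) -> List[Triplet]:
    """Combine each query sample with its positive and negative samples."""
    triplets: List[Triplet] = []
    for query, positives, negatives in zip(samples, positive_sets, negative_sets):
        # Assumed scheme: Cartesian product of the sample's positives and negatives.
        triplets.extend((query, p, n) for p, n in product(positives, negatives))
    return triplets


triplets = pair_triplets(
    ["q1", "q3"],
    [["diabetes outpatient", "endocrinology"], ["sleep clinic"]],
    [["thyroid outpatient"], ["my knee hurts"]],
)
```

Here M is 2 and N is 3, because the first sample contributes two triplets and the second contributes one.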
Wherein the positive parameter generating unit 133 includes: a second set creation subunit 1331, a parameter generation subunit 1332, and a second set determination subunit 1333.
A second set creation subunit 1331 for creating an initial positive weight parameter set; the initial positive weight parameter set is an empty set;
a parameter generating subunit 1332, configured to take, as the first positive sample number, the number of labeling service positive samples included in the target labeling service positive sample set;
the parameter generating subunit 1332 is further configured to traverse the labeling service positive samples respectively included in the N query sample triples, and sequentially obtain a g-th labeling service positive sample; g is a positive integer less than or equal to N;
the parameter generating subunit 1332 is further configured to use, as the second positive sample number, the number of labeling service positive samples different from the g-th labeling service positive sample in the target labeling service positive sample set;
the parameter generating subunit 1332 is further configured to determine, according to the first positive sample number and the second positive sample number, a weight parameter for characterizing a similarity relationship between the target service query sample and the g-th labeled service positive sample;
a parameter generation subunit 1332, configured to add a weight parameter between the target service query sample and the g-th labeled service positive sample to the initial positive weight parameter set;
the second set determining subunit 1333 is configured to, when the N positive samples of the labeling service have been traversed, use an initial positive weight parameter set that includes weight parameters between the target service query sample and the N positive samples of the labeling service as a positive weight parameter set corresponding to the target service query sample.
The specific implementation manner of the second set creating subunit 1331, the parameter generating subunit 1332, and the second set determining subunit 1333 may refer to the description of step S102 in the embodiment corresponding to fig. 3, and will not be repeated here.
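The traversal carried out by the parameter generating subunit 1332 can be sketched as follows. This passage only states that the weight is determined from the first and second positive sample numbers; the concrete formula used here, (first - second) / first, is an assumption chosen for illustration, as are the function and variable names.

```python
from typing import List, Set


def positive_weights(target_positive_set: Set[str], batch_positives: List[str]) -> List[float]:
    """Return one weight per positive sample found in the N query sample triplets."""
    weights: List[float] = []                 # initial positive weight parameter set
    first = len(target_positive_set)          # first positive sample number
    for g_positive in batch_positives:
        # Second positive sample number: members of the target set that differ
        # from the g-th labeled positive sample.
        second = sum(1 for p in target_positive_set if p != g_positive)
        # Assumed formula: 1/first if the g-th sample is in the target set, else 0.
        weight = (first - second) / first if first else 0.0
        weights.append(weight)
    return weights


weights = positive_weights({"a", "b"}, ["a", "c", "b"])
```

Under this assumed formula, positives that belong to the target sample's own positive set share the weight evenly, while positives of other samples in the batch get zero weight.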
Wherein the first training module 14 comprises: encoding section 141, sample determination section 142, loss determination section 143, parameter adjustment section 144, and model determination section 145.
The encoding unit 141 is configured to perform feature encoding processing on the N query sample triples through the initial service guide model, so as to obtain query sample vector triples corresponding to the N query sample triples respectively; one query sample vector triplet comprises a service query sample vector, a labeling service positive sample vector and a labeling service negative sample vector;
the sample determining unit 142 is configured to traverse the N query sample triples, sequentially obtain the f-th query sample triplet, use the service query sample in the f-th query sample triplet as the f-th service query sample, and use the labeling service positive sample in the f-th query sample triplet as the f-th labeling service positive sample;
a loss determining unit 143, configured to determine a loss function value corresponding to the f-th service query sample according to the N query sample vector triples, the positive weight parameter set and the negative weight parameter set corresponding to the f-th query sample triplet;
The parameter adjustment unit 144 is configured to perform model parameter adjustment on the initial service guide model according to the loss function values corresponding to the N service query samples when the N query sample triples have been traversed;
the model determining unit 145 is configured to take the adjusted initial service guide model as the target service guide model if the adjusted initial service guide model meets the model convergence condition.
The specific implementation manners of the encoding unit 141, the sample determining unit 142, the loss determining unit 143, the parameter adjusting unit 144, and the model determining unit 145 may be referred to the description of step S103 in the embodiment corresponding to fig. 3, and will not be described herein.
Wherein the loss determination unit 143 includes: a similarity determination subunit 1431 and a loss determination subunit 1432.
A similarity determining subunit 1431, configured to obtain, from the N query sample vector triples, a query sample vector triplet corresponding to the f-th query sample triplet, as a target query sample vector triplet;
the similarity determining subunit 1431 is further configured to determine, according to the service query sample vector and the labeling service positive sample vector included in the target query sample vector triplet, a first similarity between the f-th service query sample and the f-th labeling service positive sample;
The similarity determining subunit 1431 is further configured to determine, according to the service query sample vector in the target query sample vector triplet and each labeled service positive sample vector in the N query sample triplets, a second similarity between the f-th service query sample and each labeled service positive sample in the N query sample triplets;
the similarity determining subunit 1431 is further configured to determine a third similarity between the f-th service query sample and each labeled service negative sample in the N query sample triples according to the service query sample vector in the target query sample vector triplet and each labeled service negative sample vector in the N query sample triples;
the similarity determining subunit 1431 is further configured to perform weight adjustment on the N second similarities according to N positive weight parameters in the positive weight parameter set corresponding to the f-th query sample triplet, so as to obtain N first weight similarities;
the similarity determining subunit 1431 is further configured to perform weight adjustment on the N third similarities according to N negative weight parameters in the negative weight parameter set corresponding to the f-th query sample triplet, to obtain N second weight similarities;
the loss determination subunit 1432 is configured to determine a loss function value corresponding to the f-th service query sample according to the first similarity, the N first weight similarities, and the N second weight similarities.
For a specific implementation manner of the similarity determining subunit 1431 and the loss determining subunit 1432, reference may be made to the description of the steps S101 to S103 in the embodiment corresponding to fig. 3, which will not be repeated here.
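As an illustration of how the loss determination subunit 1432 might combine the first similarity with the two groups of weighted similarities, the following is a minimal sketch and not the formula of the embodiment. It assumes an InfoNCE-style loss in which each weight multiplies an exponentiated similarity; the function name `query_loss`, the `temperature` parameter and the multiplicative form of the weight adjustment are all assumptions.

```python
import math


def query_loss(f, pos_sims, neg_sims, pos_weights, neg_weights, temperature=0.1):
    """Loss for the f-th service query sample (hypothetical form).

    pos_sims[j]: similarity between query f and the j-th labeled positive sample.
    neg_sims[j]: similarity between query f and the j-th labeled negative sample.
    pos_weights / neg_weights: the N positive / N negative weight parameters.
    """
    # First similarity: query f against its own labeled positive sample.
    first = math.exp(pos_sims[f] / temperature)
    # Assumed weight adjustment: scale each exponentiated similarity.
    weighted_pos = sum(w * math.exp(s / temperature)
                       for s, w in zip(pos_sims, pos_weights))
    weighted_neg = sum(w * math.exp(s / temperature)
                       for s, w in zip(neg_sims, neg_weights))
    return -math.log(first / (weighted_pos + weighted_neg))
```

Under this assumed form, a weight of zero removes a sample from the denominator, so a positive sample that is also valid for the query no longer acts as a negative.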
The initial service query sample set comprises L unlabeled service query samples; L is a positive integer greater than or equal to M;
the first acquisition module 11 includes: the model acquisition unit 111 and the coarse prediction unit 112.
A model acquisition unit 111 for acquiring a target service standardization model; the target service standardization model is obtained by training an initial service standardization model based on H service class coarse clustering sets; H is a positive integer;
the coarse prediction unit 112 is configured to perform a service class coarse prediction process on the L unlabeled service query samples through the target service standardization model, obtain service class coarse prediction information corresponding to the L unlabeled service query samples, and use the service class coarse prediction information of the L samples as service class coarse prediction information corresponding to the initial service query sample set.
For a specific implementation manner of the model obtaining unit 111 and the coarse prediction unit 112, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, and the description will not be repeated here.
Wherein the above-mentioned data processing apparatus 1 further comprises: a second acquisition module 15, a coarse clustering module 16, a set construction module 17 and a second training module 18.
A second obtaining module 15, configured to obtain at least two initial service class samples;
the coarse clustering module 16 is configured to perform coarse clustering on the at least two initial service class samples to obtain H service class coarse clustering sets;
the set construction module 17 is configured to construct P service class sample sets according to the H service class coarse clustering sets; P is a positive integer; each service class sample set comprises H service class sample pairs; the service class samples in the service class sample pairs belonging to the same service class sample set respectively belong to different service class coarse clustering sets, and each service class sample pair comprises a service class sample and the service class positive sample corresponding to that service class sample; a service class positive sample is the service class sample with the highest frequency in a service class coarse clustering set;
the second training module 18 is configured to perform iterative training on the initial service standardization model according to the P service class sample sets, so as to obtain a target service standardization model.
The specific implementation manners of the second obtaining module 15, the coarse clustering module 16, the set construction module 17 and the second training module 18 may be referred to the description of step S201 to step S204 in the embodiment corresponding to fig. 4, and will not be described herein.
Wherein the coarse clustering module 16 comprises: a deduplication unit 161, a high frequency determination unit 162, a filter update unit 163, and a coarse clustering unit 164.
A deduplication unit 161, configured to perform deduplication processing on the at least two initial service class samples to obtain X deduplicated service class samples and the frequency corresponding to each of the X deduplicated service class samples; the frequency corresponding to a deduplicated service class sample is the number of initial service class samples, among the at least two initial service class samples, that are identical to that deduplicated service class sample; X is a positive integer;
a high frequency determining unit 162, configured to obtain, from the X deduplicated service class samples, the deduplicated service class samples whose frequency is greater than or equal to a frequency threshold, so as to obtain Y high-frequency service class samples; Y is a positive integer less than or equal to X;
a filtering and updating unit 163, configured to perform filtering and updating processing on the Y high-frequency service class samples according to the standard service class dictionary, the stop word dictionary and the auxiliary service class dictionary, so as to obtain Z service class samples; Z is a positive integer less than or equal to Y;
The coarse clustering unit 164 is configured to perform coarse clustering on the Z service class samples to obtain H service class coarse clustering sets.
The specific implementation manners of the deduplication unit 161, the high frequency determination unit 162, the filtering update unit 163, and the coarse clustering unit 164 may be referred to the description of step S202 in the embodiment corresponding to fig. 4, and will not be repeated here.
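The deduplication and high-frequency selection performed by units 161 and 162 can be sketched as follows. This is only an illustrative sketch: the function name `high_frequency_samples` and the example strings are hypothetical, and exact string equality is assumed as the criterion for two samples being the same.

```python
from collections import Counter


def high_frequency_samples(initial_samples, frequency_threshold):
    """Deduplicate samples, count frequencies, keep the frequent ones.

    Returns (counts, high): counts maps each of the X deduplicated samples
    to its frequency; high keeps the Y samples at or above the threshold.
    """
    # Counter deduplicates by exact equality and records the frequency.
    counts = Counter(initial_samples)
    high = {s: n for s, n in counts.items() if n >= frequency_threshold}
    return counts, high


samples = ["cardiology", "cardiology", "ent", "cardiology", "ent", "derm"]
counts, high = high_frequency_samples(samples, frequency_threshold=2)
```

Here X is `len(counts)` and Y is `len(high)`, so Y can never exceed X.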
Wherein the filtering update unit 163 includes: dictionary matching sub-unit 1631 and update determination sub-unit 1632.
Dictionary matching sub-unit 1631, configured to traverse the Y high-frequency service class samples and sequentially obtain an e-th high-frequency service class sample; e is a positive integer less than or equal to Y;
the dictionary matching subunit 1631 is further configured to match the e-th high-frequency service class sample with the auxiliary service class dictionary;
the dictionary matching subunit 1631 is further configured to match the e-th high-frequency service class sample with the stop word dictionary if the e-th high-frequency service class sample fails to match the auxiliary service class dictionary;
the dictionary matching subunit 1631 is further configured to, if the e-th high-frequency service class sample is successfully matched with the stop word dictionary, use the stop word in the stop word dictionary that matches the e-th high-frequency service class sample as a target stop word, and remove the target stop word from the e-th high-frequency service class sample to obtain an e-th transition high-frequency service class sample;
the dictionary matching subunit 1631 is further configured to, if the e-th high-frequency service class sample fails to match the stop word dictionary, use the e-th high-frequency service class sample as the e-th transition high-frequency service class sample;
the dictionary matching subunit 1631 is further configured to match the e-th transition high-frequency service class sample with the standard service class dictionary;
the dictionary matching subunit 1631 is further configured to, if the e-th transition high-frequency service class sample fails to match the standard service class dictionary, use the e-th transition high-frequency service class sample as a service class sample;
the update determination subunit 1632 is configured to obtain the Z service class samples when the Y high-frequency service class samples have been traversed.
The specific implementation manner of the dictionary matching sub-unit 1631 and the update determination sub-unit 1632 may refer to the description of step S202 in the embodiment corresponding to fig. 4, and will not be described herein.
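The three-dictionary filtering flow of the dictionary matching subunit 1631 can be sketched as below. The sketch rests on several assumptions that this passage does not state: a sample that matches the auxiliary dictionary or the standard dictionary is dropped, dictionary matching is exact membership, stop-word matching is substring containment, and a sample left empty after stop-word removal is discarded. The function name and example strings are hypothetical.

```python
def filter_and_update(high_freq_samples, standard_dict, stop_words, auxiliary_dict):
    """Filter Y high-frequency samples down to Z service class samples."""
    kept = []
    for sample in high_freq_samples:
        # Assumption: a hit in the auxiliary dictionary filters the sample out.
        if sample in auxiliary_dict:
            continue
        # Remove every matching stop word to get the transition sample.
        transition = sample
        for word in stop_words:
            if word in transition:
                transition = transition.replace(word, "").strip()
        # Keep only samples that fail to match the standard dictionary.
        if transition and transition not in standard_dict:
            kept.append(transition)
    return kept


result = filter_and_update(
    ["outpatient cardiology clinic", "registration desk", "cardiology", "pain clinic"],
    standard_dict={"cardiology"},
    stop_words=["outpatient ", " clinic"],
    auxiliary_dict={"registration desk"},
)
```

Because samples can only be dropped or shortened, the number of returned samples Z is at most Y.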
Wherein the coarse clustering unit 164 includes: a preprocessing subunit 1641, an iterative coarse clustering subunit 1642, and a coarse cluster set construction subunit 1643.
A preprocessing subunit 1641, configured to obtain B service class samples with the highest frequency among the Z service class samples, as B seed service class samples, and add the B seed service class samples to the initial seed service class sample set;
The preprocessing subunit 1641 is further configured to use a service class sample, from among the Z service class samples, from which the B service class samples are removed, as an iterative service class sample;
an iterative coarse clustering subunit 1642, configured to perform iterative coarse clustering on the initial seed service class sample set according to the iterative service class samples, to obtain a coarse clustered seed service class sample set; the coarse cluster seed service class sample set comprises H seed service class samples; each seed traffic class sample is associated with a similar traffic class sample;
the coarse cluster set construction subunit 1643 is configured to perform coarse cluster set construction according to each seed service class sample in the coarse cluster seed service class sample set and the similar service class samples associated with each seed service class sample, respectively, to obtain H service class coarse cluster sets.
For specific implementation manners of the preprocessing subunit 1641, the iterative coarse clustering subunit 1642, and the coarse clustering set constructing subunit 1643, reference may be made to the description of step S202 in the embodiment corresponding to fig. 4, and the details will not be repeated here.
The iterative coarse clustering subunit 1642 is specifically configured to: sequentially obtain a d-th iterative service class sample from the iterative service class samples, where d is a positive integer less than or equal to Z-B; calculate the distance similarity between the d-th iterative service class sample and each seed service class sample in the iterative seed service class sample set of the (d-1)-th round, where, when d is 1, the iterative seed service class sample set of the (d-1)-th round is the initial seed service class sample set; if the maximum of the distance similarities is smaller than the target similarity threshold, add the d-th iterative service class sample to the iterative seed service class sample set of the (d-1)-th round to obtain the iterative seed service class sample set of the d-th round, and continue to obtain the (d+1)-th iterative service class sample to perform coarse clustering on the iterative seed service class sample set of the d-th round, until the iterative seed service class sample set of the (Z-B)-th round is taken as the coarse cluster seed service class sample set; if the maximum of the distance similarities is greater than or equal to the target similarity threshold, determine the d-th iterative service class sample as a similar service class sample associated with the seed service class sample corresponding to the maximum distance similarity, take the iterative seed service class sample set of the (d-1)-th round as the iterative seed service class sample set of the d-th round, and continue to obtain the (d+1)-th iterative service class sample to perform coarse clustering on the iterative seed service class sample set of the d-th round, until the iterative seed service class sample set of the (Z-B)-th round is taken as the coarse cluster seed service class sample set.
For a specific implementation manner of the iterative coarse clustering subunit 1642, reference may be made to the description of step S202 in the embodiment corresponding to fig. 4, and a detailed description will not be given here.
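The iterative coarse clustering described above can be sketched as a single pass over the iterative samples. This is a minimal sketch under assumptions: the distance similarity is stood in for by `difflib.SequenceMatcher.ratio()`, which is not necessarily the measure used in the embodiment, and the names `similarity` and `coarse_cluster` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in for the "distance similarity"; the real measure is not given here.
    return SequenceMatcher(None, a, b).ratio()


def coarse_cluster(seed_samples, iterative_samples, threshold):
    """Map each seed sample to the similar samples associated with it."""
    clusters = {seed: [] for seed in seed_samples}
    for sample in iterative_samples:
        # Find the current seed most similar to this sample.
        best_seed = max(clusters, key=lambda seed: similarity(sample, seed))
        if similarity(sample, best_seed) >= threshold:
            # Similar enough: attach the sample to that seed.
            clusters[best_seed].append(sample)
        else:
            # Too dissimilar to every seed: the sample becomes a new seed.
            clusters[sample] = []
    return clusters


clusters = coarse_cluster(["cardiology"], ["cardiology dept", "ent"], threshold=0.7)
```

Each key of the returned dictionary together with its list forms one coarse clustering set, so the number of keys corresponds to H.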
Wherein the second training module 18 comprises: a sample pair determination unit 181, an iterative encoding unit 182, an iterative loss determination unit 183, and an iterative training unit 184.
A sample pair determining unit 181, configured to sequentially obtain a q-th service class sample set from the P service class sample sets; q is a positive integer less than or equal to P;
the sample pair determining unit 181 is further configured to use H service class sample pairs in the q-th service class sample set as H service class sample pairs to be processed;
the iterative coding unit 182 is configured to perform feature coding processing on the H to-be-processed service class sample pairs through the q-1 th round of iterative service standardization model to obtain H service class vector pairs; a traffic class vector pair comprising a traffic class sample vector and a traffic class positive sample vector;
an iteration loss determining unit 183, configured to determine loss function values respectively corresponding to service class samples included in the H pairs of to-be-processed service class samples according to service class sample vectors and service class positive sample vectors in each pair of service class vectors, and negative sample relationships between different pairs of service class vectors;
The iteration training unit 184 is configured to train the (q-1)-th round iterative service standardization model according to the H loss function values, so as to obtain the q-th round iterative service standardization model;
the iteration training unit 184 is further configured to, if q is equal to P, use the q-th round iterative service standardization model as the target service standardization model;
the iteration training unit 184 is further configured to, if q is smaller than P, continue to obtain the (q+1)-th service class sample set, and train the q-th round iterative service standardization model through the (q+1)-th service class sample set.
The specific implementation manners of the sample pair determining unit 181, the iterative encoding unit 182, the iterative loss determining unit 183, and the iterative training unit 184 may be referred to the description of step S204 in the embodiment corresponding to fig. 4, and will not be described herein.
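To make the "negative sample relationships between different service class vector pairs" concrete, the sketch below treats the positive sample vectors of the other pairs in the same set as negatives for each sample, in the style of an in-batch contrastive loss. This form, the cosine similarity, the `temperature` value and the function names are assumptions made for illustration, not the loss of the embodiment.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_losses(sample_vecs, positive_vecs, temperature=0.05):
    """One loss value per service class sample in a set of H pairs."""
    losses = []
    for i, query in enumerate(sample_vecs):
        # Similarity of sample i to every positive vector in the set;
        # index i is its own positive, all other indices act as negatives.
        logits = [cosine(query, pos) / temperature for pos in positive_vecs]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        losses.append(log_denominator - logits[i])
    return losses


aligned = in_batch_losses([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_losses([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this sketch the loss is small when each sample vector is closest to its own positive (`aligned`) and large when it is closer to another pair's positive (`swapped`), which is why samples within one set are drawn from different coarse clustering sets.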
Wherein the above-mentioned data processing apparatus 1 further comprises: a service guiding module 19.
A service guiding module 19, configured to obtain target service query information;
the service guiding module 19 is further configured to perform service guiding prediction processing on the target service query information through the target service guiding model, so as to obtain a target service class label corresponding to the target service query information;
The service guiding module 19 is further configured to obtain real service class information matched with the target service query information;
the service guiding module 19 is further configured to perform service class rough prediction processing on the real service class information through the target service standardization model, so as to obtain a real service class label corresponding to the real service class information;
the service guiding module 19 is further configured to generate real service guiding information according to the real service class information corresponding to the same real service class label as the target service class label, and display the real service guiding information.
For a specific implementation manner of the service guiding module 19, refer to the description of step S301 to step S305 in the embodiment corresponding to fig. 6, which will not be described herein.
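The flow of the service guiding module 19 can be sketched as follows, with both models replaced by plain callables. The names `guide_service`, `guide_model` and `standardize` are hypothetical stand-ins for the target service guide model and the target service standardization model, and the example data is invented for illustration.

```python
def guide_service(query, guide_model, standardize, real_categories):
    """Return the real service categories whose label matches the query's label."""
    # Predict the target service class label for the query information.
    target_label = guide_model(query)
    # Map each real category to its label and keep those that agree.
    return [c for c in real_categories if standardize(c) == target_label]


# Stub models used only to show the call shape.
label_of = {"Cardiology Dept. (East)": "cardiology", "ENT Clinic": "ent"}
matches = guide_service(
    "chest pain when climbing stairs",
    guide_model=lambda q: "cardiology",
    standardize=label_of.get,
    real_categories=list(label_of),
)
```

The returned list is what the real service guiding information would be generated from before it is displayed.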
Referring to fig. 9, fig. 9 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 9, the data processing apparatus 1 in the embodiment corresponding to fig. 8 described above may be applied to a computer device 1000, and the computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005. In addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display and a keyboard, and may optionally further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, such as at least one disk memory. The memory 1005 may also optionally be at least one storage device located remotely from the processor 1001. As shown in fig. 9, the memory 1005, which is one type of computer-readable storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 1000 shown in fig. 9, the network interface 1004 may provide a network communication function; the user interface 1003 is primarily used to provide an input interface for a user; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement the data processing method described in the foregoing embodiments.
It should be understood that the computer device 1000 described in the embodiments of the present application may perform the data processing method described in any of the foregoing embodiments corresponding to fig. 3, which is not repeated herein. The description of the beneficial effects of the same method is likewise not repeated.
Furthermore, it should be noted here that the embodiments of the present application further provide a computer-readable storage medium, which stores the computer program executed by the aforementioned data processing apparatus 1. The computer program includes program instructions that, when executed by the aforementioned processor, can perform the data processing method described in any of the foregoing embodiments corresponding to fig. 3, so a detailed description is not given here. The description of the beneficial effects of the same method is likewise not repeated. For technical details not disclosed in the embodiments of the computer-readable storage medium according to the present application, please refer to the description of the method embodiments of the present application.
The computer-readable storage medium may be an internal storage unit of the data processing apparatus provided in any one of the foregoing embodiments or of the computer device, for example, a hard disk or a memory of the computer device. The computer-readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Furthermore, it should be noted here that embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the method provided in the embodiment corresponding to either fig. 3 or fig. 4 above.
The terms first, second and the like in the description and in the claims and drawings of the embodiments of the present application are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the term "include" and any variations thereof is intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the list of steps or modules but may, in the alternative, include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, as computer software, or as a combination of the two, and that the elements and steps of the examples have been described above generally in terms of their functions in order to clearly illustrate the interchangeability of hardware and software. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the solution. The skilled person may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the present application.
The foregoing disclosure is only illustrative of the preferred embodiments of the present application and is not intended to limit the scope of the claims of the present application; equivalent variations made in accordance with the claims of the present application therefore still fall within the scope of the present application.

Claims (18)

1. A method of data processing, comprising:
acquiring business category rough prediction information corresponding to an initial business query sample set, and respectively configuring a labeling business positive sample set and a labeling business negative sample set for M business query samples to be processed in the initial business query sample set according to the business category rough prediction information; M is a positive integer;
combining and pairing the M to-be-processed service query samples, M labeling service positive sample sets and M labeling service negative sample sets to obtain N query sample triples, and determining positive weight parameter sets and negative weight parameter sets respectively corresponding to the N query sample triples according to the N query sample triples, the M labeling service positive sample sets and the M labeling service negative sample sets; each query sample triplet comprises a service query sample belonging to the M service query samples to be processed, a labeling service positive sample belonging to the M labeling service positive sample sets, and a labeling service negative sample belonging to the M labeling service negative sample sets; N is a positive integer greater than or equal to M;
Training an initial service guide model according to the N query sample triples, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set to obtain a target service guide model; the target service guide model is used for predicting service class labels corresponding to service inquiry information; each positive weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service positive sample on the initial service guide model; each negative weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service negative sample on the initial service guide model.
2. The method of claim 1, wherein the initial set of service query samples comprises L unlabeled service query samples; L is a positive integer greater than or equal to M; the business category rough prediction information comprises sample business category rough prediction information respectively corresponding to the L unlabeled business query samples;
the configuring a labeling service positive sample set and a labeling service negative sample set for M to-be-processed service query samples in the initial service query sample set according to the service category rough prediction information respectively includes:
Traversing sample business category rough prediction information corresponding to the L non-marked business query samples respectively, and sequentially acquiring sample business category rough prediction information corresponding to the ith non-marked business query sample as the ith sample business category rough prediction information; i is a positive integer less than or equal to L;
if the i sample business category rough prediction information comprises sample business category probability which is greater than or equal to a business category probability threshold value, adding the i unlabeled business query sample to a first prediction result sample set;
if the i sample business category rough prediction information does not contain the sample business category probability which is greater than or equal to the business category probability threshold value, adding the i unlabeled business query sample to a second prediction result sample set;
when the L pieces of sample business category rough prediction information have been traversed, M unlabeled business query samples are obtained from the first prediction result sample set to serve as the business query samples to be processed, A unlabeled business query samples are obtained from the second prediction result sample set to serve as A difficult negative samples, and the A difficult negative samples are added to a difficult negative sample set; A is a positive integer, and the proportional relation between A and M meets a preset proportional condition;
And respectively configuring a labeling service positive sample set and a labeling service negative sample set for the M to-be-processed service query samples according to the difficult negative sample set and sample service category rough prediction information respectively corresponding to the M to-be-processed service query samples.
3. The method of claim 2, wherein the M to-be-processed service query samples comprise a to-be-processed service query sample M_j; j is a positive integer less than or equal to M; the sample business category rough prediction information corresponding to the to-be-processed service query sample M_j comprises B sample business category rough prediction information pairs; each sample business category rough prediction information pair comprises one sample service class and one service class probability; B is a positive integer;
the configuring a labeling service positive sample set and a labeling service negative sample set for the M to-be-processed service query samples respectively, according to the difficult negative sample set and the sample business category rough prediction information respectively corresponding to the M to-be-processed service query samples, includes:
creating an initial positive sample set and an initial negative sample set corresponding to the to-be-processed service query sample M_j; the initial positive sample set and the initial negative sample set are empty sets;
traversing the B sample business category rough prediction information pairs corresponding to the to-be-processed service query sample M_j, and sequentially acquiring the k-th sample service class and the k-th sample service class probability; k is a positive integer less than or equal to B;
if the k-th sample service class probability is greater than or equal to the service class probability threshold, performing sample matching on the k-th sample service class according to positive and negative sample matching rules to obtain a sample matching result;
if the sample matching result of the k-th sample service class is a positive sample result, taking the k-th sample service class as a labeling service positive sample corresponding to the to-be-processed service query sample M_j, and adding the labeling service positive sample corresponding to the to-be-processed service query sample M_j to the initial positive sample set;
if the sample matching result of the k-th sample service class is a negative sample result, taking the k-th sample service class as a labeling service negative sample corresponding to the to-be-processed service query sample M_j, and adding the labeling service negative sample corresponding to the to-be-processed service query sample M_j to the initial negative sample set;
if the B sample business category rough prediction information pairs have been traversed and the initial negative sample set is an empty set, acquiring a difficult negative sample from the difficult negative sample set as a labeling service negative sample corresponding to the to-be-processed service query sample M_j, and adding the labeling service negative sample corresponding to the to-be-processed service query sample M_j to the initial negative sample set;
determining the initial positive sample set to which the labeling service positive samples corresponding to the to-be-processed service query sample M_j have been added as the labeling service positive sample set corresponding to the to-be-processed service query sample M_j, and determining the initial negative sample set to which the labeling service negative samples corresponding to the to-be-processed service query sample M_j have been added as the labeling service negative sample set corresponding to the to-be-processed service query sample M_j.
4. The method according to claim 1, wherein the performing a combination pairing process on the M to-be-processed service query samples, the M labeling service positive sample sets, and the M labeling service negative sample sets to obtain N query sample triples, and determining a positive weight parameter set and a negative weight parameter set corresponding to the N query sample triples respectively according to the N query sample triples, the M labeling service positive sample sets, and the M labeling service negative sample sets includes:
combining and pairing the M to-be-processed service query samples, M labeling service positive sample sets and M labeling service negative sample sets to obtain N query sample triples;
traversing the N query sample triples, and sequentially acquiring the service query sample in the h-th query sample triplet as a target service query sample; h is a positive integer less than or equal to N;
the labeling service positive sample set corresponding to the target service query sample is used as a target labeling service positive sample set, and the labeling service negative sample set corresponding to the target service query sample is used as a target labeling service negative sample set;
generating a positive weight parameter set corresponding to the target service query sample according to the similarity relation between the target labeling service positive sample set and labeling service positive samples respectively included by the N query sample triples;
and generating a negative weight parameter set corresponding to the target service query sample according to the similarity relation between the target labeling service negative sample set and the labeling service negative samples respectively included by the N query sample triples.
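As a hedged illustration of the combination pairing step, the sketch below pairs every labeling service positive sample of a query sample with every labeling service negative sample of the same query sample. The claim only states that combination pairing is performed; the Cartesian-product pairing and the name `build_triples` are assumptions.

```python
from itertools import product

def build_triples(queries, positive_sets, negative_sets):
    """Return (query, positive, negative) triples for M query samples.

    Assumes positive_sets[i] and negative_sets[i] belong to queries[i].
    """
    triples = []
    for query, positives, negatives in zip(queries, positive_sets, negative_sets):
        # Every positive is combined with every negative of the same query.
        for positive, negative in product(positives, negatives):
            triples.append((query, positive, negative))
    return triples
```

When each sample set is non-empty, every query sample contributes at least one triplet, which is consistent with N being greater than or equal to M.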
5. The method of claim 4, wherein the generating the positive weight parameter set corresponding to the target service query sample according to the similarity relationship between the target labeling service positive sample set and the labeling service positive samples respectively included in the N query sample triples includes:
Creating an initial positive weight parameter set; the initial positive weight parameter set is an empty set;
taking the number of labeling service positive samples contained in the target labeling service positive sample set as a first positive sample number;
traversing the labeling service positive samples respectively included in the N query sample triples, and sequentially acquiring a g-th labeling service positive sample; g is a positive integer less than or equal to N;
taking the number of labeling service positive samples in the target labeling service positive sample set that are different from the g-th labeling service positive sample as a second positive sample number;
determining, according to the first positive sample number and the second positive sample number, a weight parameter used for representing the similarity relation between the target service query sample and the g-th labeling service positive sample;
adding the weight parameter between the target service query sample and the g-th labeling service positive sample to the initial positive weight parameter set;
when the N labeling service positive samples have been traversed, taking the initial positive weight parameter set containing the weight parameters between the target service query sample and the N labeling service positive samples as the positive weight parameter set corresponding to the target service query sample.
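The counting steps above can be sketched as follows. The claim does not state how the weight parameter is derived from the two counts, so the default `default_weight` (second count divided by first count) is purely an assumed placeholder, and all names are hypothetical.

```python
def default_weight(first_count, second_count):
    # ASSUMPTION: the claim gives no formula; this ratio is illustrative only.
    return second_count / first_count if first_count else 0.0

def positive_weights(target_positive_set, batch_positives, weight_fn=default_weight):
    """Compute one weight per labeling service positive sample in the N triples."""
    weights = []  # the initial positive weight parameter set starts empty
    first_count = len(target_positive_set)
    for g_positive in batch_positives:
        # Second positive sample number: members that differ from the g-th positive.
        second_count = sum(1 for p in target_positive_set if p != g_positive)
        weights.append(weight_fn(first_count, second_count))
    return weights
```

Under the assumed ratio, a positive sample that already belongs to the target set receives a smaller weight than one that does not; a different `weight_fn` can be supplied without changing the counting logic.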
6. The method of claim 1, wherein the training the initial service guide model according to the N query sample triples, the N positive weight parameters in each positive weight parameter set, and the N negative weight parameters in each negative weight parameter set to obtain a target service guide model comprises:
performing feature coding processing on the N query sample triples through an initial service guide model to obtain query sample vector triples respectively corresponding to the N query sample triples; one query sample vector triplet comprises a service query sample vector, a labeling service positive sample vector and a labeling service negative sample vector;
traversing the N query sample triples, sequentially acquiring an f-th query sample triplet, taking the service query sample in the f-th query sample triplet as an f-th service query sample, and taking the labeling service positive sample in the f-th query sample triplet as an f-th labeling service positive sample;
determining a loss function value corresponding to the f-th service query sample according to the N query sample vector triples and the positive weight parameter set and the negative weight parameter set corresponding to the f-th query sample triplet;
When the N query sample triples are traversed, model parameter adjustment is carried out on the initial service guide model according to the loss function values respectively corresponding to the N service query samples;
and if the adjusted initial service guide model meets the model convergence condition, taking the adjusted initial service guide model as a target service guide model.
7. The method of claim 6, wherein determining the loss function value for the f-th service query sample based on the N query sample vector triples, the positive weight parameter set and the negative weight parameter set for the f-th query sample triplet, comprises:
acquiring a query sample vector triplet corresponding to the f-th query sample triplet from the N query sample vector triples, and taking the query sample vector triplet as a target query sample vector triplet;
determining a first similarity between the f-th service query sample and the f-th labeling service positive sample according to the service query sample vector and the labeling service positive sample vector included in the target query sample vector triplet;
determining a second similarity between the f-th service query sample and each labeling service positive sample in the N query sample triples according to the service query sample vector in the target query sample vector triplet and each labeling service positive sample vector in the N query sample vector triples;
determining a third similarity between the f-th service query sample and each labeling service negative sample in the N query sample triples according to the service query sample vector in the target query sample vector triplet and each labeling service negative sample vector in the N query sample vector triples;
performing weight adjustment on the N second similarities according to the N positive weight parameters in the positive weight parameter set corresponding to the f-th query sample triplet to obtain N first weight similarities;
performing weight adjustment on the N third similarities according to the N negative weight parameters in the negative weight parameter set corresponding to the f-th query sample triplet to obtain N second weight similarities;
and determining a loss function value corresponding to the f service query sample according to the first similarity, the N first weight similarities and the N second weight similarities.
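One way to read the loss computation above is as a weighted contrastive objective. The sketch below is an assumption rather than the claimed formula: it uses dot-product similarity, applies each weight to the exponentiated similarity, and combines the terms in an InfoNCE-style ratio. The names `weighted_contrastive_loss` and `temperature` are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weighted_contrastive_loss(f, vector_triples, pos_weights, neg_weights, temperature=1.0):
    """Loss for the f-th query sample over N (query, positive, negative) vector triples."""
    query_vec, own_positive_vec, _ = vector_triples[f]
    # First similarity: query against its own positive sample.
    first = dot(query_vec, own_positive_vec) / temperature
    # Second and third similarities: query against every positive and every negative.
    second = [dot(query_vec, p) / temperature for _, p, _ in vector_triples]
    third = [dot(query_vec, n) / temperature for _, _, n in vector_triples]
    # Weight adjustment (assumed form: weight times exponentiated similarity).
    weighted_pos = sum(w * math.exp(s) for w, s in zip(pos_weights, second))
    weighted_neg = sum(w * math.exp(s) for w, s in zip(neg_weights, third))
    return -math.log(math.exp(first) / (weighted_pos + weighted_neg))
```

In this reading, setting a weight to zero removes the corresponding sample from the denominator, so it no longer pushes the query vector away from that sample.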
8. The method of claim 1, wherein the initial service query sample set comprises L unlabeled service query samples; L is a positive integer greater than or equal to M;
the obtaining the service category rough prediction information corresponding to the initial service inquiry sample set includes:
Obtaining a target service standardization model; the target service standardization model is obtained by training an initial service standardization model based on H service class coarse clustering sets; h is a positive integer;
and respectively carrying out business category rough prediction processing on the L unlabeled business query samples through the target business standardization model to obtain business category rough prediction information respectively corresponding to the L unlabeled business query samples, and taking the L sample business category rough prediction information as business category rough prediction information corresponding to the initial business query sample set.
9. The method as recited in claim 8, further comprising:
acquiring at least two initial service class samples;
performing screening and coarse clustering on the at least two initial business category samples to obtain H business category coarse clustering sets;
constructing P business category sample sets according to the H business category coarse clustering sets; P is a positive integer; each business category sample set comprises H business category sample pairs; the business category samples in the business category sample pairs belonging to the same business category sample set respectively belong to different business category coarse clustering sets, and each business category sample pair comprises a business category sample and the business category positive sample corresponding to that business category sample; the business category positive sample is the business category sample with the highest frequency in the business category coarse clustering set to which the business category sample belongs;
And carrying out iterative training on the initial service standardization model according to the P service class sample sets to obtain a target service standardization model.
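A minimal sketch of the sample set construction described above follows. Drawing the business category sample of each pair at random from its cluster is an assumption, as are the names `build_sample_sets`, `frequency` and `seed`; only the choice of the highest-frequency member as the positive sample comes from the claim.

```python
import random

def build_sample_sets(clusters, frequency, num_sets, seed=0):
    """Build num_sets (P) sample sets, each holding one pair per cluster (H pairs).

    clusters: list of H lists of business category samples.
    frequency: dict mapping each sample to its frequency.
    """
    rng = random.Random(seed)
    # Positive sample of a cluster: its highest-frequency member.
    positives = [max(cluster, key=lambda s: frequency[s]) for cluster in clusters]
    sample_sets = []
    for _ in range(num_sets):
        # One (sample, positive) pair per cluster, so the samples within a
        # set come from different coarse clustering sets.
        pairs = [(rng.choice(cluster), positive)
                 for cluster, positive in zip(clusters, positives)]
        sample_sets.append(pairs)
    return sample_sets
```

Because each pair in a set comes from a different cluster, the positive samples of the other pairs in the same set can serve as negatives during training.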
10. The method of claim 9, wherein performing screening and coarse clustering on the at least two initial service class samples to obtain H service class coarse clustering sets comprises:
performing de-duplication processing on the at least two initial business category samples to obtain X de-duplication business category samples and the frequencies respectively corresponding to the X de-duplication business category samples; the frequency corresponding to one de-duplication service class sample is the number of initial service class samples, among the at least two initial service class samples, that are identical to that de-duplication service class sample; X is a positive integer;
acquiring, from the X de-duplication service class samples, the de-duplication service class samples whose frequency is greater than or equal to a frequency threshold, to obtain Y high-frequency service class samples; Y is a positive integer less than or equal to X;
filtering and updating the Y high-frequency business class samples according to a standard business class dictionary, a stop word dictionary and an auxiliary business class dictionary to obtain Z business class samples; z is a positive integer less than or equal to Y;
And performing coarse clustering processing on the Z business category samples to obtain H business category coarse clustering sets.
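The de-duplication and frequency screening steps can be sketched with the Python standard library. The function name and the use of exact string equality to decide that two samples are the same are assumptions.

```python
from collections import Counter

def high_frequency_samples(initial_samples, frequency_threshold):
    """De-duplicate samples and keep those meeting the frequency threshold.

    Returns a dict mapping each kept sample to its frequency.
    """
    # Counter both de-duplicates and records how often each sample occurs.
    frequency = Counter(initial_samples)
    return {sample: count for sample, count in frequency.items()
            if count >= frequency_threshold}
```

For instance, with a threshold of 2, a sample that occurs only once in the initial samples is screened out before the dictionary filtering step.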
11. The method of claim 10, wherein the filtering and updating the Y high-frequency service class samples according to the standard service class dictionary, the stop word dictionary and the auxiliary service class dictionary to obtain Z service class samples comprises:
traversing the Y high-frequency service class samples, and sequentially acquiring an e-th high-frequency service class sample; e is a positive integer less than or equal to Y;
matching the e-th high-frequency business category sample with an auxiliary business category dictionary;
if the e-th high-frequency business class sample fails to match the auxiliary business class dictionary, matching the e-th high-frequency business class sample with the stop word dictionary;
if the e-th high-frequency business class sample is successfully matched with the stop word dictionary, taking the stop word in the stop word dictionary that matches the e-th high-frequency business class sample as a target stop word, and removing the target stop word from the e-th high-frequency business class sample to obtain an e-th transition high-frequency business class sample;
if the e-th high-frequency business class sample fails to match the stop word dictionary, taking the e-th high-frequency business class sample as the e-th transition high-frequency business class sample;
Matching the e-th transition high-frequency business class sample with a standard business class dictionary;
if the e-th transition high-frequency business class sample fails to match with the standard business class dictionary, the e-th transition high-frequency business class sample is used as a business class sample;
and when the Y high-frequency service class samples are traversed, Z service class samples are obtained.
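The filtering flow above can be sketched as below. Several details are assumptions because the claim does not spell them out: matching against the auxiliary and standard dictionaries is treated as exact membership, stop word matching as substring containment, and samples that match the auxiliary dictionary or the standard dictionary are simply dropped.

```python
def filter_samples(high_freq_samples, standard_dict, stop_words, auxiliary_dict):
    """Return the business class samples that survive dictionary filtering."""
    kept = []
    for sample in high_freq_samples:
        if sample in auxiliary_dict:
            continue  # ASSUMPTION: auxiliary-dictionary matches are discarded
        # Remove any matched stop word to get the transition sample; if no
        # stop word matches, the sample is unchanged.
        transition = sample
        for word in stop_words:
            transition = transition.replace(word, "")
        transition = transition.strip()
        # Keep only transition samples that are not already standard classes.
        if transition and transition not in standard_dict:
            kept.append(transition)
    return kept
```

With a hypothetical stop word such as "suspected ", the sample "suspected gastritis" becomes the transition sample "gastritis" and is kept if "gastritis" is absent from the standard dictionary.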
12. The method of claim 10, wherein the performing coarse clustering processing on the Z service class samples to obtain H service class coarse clustering sets comprises:
acquiring the B business class samples with the highest frequency from the Z business class samples as B seed business class samples, and adding the B seed business class samples to an initial seed business class sample set;
taking the business class samples that remain after the B seed business class samples are removed from the Z business class samples as iterative business class samples;
performing iterative coarse clustering on the initial seed service class sample set according to the iterative service class samples to obtain a coarse clustering seed service class sample set; the coarse cluster seed service class sample set comprises H seed service class samples; each seed traffic class sample is associated with a similar traffic class sample;
And respectively constructing coarse clustering sets according to each seed business category sample in the coarse clustering seed business category sample set and similar business category samples associated with each seed business category sample to obtain H business category coarse clustering sets.
13. The method of claim 12, wherein the performing iterative coarse clustering processing on the initial seed service class sample set according to the iterative service class samples to obtain a coarse clustering seed service class sample set comprises:
sequentially acquiring a d-th iterative business class sample from the iterative business class samples, wherein d is a positive integer less than or equal to Z-B;
calculating the distance similarity between the d-th iterative business class sample and each seed business class sample in the (d-1)-th round iterative seed business class sample set; when d is 1, the (d-1)-th round iterative seed business class sample set is the initial seed business class sample set;
if the maximum distance similarity among the distance similarities is smaller than a target similarity threshold, adding the d-th iterative business class sample to the (d-1)-th round iterative seed business class sample set to obtain a d-th round iterative seed business class sample set, and continuing to acquire a (d+1)-th iterative business class sample to perform coarse clustering processing on the d-th round iterative seed business class sample set, until the (Z-B)-th round iterative seed business class sample set is taken as the coarse clustering seed business class sample set;
if the maximum distance similarity among the distance similarities is greater than or equal to the target similarity threshold, determining the d-th iterative business class sample as a similar business class sample associated with the seed business class sample corresponding to the maximum distance similarity;
taking the (d-1)-th round iterative seed business class sample set as the d-th round iterative seed business class sample set, and continuing to acquire the (d+1)-th iterative business class sample to perform coarse clustering processing on the d-th round iterative seed business class sample set, until the (Z-B)-th round iterative seed business class sample set is taken as the coarse clustering seed business class sample set.
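The seed-based coarse clustering of claims 12 and 13 can be sketched as follows. The claims do not name the distance similarity measure; `difflib.SequenceMatcher` from the Python standard library is used here only as an assumed stand-in, and the function names are hypothetical. The input is assumed to be de-duplicated, sorted by descending frequency, and to contain at least `num_seeds` samples with `num_seeds` of at least 1.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    # ASSUMPTION: string-matching ratio stands in for the distance similarity.
    return SequenceMatcher(None, a, b).ratio()

def coarse_cluster(samples_by_frequency, num_seeds, threshold):
    """Return coarse clusters, each a list starting with its seed sample."""
    seeds = list(samples_by_frequency[:num_seeds])  # B highest-frequency seeds
    members = {seed: [] for seed in seeds}
    for sample in samples_by_frequency[num_seeds:]:  # Z-B iterative samples
        best_seed = max(seeds, key=lambda s: similarity(sample, s))
        if similarity(sample, best_seed) < threshold:
            # Not close to any seed: the sample becomes a new seed.
            seeds.append(sample)
            members[sample] = []
        else:
            # Close enough: attach it to the most similar seed.
            members[best_seed].append(sample)
    return [[seed] + members[seed] for seed in seeds]
```

The number of returned clusters corresponds to H, which is at least B and grows by one each time an iterative sample falls below the threshold for every existing seed.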
14. The method of claim 10, wherein the iteratively training the initial service standardization model according to the P service class sample sets to obtain a target service standardization model comprises:
sequentially acquiring a q-th business category sample set from the P business category sample sets; q is a positive integer less than or equal to P;
taking the H service class sample pairs in the q-th service class sample set as H to-be-processed service class sample pairs;
performing feature coding processing on the H to-be-processed service class sample pairs through a (q-1)-th round iterative service standardization model to obtain H service class vector pairs; one service class vector pair comprises a service class sample vector and a service class positive sample vector; when q=1, the (q-1)-th round iterative service standardization model is the initial service standardization model;
determining loss function values respectively corresponding to the service class samples included in the H to-be-processed service class sample pairs according to the service class sample vector and the service class positive sample vector in each service class vector pair and the negative sample relation between different service class vector pairs;
training the (q-1)-th round iterative service standardization model according to the H loss function values to obtain a q-th round iterative service standardization model;
if q is equal to H, the q-th round of iterative service standardization model is used as a target service standardization model;
if q is smaller than H, continuing to acquire a (q+1) th service class sample set, and training the iterative service standardization model of the (q) th round through the (q+1) th service class sample set.
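The loss step of this claim uses the negative sample relation between different service class vector pairs. One common way to realize such a relation is an in-batch negative softmax loss, sketched below. This is an assumed interpretation, not the claimed formula, and the dot-product similarity and the name `in_batch_losses` are hypothetical.

```python
import math

def in_batch_losses(vector_pairs):
    """Return one loss per (sample_vec, positive_vec) pair.

    The positive vectors of the other pairs act as negatives for each sample.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    losses = []
    for i, (sample_vec, _) in enumerate(vector_pairs):
        # Score the sample against every positive vector in the set.
        scores = [dot(sample_vec, positive_vec) for _, positive_vec in vector_pairs]
        log_sum = math.log(sum(math.exp(s) for s in scores))
        # Cross-entropy with the sample's own positive as the correct target.
        losses.append(log_sum - scores[i])
    return losses
```

A set of H pairs yields H loss values, matching the H loss function values used to train the model in each round.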
15. The method as recited in claim 8, further comprising:
acquiring target service query information;
performing service guidance prediction processing on the target service inquiry information through the target service guidance model to obtain a target service class label corresponding to the target service inquiry information;
acquiring real business category information matched with the target business query information;
carrying out business category rough prediction processing on the real business category information through the target business standardization model to obtain a real business category label corresponding to the real business category information;
And generating real business guiding information according to the real business category information corresponding to the real business category label which is the same as the target business category label, and displaying the real business guiding information.
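The inference flow of this claim can be sketched with the two trained models passed in as plain callables. The names `real_guidance`, `predict_label` and `normalize_label` are hypothetical stand-ins for the target service guide model and the target service standardization model; no particular model API is implied.

```python
def real_guidance(query, real_categories, predict_label, normalize_label):
    """Return the real business categories whose label matches the query's label.

    predict_label: callable standing in for the target service guide model.
    normalize_label: callable standing in for the target service standardization model.
    """
    target_label = predict_label(query)
    # Keep real categories whose standardized label equals the target label.
    return [category for category in real_categories
            if normalize_label(category) == target_label]
```

In a medical guidance setting, for example, the query could be a symptom description and the real categories the departments actually offered by a given institution.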
16. A data processing apparatus, comprising:
the first acquisition module is used for acquiring the business category rough prediction information corresponding to the initial business inquiry sample set;
the marking screening module is used for respectively configuring marking service positive sample sets and marking service negative sample sets for M to-be-processed service query samples in the initial service query sample sets according to the service category rough prediction information; m is a positive integer;
the sample processing module is used for carrying out combined pairing processing on the M to-be-processed service query samples, the M labeling service positive sample sets and the M labeling service negative sample sets to obtain N query sample triples, and determining positive weight parameter sets and negative weight parameter sets respectively corresponding to the N query sample triples according to the N query sample triples, the M labeling service positive sample sets and the M labeling service negative sample sets; each query sample triplet comprises a service query sample belonging to the M service query samples to be processed, a labeling service positive sample belonging to the M labeling service positive sample sets, and a labeling service negative sample belonging to the M labeling service negative sample sets; n is a positive integer greater than or equal to M;
The first training module is used for training the initial service guide model according to the N query sample triples, N positive weight parameters in each positive weight parameter set and N negative weight parameters in each negative weight parameter set to obtain a target service guide model; the target service guide model is used for predicting service class labels corresponding to service inquiry information; each positive weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service positive sample on the initial service guide model; each negative weight parameter is used for controlling the training influence of the similarity between a service query sample and a labeling service negative sample on the initial service guide model.
17. A computer device, comprising: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing program code, and the processor is used for invoking the program code to perform the method of any one of claims 1-15.
18. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program adapted to be loaded by a processor and to perform the method of any of claims 1-15.
CN202211606585.2A 2022-12-12 2022-12-12 Data processing method, device, equipment and readable storage medium Active CN115858886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211606585.2A CN115858886B (en) 2022-12-12 2022-12-12 Data processing method, device, equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211606585.2A CN115858886B (en) 2022-12-12 2022-12-12 Data processing method, device, equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN115858886A CN115858886A (en) 2023-03-28
CN115858886B true CN115858886B (en) 2024-02-27

Family

ID=85672840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211606585.2A Active CN115858886B (en) 2022-12-12 2022-12-12 Data processing method, device, equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115858886B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116701972B (en) * 2023-08-09 2023-11-24 腾讯科技(深圳)有限公司 Service data processing method, device, equipment and medium
CN116936058A (en) * 2023-09-14 2023-10-24 北京健康有益科技有限公司 Intelligent diagnosis guiding method and system based on deep learning and knowledge graph

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229555A (en) * 2017-12-29 2018-06-29 深圳云天励飞技术有限公司 Sample weights distribution method, model training method, electronic equipment and storage medium
CN109241366A (en) * 2018-07-18 2019-01-18 华南师范大学 A kind of mixed recommendation system and method based on multitask deep learning
CN110135459A (en) * 2019-04-15 2019-08-16 天津大学 A kind of zero sample classification method based on double triple depth measure learning networks
CN110598006A (en) * 2019-09-17 2019-12-20 南京医渡云医学技术有限公司 Model training method, triplet embedding method, apparatus, medium, and device
CN110796178A (en) * 2019-10-10 2020-02-14 支付宝(杭州)信息技术有限公司 Decision model training method, sample feature selection method, device and electronic equipment
CN113836885A (en) * 2020-06-24 2021-12-24 阿里巴巴集团控股有限公司 Text matching model training method, text matching device and electronic equipment
CN115130711A (en) * 2021-03-26 2022-09-30 腾讯科技(深圳)有限公司 Data processing method and device, computer and readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11605019B2 (en) * 2019-05-30 2023-03-14 Adobe Inc. Visually guided machine-learning language model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
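Predicting gene function with positive and unlabeled examples; Yiming Cheng et al.; 2009 IEEE International Conference on Granular Computing; pp. 1-2 *
A survey of the application of deep learning methods in point-of-interest recommendation (深度学习方法在兴趣点推荐中的应用研究综述); Tang Jiaxin (汤佳欣) et al.; Computer Engineering (《计算机工程》); Vol. 48, No. 1; pp. 12-15 *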

Also Published As

Publication number Publication date
CN115858886A (en) 2023-03-28

Similar Documents

Publication Publication Date Title
CN111316281B (en) Semantic classification method and system for numerical data in natural language context based on machine learning
US11232365B2 (en) Digital assistant platform
CN111538894B (en) Query feedback method and device, computer equipment and storage medium
CN115858886B (en) Data processing method, device, equipment and readable storage medium
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
US9996604B2 (en) Generating usage report in a question answering system based on question categorization
US8793254B2 (en) Methods and apparatus for classifying content
CN112232065B (en) Method and device for mining synonyms
WO2023029506A1 (en) Illness state analysis method and apparatus, electronic device, and storage medium
US20220284174A1 (en) Correcting content generated by deep learning
CN109599187A (en) A kind of online interrogation point examines method, server, terminal, equipment and medium
WO2021120588A1 (en) Method and apparatus for language generation, computer device, and storage medium
US20220114346A1 (en) Multi case-based reasoning by syntactic-semantic alignment and discourse analysis
CN111881292B (en) Text classification method and device
CN114078597A (en) Decision trees with support from text for healthcare applications
US11532387B2 (en) Identifying information in plain text narratives EMRs
CN113657086B (en) Word processing method, device, equipment and storage medium
Jing et al. Knowledge-enhanced attentive learning for answer selection in community question answering systems
Li et al. Towards knowledge-based tourism Chinese question answering system
Saranya et al. Intelligent medical data storage system using machine learning approach
US11281855B1 (en) Reinforcement learning approach to decode sentence ambiguity
CN113094476A (en) Risk early warning method, system, equipment and medium based on natural language processing
CN113569018A (en) Question and answer pair mining method and device
Chen et al. Detecting the association of health problems in consumer-level medical text
US20210133627A1 (en) Methods and systems for confirming an advisory interaction with an artificial intelligence platform

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40088351; Country of ref document: HK)

GR01 Patent grant