CN113569293A

CN113569293A - Similar user acquisition method, system, electronic device and medium

Info

Publication number: CN113569293A
Application number: CN202110922782.4A
Authority: CN
Inventors: 姚娟娟; 钟南山
Original assignee: Mingpinyun Beijing Data Technology Co Ltd
Current assignee: Shanghai Mingping Medical Data Technology Co ltd
Priority date: 2021-08-12
Filing date: 2021-08-12
Publication date: 2021-10-29
Anticipated expiration: 2041-08-12
Also published as: CN113569293B

Abstract

The invention belongs to the technical field of data processing and provides a method, a system, an electronic device and a medium for acquiring similar users. The method comprises: acquiring labeled text data of a target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set; acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set; acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result; and desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to obtain similar users. This addresses the difficulty of obtaining effective user data when the volume of user data is large.

Description

Similar user acquisition method, system, electronic device and medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method, a system, an electronic device, and a medium for acquiring similar users.

Background

With the development of society and the rapid advance of technology, users rely on more and more platforms, and the amount of data they generate grows exponentially. As the data volume grows, screening valid data out of the mass of data becomes harder. In addition, user-generated data includes much data related to user privacy, so user privacy data must be protected while valid data is screened out.

Disclosure of Invention

The invention provides a method, a system, an electronic device and a medium for acquiring similar users, which aim to solve the prior-art problem that effective user data is difficult to obtain because the volume of user data is large.

The method for acquiring the similar users comprises the following steps:

the method comprises the steps of obtaining labeled text data of a target field, forming a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;

acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;

acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;

desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to obtain similar users.

Optionally, the step of inputting the target desensitization data into the identification model to obtain similar users specifically includes:

inputting the target desensitization data into the identification model to obtain a target user category;

obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of users to be matched;

and acquiring the similarity between the target user and the user to be matched, and acquiring similar users according to the similarity.

Optionally, the step of obtaining the similarity between the target user and the user to be matched specifically includes:

acquiring text data of the user to be matched, and performing feature extraction on the text data to obtain first feature data;

performing the feature extraction on the text data of the target user to obtain second feature data;

and acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.

Optionally, the step of obtaining the similarity between the first feature data and the second feature data includes:

acquiring a characteristic category standard index;

and acquiring the similarity of the first characteristic data and the second characteristic data according to the preset characteristic category weight, the characteristic category standard index, the first characteristic subdata and the second characteristic subdata.

Optionally, the second model is a convolutional neural network, and the third model is Softmax.

Optionally, the step of establishing an identification model for user category identification according to the second sample data set specifically includes:

inputting the second sample data set into the convolutional neural network, extracting the features of the second sample data set, inputting the extracted features into Softmax, and classifying the extracted features.

Optionally, the step of obtaining text data of a sample user, inputting the text data of the sample user into the first model, and forming a second sample data set specifically includes:

acquiring text data of the sample user, and performing vectorization processing on the text data of the sample user to obtain a vector data set;

inputting the vector data set into the first model to obtain sample sensitive data;

and performing desensitization treatment on the text data of the sample user according to the sample sensitive data to form a second sample data set.

The invention also provides a system for acquiring similar users, which comprises:

the first model establishing module is used for acquiring labeled text data of a target field to form a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;

the identification model establishing module is used for acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;

the extraction result acquisition module is used for acquiring text data of a target user, inputting the text data into the first model and acquiring an extraction result;

and the similar user acquisition module is used for desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to acquire similar users.

The present invention also provides an electronic device comprising: a processor and a memory;

the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the similar user acquisition method.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the similar user acquisition method as described above.

The invention has the beneficial effects that: according to the method for acquiring the similar users, a first model for sensitive word extraction is established according to the labeled text data, then an identification model for user category identification is established according to a second sample data set, the first model is adopted to extract the sensitive words from the text data of the target user, target desensitization data is acquired according to the extraction result, and the target desensitization data is input into the identification model, so that the similar users are acquired. Sensitive word extraction is carried out on the text data of the target user through the first model, so that the protection of the privacy data of the target user is realized; similar users are obtained by inputting the target desensitization data into the identification model, so that the similar users of the target users are obtained, and subsequent related schemes or related data recommendation is facilitated.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic flow chart of a method for acquiring similar users according to an embodiment of the present invention;

FIG. 2 is a schematic flow chart of a method for obtaining similarity between a target user and a user to be matched according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a system for acquiring similar users according to an embodiment of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.

In order to explain the technical means of the present invention, the following description will be given by way of specific examples.

In the field of medical health, conventional medical service systems suffer from problems such as scarce and unevenly distributed medical resources, limited medical knowledge among the public, poor awareness of when and how to seek medical care, and strained doctor-patient relationships. As a result, online inquiry platforms have developed rapidly and spread widely in recent years. However, online inquiry often has the following problems: users' private data is leaked, so that personal private information is stolen; users often find it difficult to describe their own condition accurately and comprehensively; and users' questions cannot be answered in time. To identify the semantics and intent of a user, to avoid long waits and leakage of user privacy data, and to make use of historical cases that the online inquiry platform has already resolved with high quality, the invention provides a similar user acquisition method.

First embodiment

Fig. 1 is a flowchart illustrating a similar user obtaining method according to an embodiment of the present invention.

As shown in fig. 1, the similar user obtaining method includes steps S110 to S140:

S110, acquiring labeled text data of the target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;

S120, acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set;

S130, acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;

S140, desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to obtain similar users.

In step S110 of this embodiment, the target field includes at least the medical field. Taking the medical field as an example, the labeled text data of the target field is derived from publicly available medical text data at home and abroad, such as periodicals, inquiry records, doctors' orders and electronic medical records in the medical field; paper disease diagnosis records can also be entered by scanning or other means to form labeled text data in the medical field. The first sample data set is obtained as follows: the acquired labeled text data of the target field is pre-trained with a word vector pre-training model, and the first sample data set is obtained from the pre-trained text data; the word vector pre-training model includes, but is not limited to, BERT. The first model is a named entity recognition model, which includes, but is not limited to, BiLSTM-CRF. The first model is trained with the first sample data set, and its output is the user sensitive words. User sensitive words include, but are not limited to, identification numbers, names, addresses, mobile phone numbers, bank card numbers and zip codes.
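A named entity model of this kind usually emits one tag per token, and those tags still have to be grouped into sensitive-word spans. The sketch below shows that grouping step only. It assumes a BIO tagging scheme and the hypothetical labels `NAME` and `PHONE`; the source does not state which tagging scheme the first model uses, so this is an illustration rather than the described implementation.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group per-token BIO tags into (sensitive_word, label) pairs.

    Assumes tags of the form "B-X", "I-X" or "O" (hypothetical scheme).
    Tokens are joined without spaces, which suits character-level tokens.
    """
    entities: List[Tuple[str, str]] = []
    current: List[str] = []
    label: Optional[str] = None

    for token, tag in zip(tokens, tags):
        # An "I-" tag extends the open span only if the label matches.
        if tag.startswith("I-") and label == tag[2:]:
            current.append(token)
            continue
        # Any other tag closes the span collected so far.
        if current and label is not None:
            entities.append(("".join(current), label))
        if tag.startswith("B-"):
            current, label = [token], tag[2:]
        else:
            current, label = [], None

    # Close a span that runs to the end of the sequence.
    if current and label is not None:
        entities.append(("".join(current), label))
    return entities
```

The resulting words can then be passed to whatever desensitization routine is used downstream.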

In step S120 of this embodiment, the text data of the sample user is derived from the electronic medical records and medical texts of the sample user. The second sample data set is obtained as follows: the text data of the sample user is acquired and vectorized to obtain a vector data set; the vector data set is input into the first model to obtain sample sensitive data; and the text data of the sample user is desensitized according to the sample sensitive data to form the second sample data set. The word vector pre-training model may be used to vectorize the text data of the sample user. Desensitization may consist of deleting the sample sensitive data from the text data of the sample user, or of replacing the sample sensitive data in that text with characters. Because the sensitive words of the sample user are extracted by the first model and then deleted or replaced, leakage of the sample user's privacy data is effectively prevented and theft of the sample user's personal private information is avoided.
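The two desensitization options described above, deletion and character replacement, can be sketched as follows. The function name `desensitize`, the `*` mask character and the `delete` flag are assumptions made for illustration; the source only says that sensitive data is deleted or replaced with characters.

```python
from typing import Iterable


def desensitize(
    text: str,
    sensitive_words: Iterable[str],
    mask: str = "*",
    delete: bool = False,
) -> str:
    """Delete each sensitive word, or replace it with mask characters."""
    # Process longer words first so a short word cannot split a longer one.
    for word in sorted(set(sensitive_words), key=len, reverse=True):
        if not word:
            continue  # an empty string would match everywhere
        replacement = "" if delete else mask * len(word)
        text = text.replace(word, replacement)
    return text
```

Masking with a run of the same length keeps the text layout intact, while deletion removes any hint of the original length; which is preferable depends on how the downstream model was trained.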

In an embodiment, the identification model comprises a second model for feature extraction and a third model for predicting user categories. The second model is a convolutional neural network model, and the third model is Softmax. The second sample data set is input into the convolutional neural network, which extracts its features; the extracted features are then input into Softmax and classified. Specifically, part of the data in the second sample data set is labeled with disease categories (user categories), and the labeled second sample data set is then used to train the convolutional neural network model and Softmax: after forward propagation, a cross-entropy loss function determines the loss of the final classification, and the back-propagation algorithm iteratively updates the weights until the loss value of the whole convolutional neural network model converges, yielding the identification model. The convolutional neural network comprises an input layer, a convolutional layer containing a plurality of convolution kernels, a pooling layer, two Dropout layers, two fully connected layers and an output layer, and the activation function is the ReLU function.
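The classification head and its training loss can be illustrated in a few lines. This is a minimal sketch of Softmax and cross-entropy only, written with the standard library; it assumes the convolutional network has already produced one raw score (logit) per user category, and the function names are hypothetical.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_index: int) -> float:
    """Negative log-probability assigned to the labeled category."""
    return -math.log(probs[true_index])


def predict_category(logits: List[float], categories: List[str]) -> str:
    """Return the category whose Softmax probability is highest."""
    probs = softmax(logits)
    return categories[probs.index(max(probs))]
```

During training, the cross-entropy value is what back-propagation would drive down; at inference time only `predict_category` is needed.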

In step S130 of this embodiment, the text data of the target user is derived from the electronic medical record or medical text of the target user. Inputting the acquired text data of the target user into a first model for extracting sensitive words, and acquiring an extraction result; and the extraction result is the sensitive words of the target user.

In step S140 of this embodiment, the text data of the target user is desensitized according to the extraction result, in the same manner as the text data of the sample user, which is not described again here. Desensitizing the sensitive words of the target user in this way effectively prevents leakage of the target user's privacy data and avoids theft of the target user's personal private information.

In one embodiment, the step of inputting the target desensitization data into the identification model to obtain similar users comprises: inputting the target desensitization data into the identification model to obtain a target user category; obtaining the sample users corresponding to the target user category to obtain a plurality of users to be matched; and acquiring the similarity between the target user and each user to be matched, and acquiring the similar user according to that similarity. The advantage is that similar users can be found by computing similarity only against the users to be matched rather than against all sample users, which avoids invalid data processing and improves the processing efficiency of medical data.
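The efficiency argument rests on narrowing the candidate set by category before any similarity is computed. A minimal sketch of that narrowing step is given below; the identifiers `group_by_category` and `users_to_match`, and the use of string user IDs, are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_category(
    labeled_users: Iterable[Tuple[str, str]],
) -> Dict[str, List[str]]:
    """Index sample users by their user category.

    Each input item is a (user_id, category) pair.
    """
    index: Dict[str, List[str]] = defaultdict(list)
    for user_id, category in labeled_users:
        index[category].append(user_id)
    return dict(index)


def users_to_match(index: Dict[str, List[str]], target_category: str) -> List[str]:
    """Return only the sample users that share the target user's category."""
    return list(index.get(target_category, []))
```

With such an index, the number of similarity computations scales with the size of one category rather than with the total number of sample users.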

In an embodiment, a specific method for obtaining the similarity between the target user and the user to be matched is shown in FIG. 2, which is a schematic flow chart of that method according to an embodiment of the present invention.

As shown in fig. 2, obtaining the similarity between the target user and the user to be matched may include the following steps S210 to S230:

S210, acquiring text data of a user to be matched, and performing feature extraction on the text data to obtain first feature data;

S220, performing feature extraction on the text data of the target user to obtain second feature data;

S230, acquiring the similarity of the first feature data and the second feature data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first feature data and the second feature data.

In an embodiment, before step S210, the method further includes: obtaining, according to the user category of the user to be matched and the user category of the target user, the disease evaluation data corresponding to that user category (disease category). Feature extraction on the text data of the user to be matched and on the text data of the target user then means extracting the disease evaluation data (the first feature data and the second feature data, respectively) from those texts. The disease evaluation data includes, but is not limited to, heart rate data, blood pressure data, blood glucose data and blood oxygen saturation data. The first feature data comprises a plurality of first feature sub-data and a plurality of feature categories, and the second feature data comprises a plurality of second feature sub-data and a plurality of feature categories; the number of first feature sub-data equals the number of second feature sub-data, and the feature categories of the first feature data and the second feature data are the same. Feature categories include, but are not limited to, heart rate, blood pressure, blood glucose and blood oxygen saturation.

In an embodiment, obtaining the similarity between the first feature data and the second feature data includes: acquiring a feature category standard index; and obtaining the similarity of the first feature data and the second feature data according to the preset feature category weight, the feature category standard index, the first feature sub-data and the second feature sub-data. The feature category standard index may be obtained directly from published medical data.

Specifically, the mathematical expression of the similarity of the first feature data and the second feature data is:

wherein S(X1, X2) is the similarity of the first feature data and the second feature data; X1 is the first feature data, X2 is the second feature data, n is the number of feature categories corresponding to the first feature data and the second feature data, i is the index of the feature category, a_i is the first feature sub-data corresponding to feature category i, b_i is the second feature sub-data corresponding to feature category i, l_i is the feature category standard index corresponding to feature category i, and m_i is the preset feature category weight corresponding to feature category i. The preset feature category weight can be determined according to the user category corresponding to the target user and the user to be matched. Specifically, it may be determined according to the correlation between a feature category corresponding to the user category and the disease corresponding to the user category: the stronger the correlation, the greater the preset feature category weight. This correlation can be obtained directly from public medical knowledge.
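The expression itself is not reproduced in this text, so the following sketch does not claim to be the patented formula. It only shows one plausible way to combine the four named quantities, under an assumed form: each per-category difference |a_i - b_i| is scaled by its standard index l_i, averaged with the weights m_i, and subtracted from 1. All parameter names are hypothetical.

```python
from typing import Sequence


def weighted_similarity(
    first: Sequence[float],      # a_i: first feature sub-data
    second: Sequence[float],     # b_i: second feature sub-data
    standards: Sequence[float],  # l_i: feature category standard indices
    weights: Sequence[float],    # m_i: preset feature category weights
) -> float:
    """Similarity in [0, 1] under an ASSUMED formula, not the source's own.

    Assumed form: 1 - sum(m_i * |a_i - b_i| / |l_i|) / sum(m_i), floored at 0.
    """
    n = len(first)
    if n == 0 or not (len(second) == len(standards) == len(weights) == n):
        raise ValueError("all inputs must share the same non-zero length n")
    if any(s == 0 for s in standards):
        raise ValueError("standard indices must be non-zero")
    total_weight = sum(weights)
    if total_weight <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    # Weighted mean of the differences, each scaled by its standard index.
    distance = sum(
        w * abs(a - b) / abs(s)
        for a, b, s, w in zip(first, second, standards, weights)
    ) / total_weight
    return max(0.0, 1.0 - distance)
```

Under this assumed form, identical feature data gives a similarity of 1, and a feature category with a larger weight pulls the score down more for the same scaled difference, which matches the stated intent that more disease-relevant categories count for more.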

In an embodiment, the maximum similarity is determined among the similarities between the target user and all the users to be matched, and the user to be matched corresponding to the maximum similarity is taken as the similar user. Once the similar user is confirmed, the electronic medical records or medical texts of the similar user can be fed back to the target user, which avoids a long wait for the user; they can also be fed back to a clinician as a diagnostic reference, which can improve the clinician's diagnostic efficiency.
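Selecting the similar user is then a maximum search over the computed similarities. The sketch below assumes the similarities are held in a dictionary keyed by a hypothetical user ID; how ties are broken is not specified in the source, and here the first user reaching the maximum is returned.

```python
from typing import Dict, Optional


def most_similar_user(similarities: Dict[str, float]) -> Optional[str]:
    """Return the user to be matched with the highest similarity.

    Returns None when there are no users to be matched.
    """
    if not similarities:
        return None
    return max(similarities, key=lambda user_id: similarities[user_id])
```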

Second embodiment

Based on the same inventive concept as the method in the first embodiment, correspondingly, the embodiment also provides an acquisition system of similar users.

Fig. 3 is a schematic diagram of a system for acquiring similar users according to the present invention.

As shown in fig. 3, the system 3 comprises: a first model establishing module 31, an identification model establishing module 32, an extraction result acquisition module 33 and a similar user acquisition module 34.

The first model establishing module is used for acquiring labeled text data of the target field, forming a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;

In some exemplary embodiments, the similar user acquisition module comprises:

the user category acquisition unit is used for inputting the target desensitization data into the identification model to obtain a target user category;

the to-be-matched user obtaining unit is used for obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of to-be-matched users;

the similar user acquisition unit is used for acquiring the similarity between the target user and the user to be matched and acquiring the similar user according to the similarity between the target user and the user to be matched;

acquiring text data of a user to be matched, and performing feature extraction on the text data to obtain first feature data;

performing feature extraction on the text data of the target user to obtain second feature data;

In some exemplary embodiments, the similar user acquiring unit includes:

the index acquisition unit is used for acquiring a characteristic category standard index;

and the similarity obtaining subunit is configured to obtain a similarity between the first feature data and the second feature data according to a preset feature category weight, a feature category standard index, the first feature sub-data, and the second feature sub-data.

In some exemplary embodiments, the recognition model building module comprises:

and inputting the second sample data set into a second model, extracting the features of the second sample data set, inputting the extracted features into a third model, and classifying the extracted features.

The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.

The present embodiment also provides an electronic device, including: a processor and a memory;

the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the method in the embodiment.

The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

The electronic device provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface, wherein the memory and the communication interface are connected with the processor and the transceiver and are used for realizing mutual communication, the memory is used for storing a computer program, the communication interface is used for carrying out communication, and the processor and the transceiver are used for operating the computer program to enable the electronic device to execute the steps of the method.

In this embodiment, the memory may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.

The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

In the above-described embodiments, reference in the specification to "the present embodiment," "an embodiment," "another embodiment," "an example embodiment," or "other embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. Multiple occurrences of "the present embodiment," "an embodiment," "another embodiment," or "an example embodiment" do not necessarily all refer to the same embodiment.

Although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The foregoing embodiments are merely illustrative of the principles of the present invention and its efficacy, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims

1. A method for obtaining similar users, the method comprising:

2. The method for acquiring similar users according to claim 1, wherein the step of inputting the target desensitization data into the recognition model to acquire similar users specifically comprises:

and acquiring the similarity between a target user and a user to be matched, and acquiring a similar user according to the similarity between the target user and the user to be matched.

3. The method for acquiring similar users according to claim 2, wherein the step of acquiring the similarity between the target user and the user to be matched specifically comprises:

4. The method for obtaining similar users according to claim 3, wherein the first feature data includes a feature category and first feature sub-data, the second feature data includes a feature category and second feature sub-data, and the step of obtaining the similarity between the first feature data and the second feature data specifically includes:

acquiring a characteristic category standard index;

5. The similar user acquisition method according to claim 1, wherein the second model is a convolutional neural network, and the third model is Softmax.

6. The method for acquiring similar users according to claim 5, wherein the step of establishing an identification model for user category identification according to the second sample data set specifically includes:

7. The method according to claim 1, wherein the step of obtaining the text data of the sample user, inputting the text data of the sample user into the first model, and forming a second sample data set specifically includes:

8. A similar user acquisition system, the system comprising:

9. An electronic device comprising a processor, a memory, and a communication bus;

the communication bus is used for connecting the processor and the memory;

the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-7.

10. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any one of claims 1-7.