CN113569293A - Similar user acquisition method, system, electronic device and medium - Google Patents
- Publication number
- CN113569293A
- Authority
- CN
- China
- Prior art keywords
- data
- user
- model
- sample
- acquiring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/211—Selection of the most significant subset of features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Public Health (AREA)
- Computing Systems (AREA)
- Bioethics (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Molecular Biology (AREA)
- Mathematical Physics (AREA)
- Primary Health Care (AREA)
- Computer Security & Cryptography (AREA)
- Computer Hardware Design (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of data processing and provides a method, a system, an electronic device and a medium for acquiring similar users. The method comprises the following steps: acquiring labeled text data of a target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set; acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set; acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result; desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to obtain similar users. This addresses the difficulty of obtaining effective user data when the volume of user data is large.
Description
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, a system, an electronic device, and a medium for acquiring similar users.
Background
With the development of society and the rapid advance of technology, users rely on more and more platforms, and the amount of data they generate grows exponentially. As the data volume increases, screening valid data out of the mass of data becomes more difficult. In addition, the data generated by users includes a great deal of private data, so user privacy must be protected while valid data is screened out.
Disclosure of Invention
The invention provides a method, a system, electronic equipment and a medium for acquiring similar users, which aim to solve the problem of high difficulty in acquiring effective user data caused by large user data volume in the prior art.
The method for acquiring the similar users comprises the following steps:
acquiring labeled text data of a target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;
acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;
desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the recognition model to obtain similar users.
Optionally, the step of inputting the target desensitization data into the recognition model to obtain similar users specifically includes:
inputting the target desensitization data into the identification model to obtain a target user category;
obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of users to be matched;
and acquiring the similarity between the target user and the user to be matched, and acquiring similar users according to the similarity.
Optionally, the step of obtaining the similarity between the target user and the user to be matched specifically includes:
acquiring text data of the user to be matched, and performing feature extraction on the text data to obtain first feature data;
performing the feature extraction on the text data of the target user to obtain second feature data;
and acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.
Optionally, the step of obtaining the similarity between the first feature data and the second feature data includes:
acquiring a characteristic category standard index;
and acquiring the similarity of the first characteristic data and the second characteristic data according to the preset characteristic category weight, the characteristic category standard index, the first characteristic subdata and the second characteristic subdata.
Optionally, the second model is a convolutional neural network, and the third model is Softmax.
Optionally, the step of establishing an identification model for user category identification according to the second sample data set specifically includes:
inputting the second sample data set into the convolutional neural network, extracting the features of the second sample data set, inputting the extracted features into Softmax, and classifying the extracted features.
Optionally, the step of obtaining text data of a sample user, inputting the text data of the sample user into the first model, and forming a second sample data set specifically includes:
acquiring text data of the sample user, and performing vectorization processing on the text data of the sample user to obtain a vector data set;
inputting the vector data set into the first model to obtain sample sensitive data;
and performing desensitization treatment on the text data of the sample user according to the sample sensitive data to form a second sample data set.
The invention also provides a system for acquiring similar users, which comprises:
the first model establishing module is used for acquiring labeled text data of a target field to form a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;
the identification model establishing module is used for acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
the extraction result acquisition module is used for acquiring text data of a target user, inputting the text data into the first model and acquiring an extraction result;
and the similar user acquisition module is used for desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to acquire similar users.
The present invention also provides an electronic device comprising: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the similar user acquisition method.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the similar user acquisition method as described above.
The invention has the beneficial effects that: according to the method for acquiring the similar users, a first model for sensitive word extraction is established according to the labeled text data, then an identification model for user category identification is established according to a second sample data set, the first model is adopted to extract the sensitive words from the text data of the target user, target desensitization data is acquired according to the extraction result, and the target desensitization data is input into the identification model, so that the similar users are acquired. Sensitive word extraction is carried out on the text data of the target user through the first model, so that the protection of the privacy data of the target user is realized; similar users are obtained by inputting the target desensitization data into the identification model, so that the similar users of the target users are obtained, and subsequent related schemes or related data recommendation is facilitated.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method for acquiring similar users according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for obtaining similarity between a target user and a user to be matched according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a system for acquiring similar users in the embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
In the field of medical health, the conventional medical service system suffers from problems such as a shortage of medical resources, uneven distribution of medical resources, a lack of medical knowledge, poor awareness of when to seek medical care, and strained doctor-patient relationships, so online inquiry platforms have developed rapidly and spread widely in recent years. However, online inquiry often has the following problems: the user's private data is leaked, so that personal private information is stolen; users often find it difficult to describe their own condition accurately and comprehensively; and users' questions cannot be answered in time. The invention therefore provides a similar user acquisition method that identifies the semantics and purpose of the user, avoids long waits and leakage of the user's private data, and makes use of the high-quality historical cases already resolved on the online inquiry platform.
First embodiment
Fig. 1 is a flowchart illustrating a similar user obtaining method according to an embodiment of the present invention.
As shown in fig. 1, the similar user obtaining method includes steps S110 to S140:
S110, acquiring labeled text data of the target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;
S120, acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set;
S130, acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;
and S140, desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the recognition model to obtain similar users.
In step S110 of this embodiment, the target field includes at least the medical field. Taking labeled text data in the medical field as an example, the labeled text data of the target field is derived from publicly available medical text data at home and abroad, such as periodicals, inquiry records, doctors' orders and electronic medical records in the medical field; paper disease diagnosis records can also be entered by scanning or other means to form labeled text data in the medical field. The first sample data set is acquired as follows: the acquired labeled text data of the target field is pre-trained with a word vector pre-training model, and the first sample data set is obtained from the pre-trained text data; the word vector pre-training model includes, but is not limited to, BERT. The first model is a named entity recognition model, which includes, but is not limited to, BiLSTM-CRF. The first model is trained with the first sample data set, and its output is the user sensitive words. User sensitive words include, but are not limited to, identification number, name, address, mobile phone number, bank card number and zip code.
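As an illustration of how the output of a named entity recognition model can be turned into a list of sensitive words, the following minimal sketch decodes a tag sequence into entity spans. It assumes a BIO tagging scheme with hypothetical labels such as `B-NAME` and `I-PHONE`; the source does not specify the tag set, and the function name `decode_bio` is introduced here only for illustration.

```python
def decode_bio(tokens, tags):
    """Collect (entity_type, text) spans from parallel token and BIO tag lists.

    Assumption: tags follow the BIO scheme ("B-X" opens an entity of type X,
    "I-X" continues it, "O" is outside any entity). This scheme is not
    stated in the source and is used only for illustration.
    """
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close any entity that is still open.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag: close the open entity, if any.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Patient", "Zhang", "San", "phone", "13800000000"]
tags = ["O", "B-NAME", "I-NAME", "O", "B-PHONE"]  # hypothetical model output
sensitive = decode_bio(tokens, tags)
print(sensitive)
```

The decoded spans play the role of the "user sensitive words" that later steps use for desensitization.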
In step S120 of this embodiment, the text data of the sample user is derived from the electronic medical record and the medical text of the sample user. The second sample data set is obtained as follows: the text data of the sample user is acquired and vectorized to obtain a vector data set; the vector data set is input into the first model to obtain sample sensitive data; and the text data of the sample user is desensitized according to the sample sensitive data to form the second sample data set. The word vector pre-training model may be used to vectorize the text data of the sample user. Desensitizing the text data of the sample user according to the sample sensitive data may mean deleting the sample sensitive data from the text data, or replacing the sample sensitive data in the text data with characters. Because the sensitive words of the sample user are extracted by the first model and then desensitized, leakage of the sample user's private data is effectively prevented and theft of the sample user's personal private information is avoided.
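The two desensitization options described above (deleting the sensitive data, or replacing it with characters) can be sketched as follows. This is a minimal illustration only: the function name `desensitize`, the `*` mask character and the sample record are assumptions, and the list of sensitive words stands in for the output of the first model.

```python
def desensitize(text, sensitive_words, mask_char="*", delete=False):
    """Delete or mask every occurrence of each sensitive word in text."""
    # Handle longer words first so a shorter word cannot split a longer match.
    for word in sorted(set(sensitive_words), key=len, reverse=True):
        if not word:
            continue
        replacement = "" if delete else mask_char * len(word)
        text = text.replace(word, replacement)
    return text


record = "Patient Zhang San, phone 13800000000, reports chest pain."
extracted = ["Zhang San", "13800000000"]  # stand-in for the first model's output

masked = desensitize(record, extracted)
deleted = desensitize(record, extracted, delete=True)
print(masked)
print(deleted)
```

Masking keeps the length and layout of the record, while deletion shortens the text; which option suits the downstream identification model is a design choice the source leaves open.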
In an embodiment, the identification model comprises a second model for feature extraction and a third model for predicting user categories. The second model is a convolutional neural network model, and the third model is Softmax. The second sample data set is input into the convolutional neural network to extract its features, and the extracted features are input into Softmax for classification. Specifically, part of the data in the second sample data set is labeled with a disease category (user category), and the labeled second sample data set is then used to train the convolutional neural network model and Softmax: after forward propagation, a cross-entropy loss function determines the loss of the final classification, and the back propagation algorithm iteratively updates the weights until the loss value of the convolutional neural network model converges, yielding the identification model. The convolutional neural network comprises an input layer, a convolutional layer containing a plurality of convolution kernels, a pooling layer, two Dropout layers, two fully connected layers and an output layer, and the activation function is the ReLU function.
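The classification step can be made concrete with a small sketch of Softmax and the cross-entropy loss, written in plain Python so that it runs without a deep learning framework. The logits below are made-up numbers standing in for the output of the network's last fully connected layer; they are not taken from the source.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss for one sample: negative log-probability of the labeled class."""
    return -math.log(probs[true_index])


logits = [2.0, 0.5, -1.0]  # hypothetical scores for three disease categories
probs = softmax(logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
loss = cross_entropy(probs, true_index=0)
print(predicted, loss)
```

During training, this loss would be averaged over the labeled samples and minimized by back propagation; at prediction time only the index of the largest probability is needed to assign the user category.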
In step S130 of this embodiment, the text data of the target user is derived from the electronic medical record or medical text of the target user. Inputting the acquired text data of the target user into a first model for extracting sensitive words, and acquiring an extraction result; and the extraction result is the sensitive words of the target user.
In step S140 of this embodiment, the text data of the target user is desensitized according to the extraction result in the same way as the text data of the sample user, which is not described again here. This desensitizes the sensitive words of the target user, effectively prevents leakage of the target user's private data, and avoids theft of the target user's personal private information.
In one embodiment, the step of inputting the target desensitization data into the identification model to obtain similar users comprises: inputting the target desensitization data into the identification model to obtain a target user category; obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of users to be matched; and acquiring the similarity between the target user and each user to be matched, and acquiring the similar user according to that similarity. The advantage is that only the similarity between the target user and the users to be matched needs to be computed to obtain the similar user; computing the similarity between the target user and all sample users is avoided, which avoids invalid data processing and improves the processing efficiency of medical data.
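The narrowing step described above, in which only sample users of the predicted category become users to be matched, can be sketched as a simple filter. The record layout (dictionaries with `id` and `category` keys) and the category names are hypothetical and chosen only for illustration.

```python
def users_to_match(target_category, sample_users):
    """Keep only the sample users whose category equals the target category."""
    return [user for user in sample_users if user["category"] == target_category]


# Hypothetical sample users with already-known categories.
sample_users = [
    {"id": "u1", "category": "hypertension"},
    {"id": "u2", "category": "diabetes"},
    {"id": "u3", "category": "hypertension"},
]

candidates = users_to_match("hypertension", sample_users)
print([user["id"] for user in candidates])
```

Similarity is then computed only for the returned candidates rather than for every sample user, which is where the efficiency gain comes from.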
In an embodiment, a specific method for obtaining the similarity between the target user and the user to be matched is shown in fig. 2, which is a schematic flow chart of that method according to an embodiment of the present invention.
As shown in fig. 2, obtaining the similarity between the target user and the user to be matched may include the following steps S210 to S230:
S210, acquiring text data of a user to be matched, and performing feature extraction on the text data to obtain first feature data;
S220, performing feature extraction on the text data of the target user to obtain second feature data;
and S230, acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.
In an embodiment, before step S210, the method further includes: acquiring disease evaluation data corresponding to the user category (disease category) according to the user category of the user to be matched and the user category of the target user. Feature extraction on the text data of the user to be matched and the text data of the target user then means extracting the disease evaluation data (the first feature data and the second feature data) from those texts; the disease evaluation data includes, but is not limited to, heart rate data, blood pressure data, blood sugar data and blood oxygen saturation data. The first feature data comprises a plurality of first feature subdata and a plurality of feature categories, and the second feature data comprises a plurality of second feature subdata and a plurality of feature categories; the number of first feature subdata is the same as the number of second feature subdata, and the feature categories of the first feature data and the second feature data are also the same. Feature categories include, but are not limited to, heart rate, blood pressure, blood sugar and blood oxygen saturation.
In an embodiment, obtaining the similarity between the first feature data and the second feature data includes: acquiring a feature category standard index; and obtaining the similarity of the first feature data and the second feature data according to the preset feature category weight, the feature category standard index, the first feature subdata and the second feature subdata. The feature category standard index may be obtained directly from published medical data.
Specifically, the mathematical expression of the similarity of the first feature data and the second feature data is:
[equation image not reproduced in the source text]
wherein S(X1, X2) is the similarity of the first feature data and the second feature data; X1 is the first feature data; X2 is the second feature data; n is the number of feature categories corresponding to the first feature data and the second feature data; i is the index of the feature category; a_i is the first feature subdata corresponding to feature category i; b_i is the second feature subdata corresponding to feature category i; l_i is the feature category standard index corresponding to feature category i; and m_i is the preset feature category weight corresponding to feature category i. The preset feature category weight can be determined according to the user categories corresponding to the target user and the user to be matched. Specifically, it may be determined according to the correlation between a feature category corresponding to the user category and the disease corresponding to the user category, with the weight positively correlated with that correlation: the stronger the correlation, the greater the preset feature category weight. The correlation between the feature categories corresponding to the user categories and the diseases corresponding to the user categories can be obtained directly from public medical knowledge.
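Because the expression itself does not survive in this text, the following is only an illustrative sketch of one weighted similarity that uses the same inputs (the subdata a_i and b_i, the standard index l_i and the weight m_i). The particular form, a weighted average of 1 minus the difference scaled by l_i, is an assumption made for this example and should not be read as the formula of the invention; the function name and sample values are likewise hypothetical.

```python
def weighted_similarity(a, b, standard, weights):
    """Illustrative weighted similarity in [0, 1]; NOT the patent's formula.

    a, b      -- first and second feature subdata (a_i, b_i)
    standard  -- feature category standard indices (l_i), all positive
    weights   -- preset feature category weights (m_i), non-negative
    """
    if not a or not (len(a) == len(b) == len(standard) == len(weights)):
        raise ValueError("all inputs must be non-empty and of equal length")
    if any(l_i <= 0 for l_i in standard):
        raise ValueError("standard indices must be positive")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    score = 0.0
    for a_i, b_i, l_i, m_i in zip(a, b, standard, weights):
        # Difference relative to the standard index, capped at 1.
        deviation = min(1.0, abs(a_i - b_i) / l_i)
        score += m_i * (1.0 - deviation)
    return score / total_weight


# Hypothetical values: heart rate and systolic blood pressure.
s = weighted_similarity(a=[72, 120], b=[80, 130], standard=[75, 120], weights=[0.6, 0.4])
print(s)
```

Under this assumed form, identical feature values give a similarity of 1, and a category with a larger weight m_i pulls the score down more strongly when the two users differ in it, which matches the described role of the weights.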
In an embodiment, the maximum similarity is determined from the similarities between the target user and all the users to be matched, and the user to be matched corresponding to the maximum similarity is taken as the similar user. Once the similar user is confirmed, the electronic medical records or medical texts of the similar user are fed back to the target user, which avoids long waiting times for the user; they are also fed back to a clinician as a diagnosis reference, which can improve the clinician's diagnosis efficiency.
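Selecting the similar user then reduces to taking the candidate with the largest similarity. A minimal sketch, assuming the similarities have already been computed and are held in a dictionary keyed by a hypothetical user identifier:

```python
def most_similar_user(similarities):
    """Return the user id with the largest similarity, or None if there are none."""
    if not similarities:
        return None
    return max(similarities, key=similarities.get)


# Hypothetical similarities between the target user and each user to be matched.
similarities = {"u1": 0.62, "u3": 0.91, "u7": 0.47}
best = most_similar_user(similarities)
print(best)
```

If several candidates share the maximum, `max` returns the first one encountered; the source does not say how ties should be resolved.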
Second embodiment
Based on the same inventive concept as the method in the first embodiment, correspondingly, the embodiment also provides an acquisition system of similar users.
Fig. 3 is a schematic diagram of a system for acquiring similar users according to the present invention.
As shown in fig. 3, the system 3 comprises: a first model establishing module 31, an identification model establishing module 32, an extraction result acquisition module 33 and a similar user acquisition module 34.
The first model establishing module is used for acquiring labeled text data of the target field, forming a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;
the identification model establishing module is used for acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
the extraction result acquisition module is used for acquiring text data of a target user, inputting the text data into the first model and acquiring an extraction result;
and the similar user acquisition module is used for desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to acquire similar users.
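As a non-limiting illustration, the desensitization step performed by the similar user acquisition module can be sketched as follows. The trained first model is replaced here by a simple lexicon lookup, which is an assumption made only to keep the example self-contained; the function names, the mask token, and the sample text are likewise hypothetical.

```python
def extract_sensitive_words(text, lexicon):
    """Stand-in for the first model: return the lexicon entries found in the text.

    A real first model would be trained on the first sample data set; the
    lexicon lookup is only an assumption for this sketch.
    """
    return [word for word in lexicon if word in text]


def desensitize(text, sensitive_words, mask="***"):
    """Replace every extracted sensitive word with a mask token."""
    # Longest words first, so a shorter word never splits a longer match.
    for word in sorted(sensitive_words, key=len, reverse=True):
        text = text.replace(word, mask)
    return text


# Hypothetical target-user text and sensitive-word lexicon.
record = "Patient Zhang San, phone 13800000000, reports chest pain."
extraction_result = extract_sensitive_words(record, ["Zhang San", "13800000000"])
target_desensitization_data = desensitize(record, extraction_result)
print(target_desensitization_data)
```

The resulting target desensitization data is what would be passed to the identification model.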
In some exemplary embodiments, the similar user acquisition module comprises:
the user category acquisition unit is used for inputting the target desensitization data into the identification model to obtain a target user category;
the to-be-matched user obtaining unit is used for obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of to-be-matched users;
the similar user acquisition unit is used for acquiring the similarity between the target user and the user to be matched and acquiring the similar user according to that similarity, wherein acquiring the similarity between the target user and the user to be matched comprises:
acquiring text data of a user to be matched, and performing feature extraction on the text data to obtain first feature data;
performing feature extraction on the text data of the target user to obtain second feature data;
and acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.
In some exemplary embodiments, the similar user acquiring unit includes:
the index acquisition unit is used for acquiring a characteristic category standard index;
and the similarity obtaining subunit is configured to obtain a similarity between the first feature data and the second feature data according to a preset feature category weight, a feature category standard index, the first feature sub-data, and the second feature sub-data.
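As a non-limiting illustration, one possible way of combining the feature category weights, the feature category standard indexes, and the feature sub-data is sketched below. This passage does not state the formula, so the normalized-difference scoring used here is an assumption, as are the function name, the category names, and all numeric values.

```python
def weighted_similarity(first, second, weights, standard_index):
    """Weighted similarity over the feature categories shared by both users.

    Assumed scoring (not taken from the source): for each category, the
    absolute difference of the sub-data is divided by the category's
    standard index, clipped to [0, 1], and turned into a per-category
    similarity; the per-category similarities are then averaged using the
    preset category weights.
    """
    shared = [c for c in first
              if c in second and c in weights and c in standard_index]
    total_weight = sum(weights[c] for c in shared)
    if total_weight <= 0:
        return 0.0
    score = 0.0
    for c in shared:
        if standard_index[c] <= 0:
            raise ValueError(f"standard index for {c!r} must be positive")
        diff = abs(first[c] - second[c]) / standard_index[c]
        score += weights[c] * (1.0 - min(diff, 1.0))
    return score / total_weight


# Hypothetical feature sub-data, weights, and standard indexes.
first_feature_data = {"blood_pressure": 120.0, "blood_oxygen": 98.0}
second_feature_data = {"blood_pressure": 130.0, "blood_oxygen": 96.0}
category_weights = {"blood_pressure": 0.6, "blood_oxygen": 0.4}
category_standard_index = {"blood_pressure": 40.0, "blood_oxygen": 10.0}

similarity = weighted_similarity(first_feature_data, second_feature_data,
                                 category_weights, category_standard_index)
print(round(similarity, 4))
```

Dividing by the standard index puts categories with different units on a comparable scale before the weights are applied, which is the motivation assumed here for using a per-category index.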
In some exemplary embodiments, the identification model establishing module is used for inputting the second sample data set into the second model, extracting the features of the second sample data set, inputting the extracted features into the third model, and classifying the extracted features.
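As a non-limiting illustration, the flow of feature extraction by a second model followed by classification by a third model can be sketched in plain Python, taking a one-dimensional convolution with max pooling as the second model and Softmax as the third model. The kernels and class weights below are untrained, hypothetical values, and the function names are introduced only for this example.

```python
import math


def conv1d_max_pool(sequence, kernels):
    """One feature per kernel: valid 1-D convolution (cross-correlation, as in
    most CNN libraries) followed by max-over-time pooling."""
    features = []
    for kernel in kernels:
        k = len(kernel)
        if len(sequence) < k:
            raise ValueError("sequence shorter than kernel")
        responses = [
            sum(kernel[j] * sequence[i + j] for j in range(k))
            for i in range(len(sequence) - k + 1)
        ]
        features.append(max(responses))
    return features


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_category(sequence, kernels, class_weights):
    """Extract features, apply a linear layer, and classify with softmax."""
    features = conv1d_max_pool(sequence, kernels)
    logits = [sum(w * f for w, f in zip(row, features)) for row in class_weights]
    probabilities = softmax(logits)
    return probabilities.index(max(probabilities)), probabilities


# Hypothetical vectorized desensitized text and untrained parameters.
sample = [0.1, 0.5, 0.9, 0.3]
kernels = [[1.0, -1.0], [0.5, 0.5]]
class_weights = [[1.0, 0.0], [0.0, 1.0]]
category, probabilities = predict_category(sample, kernels, class_weights)
print(category, probabilities)
```

In practice the kernels and class weights would be learned from the second sample data set rather than fixed by hand.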
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment also provides an electronic device, including: a processor and a memory;
the memory is used for storing a computer program, and the processor is used for executing the computer program stored in the memory, so that the electronic device executes the method in the embodiment.
Those skilled in the art will understand the computer-readable storage medium in the present embodiment as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware under the control of a computer program. The computer program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks, or optical disks.
The electronic device provided by the embodiment comprises a processor, a memory, a transceiver and a communication interface. The memory and the communication interface are connected with the processor and the transceiver so that they can communicate with one another; the memory is used for storing a computer program, the communication interface is used for communication, and the processor and the transceiver are used for running the computer program so that the electronic device executes the steps of the method.
In this embodiment, the memory may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above-described embodiments, references in the specification to "the present embodiment," "an embodiment," "another embodiment," "an example embodiment," or "other embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. Multiple occurrences of "the present embodiment," "an embodiment," "another embodiment," or "an example embodiment" do not necessarily all refer to the same embodiment.
Although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may be used with the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications, and variations that fall within the broad scope of the appended claims.
The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, it is described only briefly; for the relevant points, reference may be made to the corresponding description of the method embodiment.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those skilled in the art without departing from the spirit and technical ideas of the present invention shall be covered by the claims of the present invention.
Claims (10)
1. A method for obtaining similar users, the method comprising:
acquiring labeled text data of a target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;
acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;
desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to obtain similar users.
2. The method for acquiring similar users according to claim 1, wherein the step of inputting the target desensitization data into the identification model to acquire similar users specifically comprises:
inputting the target desensitization data into the identification model to obtain a target user category;
obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of users to be matched;
and acquiring the similarity between a target user and a user to be matched, and acquiring a similar user according to the similarity between the target user and the user to be matched.
3. The method for acquiring similar users according to claim 2, wherein the step of acquiring the similarity between the target user and the user to be matched specifically comprises:
acquiring text data of the user to be matched, and performing feature extraction on the text data to obtain first feature data;
performing the feature extraction on the text data of the target user to obtain second feature data;
and acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.
4. The method for obtaining similar users according to claim 3, wherein the first feature data includes a feature category and first feature sub-data, the second feature data includes a feature category and second feature sub-data, and the step of obtaining the similarity between the first feature data and the second feature data specifically includes:
acquiring a characteristic category standard index;
and acquiring the similarity of the first characteristic data and the second characteristic data according to the preset characteristic category weight, the characteristic category standard index, the first characteristic subdata and the second characteristic subdata.
5. The similar user acquisition method according to claim 1, wherein the second model is a convolutional neural network, and the third model is Softmax.
6. The method for acquiring similar users according to claim 5, wherein the step of establishing an identification model for user category identification according to the second sample data set specifically includes:
inputting the second sample data set into the convolutional neural network, extracting the features of the second sample data set, inputting the extracted features into Softmax, and classifying the extracted features.
7. The method according to claim 1, wherein the step of obtaining the text data of the sample user, inputting the text data of the sample user into the first model, and forming a second sample data set specifically includes:
acquiring text data of the sample user, and performing vectorization processing on the text data of the sample user to obtain a vector data set;
inputting the vector data set into the first model to obtain sample sensitive data;
and performing desensitization treatment on the text data of the sample user according to the sample sensitive data to form a second sample data set.
8. A similar user acquisition system, the system comprising:
the first model establishing module is used for acquiring labeled text data of a target field to form a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;
the identification model establishing module is used for acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
the extraction result acquisition module is used for acquiring text data of a target user, inputting the text data into the first model and acquiring an extraction result;
and the similar user acquisition module is used for desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to acquire similar users.
9. An electronic device comprising a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110922782.4A CN113569293B (en) | 2021-08-12 | 2021-08-12 | Similar user acquisition method, system, electronic equipment and medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113569293A (en) | 2021-10-29 |
CN113569293B (en) | 2024-06-07 |
Family
ID=78171483
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110922782.4A Active CN113569293B (en) | 2021-08-12 | 2021-08-12 | Similar user acquisition method, system, electronic equipment and medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113569293B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807207A (en) * | 2019-10-30 | 2020-02-18 | 腾讯科技(深圳)有限公司 | Data processing method and device, electronic equipment and storage medium |
CN110909224A (en) * | 2019-11-22 | 2020-03-24 | 浙江大学 | Sensitive data automatic classification and identification method and system based on artificial intelligence |
US20200097910A1 (en) * | 2017-06-12 | 2020-03-26 | Sensory Technologies Of Canada Inc. | A system for generating a record of community-based patient care |
CN111143884A (en) * | 2019-12-31 | 2020-05-12 | 北京懿医云科技有限公司 | Data desensitization method and device, electronic equipment and storage medium |
CN112071425A (en) * | 2020-09-04 | 2020-12-11 | 平安科技(深圳)有限公司 | Data processing method and device, computer equipment and storage medium |
US20200402625A1 (en) * | 2019-06-21 | 2020-12-24 | nference, inc. | Systems and methods for computing with private healthcare data |
CN112329055A (en) * | 2020-11-02 | 2021-02-05 | 微医云(杭州)控股有限公司 | Method and device for desensitizing user data, electronic equipment and storage medium |
CN112784008A (en) * | 2020-07-16 | 2021-05-11 | 上海芯翌智能科技有限公司 | Case similarity determining method and device, storage medium and terminal |
CN113127605A (en) * | 2021-06-17 | 2021-07-16 | 明品云(北京)数据科技有限公司 | Method and system for establishing target recognition model, electronic equipment and medium |
CN113160999A (en) * | 2021-04-25 | 2021-07-23 | 厦门拜特信息科技有限公司 | Data structured analysis system and data processing method for medical decision |
CN113221747A (en) * | 2021-05-13 | 2021-08-06 | 支付宝(杭州)信息技术有限公司 | Privacy data processing method, device and equipment based on privacy protection |
Non-Patent Citations (3)
Title |
---|
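YONG MA et al.: "Quantitative evaluation model of desensitization algorithm", 2021 IEEE ASIA-PACIFIC CONFERENCE ON IMAGE PROCESSING, ELECTRONICS AND COMPUTERS (IPEC), 7 May 2021 (2021-05-07) * |
ZHANG Jiaying: "Standardization algorithm for laboratory test and examination indicators in a regional healthcare platform", Journal of Wuhan University (Natural Science Edition), vol. 56, no. 09, 30 September 2019 (2019-09-30) * |
GUO Jinjing; ZHANG Xue; LIN Xin; REN Huiling: "Analysis of patient privacy leakage cases in China and the current state of privacy protection", Journal of Medical Informatics, no. 02, 25 February 2020 (2020-02-25) * |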
Also Published As
Publication number | Publication date |
---|---|
CN113569293B (en) | 2024-06-07 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20240808 Address after: 201615 room 1904, G60 Kechuang building, No. 650, Xinzhuan Road, Songjiang District, Shanghai Patentee after: Shanghai Mingping Medical Data Technology Co.,Ltd. Country or region after: China Address before: 102400 no.86-n3557, Wanxing Road, Changyang, Fangshan District, Beijing Patentee before: Mingpinyun (Beijing) data Technology Co.,Ltd. Country or region before: China |