CN113569293A - Similar user acquisition method, system, electronic device and medium - Google Patents

Info

Publication number
CN113569293A
Authority
CN
China
Prior art keywords
data
user
model
sample
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110922782.4A
Other languages
Chinese (zh)
Other versions
CN113569293B (en)
Inventor
姚娟娟
钟南山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Mingping Medical Data Technology Co ltd
Original Assignee
Mingpinyun Beijing Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Mingpinyun Beijing Data Technology Co Ltd
Priority to CN202110922782.4A
Publication of CN113569293A
Application granted
Publication of CN113569293B
Current legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 Selection of the most significant subset of features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Public Health (AREA)
  • Computing Systems (AREA)
  • Bioethics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Primary Health Care (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of data processing and provides a method, a system, an electronic device and a medium for acquiring similar users. The method comprises: acquiring labeled text data of a target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set; acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set; acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result; and desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to obtain similar users. This addresses the difficulty of obtaining effective user data from a large volume of user data.

Description

Similar user acquisition method, system, electronic device and medium
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a method, a system, an electronic device, and a medium for acquiring similar users.
Background
With the development of society and the rapid advance of technology, users are active on more and more platforms, and the amount of data they generate is growing exponentially. As the data volume grows, screening valid data out of the mass of data becomes more difficult. In addition, user-generated data contains much information related to user privacy, so the private data must be protected while the valid data is screened out.
Disclosure of Invention
The invention provides a method, a system, electronic equipment and a medium for acquiring similar users, which aim to solve the problem of high difficulty in acquiring effective user data caused by large user data volume in the prior art.
The method for acquiring the similar users comprises the following steps:
acquiring labeled text data of a target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;
acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;
desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the recognition model to obtain similar users.
Optionally, the step of inputting the target desensitization data into the recognition model to obtain similar users specifically includes:
inputting the target desensitization data into the identification model to obtain a target user category;
obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of users to be matched;
and acquiring the similarity between the target user and the user to be matched, and acquiring similar users according to the similarity.
Optionally, the step of obtaining the similarity between the target user and the user to be matched specifically includes:
acquiring text data of the user to be matched, and performing feature extraction on the text data to obtain first feature data;
performing the feature extraction on the text data of the target user to obtain second feature data;
and acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.
Optionally, the step of obtaining the similarity between the first feature data and the second feature data includes:
acquiring a characteristic category standard index;
and acquiring the similarity of the first characteristic data and the second characteristic data according to the preset characteristic category weight, the characteristic category standard index, the first characteristic subdata and the second characteristic subdata.
Optionally, the second model is a convolutional neural network, and the third model is Softmax.
Optionally, the step of establishing an identification model for user category identification according to the second sample data set specifically includes:
inputting the second sample data set into the convolutional neural network, extracting the features of the second sample data set, inputting the extracted features into Softmax, and classifying the extracted features.
Optionally, the step of obtaining text data of a sample user, inputting the text data of the sample user into the first model, and forming a second sample data set specifically includes:
acquiring text data of the sample user, and performing vectorization processing on the text data of the sample user to obtain a vector data set;
inputting the vector data set into the first model to obtain sample sensitive data;
and performing desensitization treatment on the text data of the sample user according to the sample sensitive data to form a second sample data set.
The invention also provides a system for acquiring similar users, which comprises:
the first model establishing module is used for acquiring labeled text data of a target field to form a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;
the identification model establishing module is used for acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
the extraction result acquisition module is used for acquiring text data of a target user, inputting the text data into the first model and acquiring an extraction result;
and the similar user acquisition module is used for desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to acquire similar users.
The present invention also provides an electronic device comprising: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the similar user acquisition method.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the similar user acquisition method as described above.
The invention has the beneficial effects that: according to the method for acquiring the similar users, a first model for sensitive word extraction is established according to the labeled text data, then an identification model for user category identification is established according to a second sample data set, the first model is adopted to extract the sensitive words from the text data of the target user, target desensitization data is acquired according to the extraction result, and the target desensitization data is input into the identification model, so that the similar users are acquired. Sensitive word extraction is carried out on the text data of the target user through the first model, so that the protection of the privacy data of the target user is realized; similar users are obtained by inputting the target desensitization data into the identification model, so that the similar users of the target users are obtained, and subsequent related schemes or related data recommendation is facilitated.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method for acquiring similar users according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for obtaining similarity between a target user and a user to be matched according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a system for acquiring similar users according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the components related to the present invention are only shown in the drawings rather than drawn according to the number, shape and size of the components in actual implementation, and the type, quantity and proportion of the components in actual implementation may be changed freely, and the layout of the components may be more complicated.
In order to explain the technical means of the present invention, the following description will be given by way of specific examples.
In the field of medical health, the conventional medical service system suffers from problems such as a shortage of medical resources, their uneven distribution, a lack of medical knowledge among patients, poor awareness of seeking medical care, and strained doctor-patient relationships; as a result, online inquiry platforms have developed rapidly and spread widely in recent years. However, online inquiry often has the following problems: users' private data is leaked, so that personal private information is stolen; users often find it difficult to describe their own condition accurately and comprehensively; and users' questions cannot be answered in time. To identify the semantics and intent of a user, avoid long waits and leakage of the user's private data, and at the same time make use of the high-quality historical cases already resolved on the online inquiry platform, the invention provides a similar user acquisition method.
First embodiment
Fig. 1 is a flowchart illustrating a similar user obtaining method according to an embodiment of the present invention.
As shown in fig. 1, the similar user obtaining method includes steps S110 to S140:
S110, obtaining labeled text data of the target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;
S120, acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set;
S130, acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;
S140, desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the recognition model to obtain similar users.
In step S110 of the present embodiment, the target field includes at least the medical field. Taking labeled text data in the medical field as an example, the labeled text data is derived from publicly available medical text data at home and abroad, such as periodicals, inquiry records, doctors' orders and electronic medical records in the medical field; paper disease diagnosis records can also be entered by scanning or other means to form labeled text data in the medical field. The first sample data set is obtained as follows: the acquired labeled text data of the target field is pre-trained with a word vector pre-training model, and the first sample data set is obtained from the pre-trained text data; the word vector pre-training model includes, but is not limited to, BERT. The first model is a named entity recognition model, which includes, but is not limited to, BiLSTM-CRF. The first model is trained with the first sample data set, and its output is the user's sensitive words. User sensitive words include, but are not limited to, identification number, name, address, cell phone number, bank card number and zip code.
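A sequence-labeling model such as BiLSTM-CRF typically emits one tag per token, which then has to be grouped into sensitive words. The following is a minimal sketch of that grouping step, assuming a BIO tagging scheme; the scheme, the label names NAME and PHONE, and the function name decode_bio are illustrative assumptions, since the description does not specify the output format of the first model.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity being built, if any, and reset the state.
        nonlocal current_type, current_tokens
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            flush()
    flush()
    return entities


# Hypothetical model output for one sentence (labels are assumptions).
tokens = ["Patient", "Zhang", "Wei", "phone", "13800000000", "reports", "chest", "pain"]
tags = ["O", "B-NAME", "I-NAME", "O", "B-PHONE", "O", "O", "O"]
sensitive_words = decode_bio(tokens, tags)
```

The resulting list of sensitive words is what the later desensitization step would consume.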
In step S120 of this embodiment, the text data of the sample user is derived from the electronic medical records and medical texts of the sample user. The step of obtaining the second sample data set comprises: obtaining text data of a sample user, and vectorizing the text data to obtain a vector data set; inputting the vector data set into the first model to obtain sample sensitive data; and desensitizing the text data of the sample user according to the sample sensitive data to form the second sample data set. The word vector pre-training model may be used to vectorize the text data of the sample user. Desensitizing the text data of the sample user may consist of deleting the sample sensitive data from the text data, or of replacing the sample sensitive data in the text data with other characters. Because the sample sensitive data is first extracted by the first model and the text data is then desensitized accordingly, the sample user's sensitive words are desensitized, which effectively prevents leakage of the sample user's private data and theft of the sample user's personal private information.
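The two desensitization options described above (deleting the sensitive data, or replacing it with characters) can be sketched as follows. The function name desensitize, the "*" mask character and the plain string matching are assumptions made for illustration; the description does not prescribe a particular replacement character or matching strategy.

```python
from typing import Iterable


def desensitize(text: str, sensitive_words: Iterable[str],
                mask: str = "*", delete: bool = False) -> str:
    """Delete or mask every occurrence of each sensitive word in text."""
    # Handle longer words first so a short word never splits a longer one.
    for word in sorted(set(sensitive_words), key=len, reverse=True):
        if not word:
            continue  # an empty string would match everywhere
        replacement = "" if delete else mask * len(word)
        text = text.replace(word, replacement)
    return text


record = "Zhang Wei, phone 13800000000, reports chest pain."
found = ["Zhang Wei", "13800000000"]  # hypothetical output of the first model
masked = desensitize(record, found)
deleted = desensitize(record, found, delete=True)
```

Masking keeps the length and position of the removed text, which may help a downstream model that relies on token positions; deletion yields shorter text but shifts everything after it.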
In an embodiment, the identification model comprises a second model for feature extraction and a third model for predicting user categories. The second model is a convolutional neural network model, and the third model is Softmax. The second sample data set is input into the convolutional neural network to extract features, and the extracted features are input into Softmax for classification. Specifically, part of the data in the second sample data set is labeled with disease categories (user categories), and the labeled second sample data set is then used to train the convolutional neural network model and Softmax: after forward propagation, a cross-entropy loss function determines the loss of the final classification, and the back-propagation algorithm iteratively updates the weights until the loss value of the whole convolutional neural network model converges, yielding the identification model. The convolutional neural network comprises an input layer, a convolutional layer containing a plurality of convolution kernels, a pooling layer, two Dropout layers, two fully connected layers and an output layer, and the activation function is the ReLU function.
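The classification head and loss named above can be illustrated in isolation. This is a minimal sketch of Softmax and the cross-entropy loss on a single example, written in plain Python; the logits are made-up numbers standing in for the output of the last fully connected layer, and the convolutional feature extractor itself is not shown.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], true_class: int) -> float:
    """Negative log-probability assigned to the labeled class."""
    return -math.log(probs[true_class])


# Hypothetical scores for three disease categories (illustrative values only).
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)
loss = cross_entropy(probs, true_class=0)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

During training, back-propagation would lower this loss by adjusting the network weights; at inference time only the predicted class is needed.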
In step S130 of this embodiment, the text data of the target user is derived from the electronic medical record or medical text of the target user. Inputting the acquired text data of the target user into a first model for extracting sensitive words, and acquiring an extraction result; and the extraction result is the sensitive words of the target user.
In step S140 of this embodiment, desensitization processing is performed on the text data of the target user according to the extraction result, and a desensitization processing mode is consistent with a desensitization processing mode performed on the text data of the sample user, which is not described herein again. Desensitizing the text data of the target user according to the extraction result, so that desensitizing of sensitive words of the target user is realized, the problem of leakage of privacy data of the target user is effectively prevented, and the problem that personal privacy information of the target user is stolen is avoided.
In one embodiment, the step of inputting the target desensitization data into the identification model to obtain similar users comprises: inputting the target desensitization data into the identification model to obtain a target user category; obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of users to be matched; and acquiring the similarity between the target user and each user to be matched, and acquiring the similar user according to that similarity. The advantage is that similarity only needs to be computed between the target user and the users to be matched, rather than between the target user and all sample users, which avoids invalid data processing and improves the processing efficiency of medical data.
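The narrowing step described above, from all sample users to the users to be matched, amounts to a filter on the predicted category. A minimal sketch, assuming each sample user's category is already stored in a dictionary; the user identifiers, category names and the function name users_to_match are hypothetical.

```python
from typing import Dict, List


def users_to_match(target_category: str,
                   sample_categories: Dict[str, str]) -> List[str]:
    """Return the sample users whose category equals the target user's category."""
    return [user for user, category in sample_categories.items()
            if category == target_category]


# Hypothetical sample users and their categories.
sample_categories = {"u1": "hypertension", "u2": "diabetes", "u3": "hypertension"}
candidates = users_to_match("hypertension", sample_categories)
```

Similarity is then computed only for the returned candidates, which is where the claimed efficiency gain comes from.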
In an embodiment, a specific method for obtaining the similarity between the target user and the user to be matched is shown in fig. 2, which is a schematic flow chart of that method according to an embodiment of the present invention.
As shown in fig. 2, obtaining the similarity between the target user and the user to be matched may include the following steps S210 to S230:
S210, acquiring text data of a user to be matched, and performing feature extraction on the text data to obtain first feature data;
S220, performing feature extraction on the text data of the target user to obtain second feature data;
S230, acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.
In an embodiment, before step S210, the method further includes: obtaining, according to the user category of the user to be matched and the user category of the target user, the disease evaluation data corresponding to that user category (disease category). Feature extraction on the text data of the user to be matched and on the text data of the target user then means extracting the disease evaluation data (the first feature data and the second feature data, respectively) from those texts. The disease evaluation data includes, but is not limited to, heart rate data, blood pressure data, blood sugar data and blood oxygen saturation data. The first feature data comprises a plurality of first feature subdata and a plurality of feature categories, and the second feature data comprises a plurality of second feature subdata and a plurality of feature categories; the number of first feature subdata is the same as the number of second feature subdata, and the feature categories of the first feature data and the second feature data are also the same. Feature categories include, but are not limited to, heart rate, blood pressure, blood glucose and blood oxygen saturation.
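As an illustration of extracting per-category feature subdata from free text, the sketch below pulls numeric readings out of a sentence with regular expressions. The English phrasing, the patterns, the category keys and the function name extract_features are all assumptions; the description does not say how the disease evaluation data is located in the text, and real records would need patterns suited to their language and format.

```python
import re
from typing import Dict, Optional

NUMBER = r"(\d+(?:\.\d+)?)"

# Hypothetical patterns, one per feature category.
PATTERNS = {
    "heart_rate": re.compile(r"heart rate\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
    "blood_glucose": re.compile(r"blood glucose\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
    "blood_oxygen": re.compile(r"blood oxygen saturation\s*[:=]?\s*" + NUMBER, re.IGNORECASE),
}


def extract_features(text: str) -> Dict[str, Optional[float]]:
    """Return one value per feature category, or None when no reading is found."""
    features: Dict[str, Optional[float]] = {}
    for category, pattern in PATTERNS.items():
        match = pattern.search(text)
        features[category] = float(match.group(1)) if match else None
    return features


first = extract_features(
    "Heart rate 88 bpm, blood glucose 6.1 mmol/L, blood oxygen saturation 97%."
)
second = extract_features("Blood glucose: 7.4; heart rate: 92.")
```

Both results share the same category keys, matching the requirement that the two feature data sets cover the same feature categories; a category with no reading is returned as None and would need to be handled before any similarity is computed.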
In an embodiment, obtaining the similarity between the first feature data and the second feature data includes: acquiring a feature category standard index; and obtaining the similarity of the first feature data and the second feature data according to the preset feature category weight, the feature category standard index, the first feature subdata and the second feature subdata. The feature category standard index may be obtained directly from published medical data.
Specifically, the mathematical expression of the similarity of the first feature data and the second feature data is:
[Formula image: expression for S(X1, X2); not reproduced in the text]
wherein S(X1, X2) is the similarity of the first feature data and the second feature data; X1 is the first feature data; X2 is the second feature data; n is the number of feature categories corresponding to the first feature data and the second feature data; i is the index of the feature category; a_i is the first feature subdata corresponding to feature category i; b_i is the second feature subdata corresponding to feature category i; l_i is the feature category standard index corresponding to feature category i; and m_i is the preset feature category weight corresponding to feature category i. The preset feature category weight can be determined according to the user category corresponding to the target user and the user to be matched. Specifically, it may be determined according to the correlation between a feature category corresponding to the user category and the disease corresponding to the user category: the weight is positively correlated with that correlation, so the stronger the correlation, the greater the preset feature category weight. The correlation between the feature categories corresponding to a user category and the disease corresponding to that user category can be obtained directly from public medical knowledge.
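Because the expression for S(X1, X2) is not reproduced in this text, the following is only an illustration of how the named quantities (a_i, b_i, l_i, m_i, n) could be combined, and it should not be read as the patent's formula. The assumed form scales each absolute difference by the standard index l_i, caps it at 1, and averages the per-category agreement using normalized weights m_i; the function name weighted_similarity and the sample numbers are likewise hypothetical.

```python
from typing import Sequence


def weighted_similarity(a: Sequence[float], b: Sequence[float],
                        standard: Sequence[float],
                        weights: Sequence[float]) -> float:
    """Assumed form: sum_i (m_i / sum(m)) * (1 - min(|a_i - b_i| / l_i, 1))."""
    if not (len(a) == len(b) == len(standard) == len(weights)):
        raise ValueError("all inputs must cover the same n feature categories")
    total_weight = sum(weights)
    score = 0.0
    for a_i, b_i, l_i, m_i in zip(a, b, standard, weights):
        # Standard indices are assumed positive; the gap is capped at 1.
        gap = min(abs(a_i - b_i) / l_i, 1.0)
        score += (m_i / total_weight) * (1.0 - gap)
    return score


# Hypothetical readings for two feature categories (heart rate, blood glucose).
x1 = [88.0, 6.1]        # first feature subdata a_i
x2 = [92.0, 7.4]        # second feature subdata b_i
standard = [75.0, 5.5]  # standard indices l_i (illustrative values)
weights = [1.0, 3.0]    # preset weights m_i (illustrative values)
s = weighted_similarity(x1, x2, standard, weights)
```

Under this assumed form the score lies between 0 and 1, identical readings give 1, and a larger weight makes disagreement in that category cost more, which is consistent with the stated role of the weights.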
In an embodiment, the maximum similarity is determined among the similarities between the target user and all the users to be matched, and the user to be matched corresponding to the maximum similarity is taken as the similar user. Once the similar user is confirmed, the similar user's electronic medical records or medical texts can be fed back to the target user, which avoids a long wait for the user; they can also be fed back to a clinician as a diagnostic reference, which can improve the clinician's diagnostic efficiency.
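Selecting the similar user is then a maximum over the per-candidate similarities. A minimal sketch, assuming the similarities have already been computed and stored by user identifier; the identifiers, scores and the function name most_similar_user are made up for illustration.

```python
from typing import Dict, Optional


def most_similar_user(similarities: Dict[str, float]) -> Optional[str]:
    """Return the user to be matched with the highest similarity, or None if empty."""
    if not similarities:
        return None
    return max(similarities, key=similarities.get)


# Hypothetical similarity scores for three users to be matched.
scores = {"u1": 0.62, "u3": 0.81, "u7": 0.74}
best = most_similar_user(scores)
```

If several users tie for the maximum, this sketch returns the first one encountered; the description does not say how ties should be resolved.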
Second embodiment
Based on the same inventive concept as the method in the first embodiment, correspondingly, the embodiment also provides an acquisition system of similar users.
Fig. 3 is a schematic diagram of a system for acquiring similar users according to the present invention.
As shown in fig. 3, the system 3 comprises: a first model establishing module 31, an identification model establishing module 32, an extraction result acquisition module 33 and a similar user acquisition module 34.
The first model establishing module is used for acquiring labeled text data of the target field, forming a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;
the identification model establishing module is used for acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
the extraction result acquisition module is used for acquiring text data of a target user, inputting the text data into the first model and acquiring an extraction result;
and the similar user acquisition module is used for desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to acquire similar users.
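The way the modules hand data to one another at inference time can be sketched as a small class that wires together four callables. Everything here is a stand-in: the class name SimilarUserSystem, the field names and the stub functions are hypothetical, and trained models would replace the stubs in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SimilarUserSystem:
    extract_sensitive: Callable[[str], List[str]]   # first model
    desensitize: Callable[[str, List[str]], str]    # desensitization step
    classify: Callable[[str], str]                  # identification model
    find_similar: Callable[[str, str], List[str]]   # (category, text) -> similar users

    def run(self, target_text: str) -> List[str]:
        words = self.extract_sensitive(target_text)  # extraction result
        clean = self.desensitize(target_text, words)  # target desensitization data
        category = self.classify(clean)               # target user category
        return self.find_similar(category, clean)


def mask_words(text: str, words: List[str]) -> str:
    # Stub desensitizer: replace each sensitive word with asterisks.
    for word in words:
        text = text.replace(word, "*" * len(word))
    return text


# Stub components for illustration only.
system = SimilarUserSystem(
    extract_sensitive=lambda text: [w for w in ("Zhang Wei",) if w in text],
    desensitize=mask_words,
    classify=lambda text: "hypertension" if "blood pressure" in text else "other",
    find_similar=lambda category, text: {"hypertension": ["u1", "u3"]}.get(category, []),
)
result = system.run("Zhang Wei reports high blood pressure.")
```

Passing the components in as callables keeps the data flow independent of how each model is implemented or trained.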
In some exemplary embodiments, the similar user acquisition module comprises:
the user category acquisition unit is used for inputting the target desensitization data into the identification model to obtain a target user category;
the to-be-matched user obtaining unit is used for obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of to-be-matched users;
the similar user acquisition unit is used for acquiring the similarity between the target user and the user to be matched and acquiring the similar user according to the similarity between the target user and the user to be matched, wherein acquiring the similarity comprises:
acquiring text data of a user to be matched, and performing feature extraction on the text data to obtain first feature data;
performing feature extraction on the text data of the target user to obtain second feature data;
and acquiring the similarity of the first characteristic data and the second characteristic data, and acquiring the similarity of the target user and the user to be matched according to the similarity of the first characteristic data and the second characteristic data.
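The three units above can be illustrated by the following sketch, which narrows the sample users to those sharing the target user category and then ranks them by feature similarity. It is an illustration only: cosine similarity, the `top_k` cut-off and the function names are assumptions, since the embodiment does not fix how the similar user is selected from the similarities.

```python
import math
from typing import Callable, Dict, List, Tuple

Features = List[float]


def cosine(a: Features, b: Features) -> float:
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar_users(
    target_features: Features,
    target_category: str,
    sample_users: Dict[str, Tuple[str, Features]],  # user id -> (category, features)
    similarity: Callable[[Features, Features], float] = cosine,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return up to top_k users of the same category, most similar first."""
    # Users to be matched: sample users whose category equals the target user category.
    to_match = {
        uid: feats
        for uid, (category, feats) in sample_users.items()
        if category == target_category
    }
    scored = [(uid, similarity(target_features, feats)) for uid, feats in to_match.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Filtering by category before computing similarities keeps the number of pairwise comparisons proportional to the size of one category rather than to the whole sample population.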
In some exemplary embodiments, the first feature data includes a feature category and first feature sub-data, the second feature data includes a feature category and second feature sub-data, and the similar user acquisition unit includes:
the index acquisition subunit is used for acquiring a feature category standard index;
and the similarity acquisition subunit is used for acquiring the similarity between the first feature data and the second feature data according to a preset feature category weight, the feature category standard index, the first feature sub-data and the second feature sub-data.
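One possible way to combine the feature category weight, the feature category standard index and the two sets of feature sub-data is sketched below. The formula is an assumption made for illustration and is not taken from the embodiment: the standard index is treated as a per-category scale that normalizes the difference between the two sub-data values, and the per-category similarities are combined by a weighted average. The function name `weighted_similarity` and the example categories are hypothetical.

```python
from typing import Dict


def weighted_similarity(
    first: Dict[str, float],           # feature category -> first feature sub-data
    second: Dict[str, float],          # feature category -> second feature sub-data
    weights: Dict[str, float],         # preset feature category weights
    standard_index: Dict[str, float],  # feature category standard index (assumed to be a scale)
) -> float:
    """Weighted average of per-category similarities in the range [0, 1]."""
    score = 0.0
    total_weight = 0.0
    for category, weight in weights.items():
        if category not in first or category not in second:
            continue  # only categories present for both users are compared
        # Difference expressed as a fraction of the category's standard index.
        relative_diff = abs(first[category] - second[category]) / standard_index[category]
        per_category = max(0.0, 1.0 - relative_diff)
        score += weight * per_category
        total_weight += weight
    return score / total_weight if total_weight else 0.0
```

Normalizing by a per-category index keeps categories measured in different units comparable before the weights are applied.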
In some exemplary embodiments, the identification model establishing module is used for:
inputting the second sample data set into the second model to extract features of the second sample data set, and inputting the extracted features into the third model to classify the extracted features.
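The two-stage structure above, with feature extraction followed by classification, can be sketched as a forward pass. Claim 5 names a convolutional neural network as the second model and Softmax as the third model; everything else here is an assumption for illustration, including the single convolution layer with max-over-time pooling, the hand-set weights, and the function names. Training of the weights is not shown.

```python
import math
from typing import List

Vector = List[float]


def conv1d_max_pool(sequence: List[Vector], kernel: List[Vector], bias: float) -> float:
    """Slide one kernel over the token vectors, apply ReLU, then max-over-time pooling."""
    width = len(kernel)
    activations = []
    for start in range(len(sequence) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(x * w for x, w in zip(sequence[start + offset], kernel[offset]))
        activations.append(max(0.0, total))  # ReLU
    return max(activations) if activations else 0.0


def softmax(logits: Vector) -> Vector:
    """Numerically stable softmax over the class logits."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    sequence: List[Vector],        # vectorized, desensitized text (one vector per token)
    kernels: List[List[Vector]],   # second model: one convolution kernel per feature
    biases: Vector,
    class_weights: List[Vector],   # third model: one weight row per user category
    class_biases: Vector,
) -> Vector:
    """Extract features with the convolution layer, then predict category probabilities."""
    features = [conv1d_max_pool(sequence, k, b) for k, b in zip(kernels, biases)]
    logits = [
        sum(f * w for f, w in zip(features, row)) + cb
        for row, cb in zip(class_weights, class_biases)
    ]
    return softmax(logits)
```

The index of the largest returned probability would serve as the predicted user category in this sketch.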
The present embodiment also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements any of the methods in the present embodiments.
The present embodiment also provides an electronic device, including: a processor and a memory;
the memory is used for storing computer programs, and the processor is used for executing the computer programs stored by the memory so as to enable the electronic equipment to execute the method in the embodiment.
The computer-readable storage medium in the present embodiment can be understood by those skilled in the art as follows: all or part of the steps for implementing the above method embodiments may be performed by hardware associated with a computer program. The aforementioned computer program may be stored in a computer readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
The electronic device provided by this embodiment comprises a processor, a memory, a transceiver and a communication interface. The memory and the communication interface are connected to the processor and the transceiver so that they can communicate with one another. The memory is used for storing a computer program, the communication interface is used for carrying out communication, and the processor and the transceiver are used for running the computer program so that the electronic device executes the steps of the method.
In this embodiment, the memory may include a random access memory (RAM), and may also include a non-volatile memory, such as at least one disk memory.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In the above-described embodiments, references in the specification to "the present embodiment," "an embodiment," "another embodiment," "an example embodiment," or "other embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least some embodiments, but not necessarily all embodiments. Multiple occurrences of "the present embodiment," "an embodiment," "another embodiment," or "an example embodiment" do not necessarily all refer to the same embodiment.
Although the present invention has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those skilled in the art in light of the foregoing description. For example, other memory structures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments. The embodiments of the invention are intended to embrace all such alternatives, modifications and variations that fall within the broad scope of the appended claims.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiment is substantially similar to the method embodiment, its description is brief, and for the relevant points reference may be made to the corresponding parts of the description of the method embodiment.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.
The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The foregoing embodiments are merely illustrative of the principles of the present invention and its efficacy, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (10)

1. A method for obtaining similar users, the method comprising:
acquiring labeled text data of a target field to form a first sample data set, and establishing a first model for sensitive word extraction according to the first sample data set;
acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
acquiring text data of a target user, inputting the text data into the first model, and acquiring an extraction result;
and desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to obtain similar users.
2. The method for acquiring similar users according to claim 1, wherein the step of inputting the target desensitization data into the identification model to acquire similar users specifically comprises:
inputting the target desensitization data into the identification model to obtain a target user category;
obtaining a plurality of corresponding sample users according to the target user category to obtain a plurality of users to be matched;
and acquiring the similarity between a target user and a user to be matched, and acquiring a similar user according to the similarity between the target user and the user to be matched.
3. The method for acquiring similar users according to claim 2, wherein the step of acquiring the similarity between the target user and the user to be matched specifically comprises:
acquiring text data of the user to be matched, and performing feature extraction on the text data to obtain first feature data;
performing the feature extraction on the text data of the target user to obtain second feature data;
and acquiring the similarity between the first feature data and the second feature data, and acquiring the similarity between the target user and the user to be matched according to the similarity between the first feature data and the second feature data.
4. The method for obtaining similar users according to claim 3, wherein the first feature data includes a feature category and first feature sub-data, the second feature data includes a feature category and second feature sub-data, and the step of obtaining the similarity between the first feature data and the second feature data specifically includes:
acquiring a feature category standard index;
and acquiring the similarity between the first feature data and the second feature data according to a preset feature category weight, the feature category standard index, the first feature sub-data and the second feature sub-data.
5. The similar user acquisition method according to claim 1, wherein the second model is a convolutional neural network, and the third model is Softmax.
6. The method for acquiring similar users according to claim 5, wherein the step of establishing an identification model for user category identification according to the second sample data set specifically includes:
inputting the second sample data set into the convolutional neural network, extracting the features of the second sample data set, inputting the extracted features into Softmax, and classifying the extracted features.
7. The method according to claim 1, wherein the step of obtaining the text data of the sample user, inputting the text data of the sample user into the first model, and forming a second sample data set specifically includes:
acquiring text data of the sample user, and performing vectorization processing on the text data of the sample user to obtain a vector data set;
inputting the vector data set into the first model to obtain sample sensitive data;
and performing desensitization treatment on the text data of the sample user according to the sample sensitive data to form a second sample data set.
8. A similar user acquisition system, the system comprising:
the first model establishing module is used for acquiring labeled text data of a target field to form a first sample data set and establishing a first model for sensitive word extraction according to the first sample data set;
the identification model establishing module is used for acquiring text data of a sample user, inputting the text data into the first model to form a second sample data set, and establishing an identification model for user category identification according to the second sample data set, wherein the identification model comprises a second model for feature extraction and a third model for predicting user categories;
the extraction result acquisition module is used for acquiring text data of a target user, inputting the text data into the first model and acquiring an extraction result;
and the similar user acquisition module is used for desensitizing the text data of the target user according to the extraction result to obtain target desensitization data, and inputting the target desensitization data into the identification model to acquire similar users.
9. An electronic device comprising a processor, a memory, and a communication bus;
the communication bus is used for connecting the processor and the memory;
the processor is configured to execute a computer program stored in the memory to implement the method of any one of claims 1-7.
10. A computer-readable storage medium, having stored thereon a computer program for causing a computer to perform the method of any one of claims 1-7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110922782.4A CN113569293B (en) 2021-08-12 2021-08-12 Similar user acquisition method, system, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110922782.4A CN113569293B (en) 2021-08-12 2021-08-12 Similar user acquisition method, system, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN113569293A true CN113569293A (en) 2021-10-29
CN113569293B CN113569293B (en) 2024-06-07

Family

ID=78171483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110922782.4A Active CN113569293B (en) 2021-08-12 2021-08-12 Similar user acquisition method, system, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN113569293B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200097910A1 (en) * 2017-06-12 2020-03-26 Sensory Technologies Of Canada Inc. A system for generating a record of community-based patient care
US20200402625A1 (en) * 2019-06-21 2020-12-24 nference, inc. Systems and methods for computing with private healthcare data
CN110807207A (en) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 Data processing method and device, electronic equipment and storage medium
CN110909224A (en) * 2019-11-22 2020-03-24 浙江大学 Sensitive data automatic classification and identification method and system based on artificial intelligence
CN111143884A (en) * 2019-12-31 2020-05-12 北京懿医云科技有限公司 Data desensitization method and device, electronic equipment and storage medium
CN112784008A (en) * 2020-07-16 2021-05-11 上海芯翌智能科技有限公司 Case similarity determining method and device, storage medium and terminal
CN112071425A (en) * 2020-09-04 2020-12-11 平安科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN112329055A (en) * 2020-11-02 2021-02-05 微医云(杭州)控股有限公司 Method and device for desensitizing user data, electronic equipment and storage medium
CN113160999A (en) * 2021-04-25 2021-07-23 厦门拜特信息科技有限公司 Data structured analysis system and data processing method for medical decision
CN113221747A (en) * 2021-05-13 2021-08-06 支付宝(杭州)信息技术有限公司 Privacy data processing method, device and equipment based on privacy protection
CN113127605A (en) * 2021-06-17 2021-07-16 明品云(北京)数据科技有限公司 Method and system for establishing target recognition model, electronic equipment and medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
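YONG MA et al.: "Quantitative evaluation model of desensitization algorithm", 2021 IEEE ASIA-PACIFIC CONFERENCE ON IMAGE PROCESSING, ELECTRONICS AND COMPUTERS (IPEC), 7 May 2021 (2021-05-07) *
张佳影: "Standardization algorithm for laboratory test and examination indicators in regional healthcare platforms", Journal of Wuhan University (Natural Science Edition), vol. 56, no. 09, 30 September 2019 (2019-09-30) *
郭进京; 张雪; 林鑫; 任慧玲: "Analysis of patient privacy leakage cases and the current state of privacy protection in China", Journal of Medical Informatics, no. 02, 25 February 2020 (2020-02-25) *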

Also Published As

Publication number Publication date
CN113569293B (en) 2024-06-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240808

Address after: 201615 room 1904, G60 Kechuang building, No. 650, Xinzhuan Road, Songjiang District, Shanghai

Patentee after: Shanghai Mingping Medical Data Technology Co.,Ltd.

Country or region after: China

Address before: 102400 no.86-n3557, Wanxing Road, Changyang, Fangshan District, Beijing

Patentee before: Mingpinyun (Beijing) data Technology Co.,Ltd.

Country or region before: China