CN113420148B - Training text acquisition method, system and equipment for sensitive content quality inspection model - Google Patents

Publication number
CN113420148B
Authority
CN
China
Prior art keywords
map
ith
account
community
communities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110691208.2A
Other languages
Chinese (zh)
Other versions
CN113420148A (en
Inventor
成杰峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd
Priority to CN202110691208.2A
Publication of CN113420148A
Application granted
Publication of CN113420148B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Animal Behavior & Ethology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of data acquisition and provides a training text acquisition method for a sensitive content quality inspection model, comprising the following steps: acquiring account data of a plurality of users and relationship data among the users to obtain a plurality of account data and a plurality of relationship data; constructing an account contact map according to the plurality of account data and the plurality of relationship data; clustering the account data based on the account contact map to obtain a plurality of user sets; selecting a sensitive account set from the plurality of user sets, wherein the sensitive account set comprises a plurality of sensitive users; collecting historical texts of each sensitive user within a preset time window to obtain a plurality of historical texts; and screening the plurality of historical texts to obtain a plurality of training texts for training the sensitive content quality inspection model. The method reduces the cost and difficulty of acquiring training texts and improves both their accuracy and the efficiency of acquiring them.

Description

Training text acquisition method, system and equipment for sensitive content quality inspection model
Technical Field
The embodiment of the invention relates to the field of data acquisition, in particular to a training text acquisition method, a training text acquisition system and training text acquisition equipment for a sensitive content quality inspection model.
Background
With the rapid development and widespread adoption of the Internet, online public opinion has become a very important part of public opinion as a whole. Compared with traditional media (television, newspapers, broadcasting, and the like), the Internet, as the carrier of online public opinion, is characterized by a very high degree of freedom of speech, burstiness, fast propagation, and a wide audience, and it also places demands such as real-time response and high precision on public opinion monitoring systems. Quality inspection of sensitive, malicious content that is deliberately spread over the network is therefore particularly important.
The quality inspection of sensitive content can also be regarded as a short text classification problem, that is, determining whether a text message sent by a user is normal text or violating text. Conventional sensitive content recognition models typically use supervised machine learning, and their recognition rate depends heavily on the quality of the training texts: the more effective the training texts, the higher the recognition rate. However, malicious users can bypass the sensitive content recognition model or traditional security policies through low-cost means such as interspersed special symbols, homophone substitution, isolated words, visually similar character substitution, and splitting characters into components, and the traditional model then fails to intercept the content effectively. Moreover, training texts are difficult to obtain: existing training texts have to be collected and screened manually, and the collection speed can hardly keep pace with the rate at which sensitive content changes. How to improve the speed and efficiency of acquiring training texts for the sensitive content quality inspection model is therefore a technical problem that currently needs to be solved.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, a system, a device and a readable storage medium for acquiring training text of a quality inspection model of sensitive content, so as to solve the problems of high difficulty in acquiring the training text of the quality inspection model of sensitive content and low acquisition speed and efficiency.
To achieve the above object, an embodiment of the present invention provides a training text collection method for a sensitive content quality inspection model, where the method includes the steps of:
acquiring account data of a plurality of users and relationship data among the users to obtain a plurality of account data and a plurality of relationship data;
constructing an account contact map according to the plurality of account data and the plurality of relationship data;
clustering the account data based on the account contact map to obtain a plurality of user sets;
selecting a sensitive account number set from the plurality of user sets, wherein the sensitive account number set comprises a plurality of sensitive users;
collecting historical texts of each sensitive user in a preset time window to obtain a plurality of historical texts; and
screening the plurality of historical texts to obtain a plurality of training texts for training the sensitive content quality inspection model.
Optionally, the step of constructing an account contact map according to the plurality of account data and the plurality of relationship data includes:
defining each account data as an entity v to obtain an account set $V = \{v_1, v_2, \dots, v_n\}$ corresponding to the plurality of account data;
defining each relationship data as an edge e to obtain a relationship set $E = \{e_1, e_2, \dots, e_m\}$ corresponding to the plurality of relationship data; and
defining the account contact map according to each entity v in the account set and each edge e in the relationship set.
Optionally, the step of clustering each account data based on the account contact map to obtain a plurality of user sets includes:
performing a graph partitioning operation on the account contact map based on a community partitioning algorithm to obtain a plurality of target graph communities; and
generating a user set from the user accounts in each target graph community to obtain the plurality of user sets.
Optionally, the step of performing a graph partitioning operation on the account contact graph based on the community partitioning algorithm to obtain a plurality of target graph communities includes:
initializing the account contact map to divide the entities of the account contact map into a plurality of initial graph communities;
performing the i-th partitioning operation: assigning each entity in each (i-1)-th partitioned graph community to a graph community adjacent to the entity to generate a plurality of i-th partitioned graph communities, where i is a positive integer; when i is 1, the (i-1)-th partitioned graph communities are the initial graph communities; when i is greater than 1, the (i-1)-th partitioned graph communities are the graph communities obtained by the (i-1)-th partitioning operation;
performing the i-th construction operation: constructing a plurality of i-th community networks based on the plurality of i-th partitioned graph communities, wherein each i-th partitioned graph community corresponds to one i-th community network;
determining whether the network structure of each i-th community network is the same as that of the corresponding (i-1)-th community network;
if the network structure of each i-th community network differs from that of the corresponding (i-1)-th community network, performing the (i+1)-th partitioning operation and the (i+1)-th construction operation; and
if the network structure of each i-th community network is the same as that of the corresponding (i-1)-th community network, skipping the (i+1)-th partitioning operation and the (i+1)-th construction operation and taking the i-th partitioned graph communities as the target graph communities.
Optionally, the step of assigning each entity in each (i-1)-th partitioned graph community to a graph community adjacent to the entity to generate a plurality of i-th partitioned graph communities includes:
calculating a first modularity of a target entity of each (i-1)-th partitioned graph community, wherein the first modularity is the modularity of the target entity before it is assigned to an adjacent graph community, the modularity characterizes the stability of the entity in the corresponding graph community, and the target entity is any entity in each (i-1)-th partitioned graph community;
calculating a second modularity of the target entity, wherein the second modularity is the modularity of the target entity after it is assigned to the adjacent graph community;
determining whether the first modularity of the target entity is less than the second modularity; and
if the first modularity of the target entity is not less than the second modularity, generating an i-th partitioned graph community based on the target entity.
Optionally, the step of performing a filtering operation on the plurality of historical texts to obtain a plurality of training texts for training the sensitive content quality inspection model includes:
clustering the plurality of historical texts through preset sensitive words to obtain a plurality of clustered text sets;
screening the plurality of clustering text sets according to the preset sensitive words to obtain target clusters; and
taking a plurality of texts in the target cluster as the plurality of training texts.
Optionally, the method further comprises: the plurality of training texts is uploaded to a blockchain.
To achieve the above object, an embodiment of the present invention further provides a training text collection system for a sensitive content quality inspection model, including:
the obtaining module is used for acquiring account data of a plurality of users and relationship data among the users to obtain a plurality of account data and a plurality of relationship data;
the construction module is used for constructing an account contact map according to the plurality of account data and the plurality of relation data;
the clustering module is used for clustering the account data based on the account contact map so as to obtain a plurality of user sets;
the selection module is used for selecting a sensitive account number set from the plurality of user sets, wherein the sensitive account number set comprises a plurality of sensitive users;
the collection module is used for collecting historical texts of each sensitive user in a preset time window so as to obtain a plurality of historical texts; and
the screening module is used for carrying out screening operation on the plurality of historical texts so as to obtain a plurality of training texts for training the sensitive content quality inspection model.
To achieve the above object, an embodiment of the present invention further provides a computer device, where the computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the computer program when executed by the processor implements the steps of the training text collection method for a sensitive content quality inspection model as described above.
To achieve the above object, an embodiment of the present invention further provides a computer readable storage medium having stored therein a computer program executable by at least one processor to cause the at least one processor to perform the steps of the training text acquisition method for a sensitive content quality inspection model as described above.
According to the training text acquisition method, system, computer equipment and computer readable storage medium for the sensitive content quality inspection model, provided by the embodiment of the invention, the account contact map is constructed, and a plurality of training texts are acquired from the historical texts of the sensitive accounts based on the selected sensitive account set, so that the manual screening links are reduced, the acquisition cost and the acquisition difficulty of the training texts are reduced, and the accuracy rate and the acquisition efficiency of the training texts are improved.
Drawings
FIG. 1 is a flow chart of a training text collection method for a sensitive content quality inspection model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the program modules of a training text acquisition system for a sensitive content quality inspection model according to a second embodiment of the present invention;
FIG. 3 is a schematic diagram of the hardware structure of a computer device according to a third embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that terms such as "first" and "second" in this disclosure are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In addition, the technical solutions of the embodiments may be combined with each other, provided that the combination can be realized by those skilled in the art; when technical solutions contradict each other or cannot be realized, the combination should be considered not to exist and falls outside the scope of protection claimed by the present invention.
Example 1
Referring to FIG. 1, a flowchart of the steps of a training text acquisition method for a sensitive content quality inspection model according to an embodiment of the present invention is shown. It will be appreciated that the flowchart in this method embodiment is not intended to limit the order in which the steps are performed. The training text acquisition system for the sensitive content quality inspection model in this embodiment may run on the computer device 2, and the following description takes the computer device 2 as the execution subject by way of example. The details are as follows.
Step S100, acquiring account data of a plurality of users and relationship data among the users to obtain the account data and the relationship data.
The account data may be the user ID of each user, and the relationship data may be data recording the association relationships between users, where an association relationship may indicate whether two user accounts are friends, whether they share a device, whether they share an IP address, whether they share a non-public WiFi network, and the like.
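As a minimal sketch of how such relationship data might be derived, the following Python example links two accounts whenever they share the same device, IP address, or WiFi identifier. The record layout `(user_id, attribute_kind, attribute_value)` and the function name `derive_relationships` are assumptions made for illustration and are not specified in this disclosure.

```python
from collections import defaultdict
from itertools import combinations

def derive_relationships(records):
    """Return sorted (account_a, account_b, relation) triples.

    records: iterable of (user_id, attribute_kind, attribute_value),
    e.g. ("u1", "ip", "10.0.0.5"). Two accounts are related when they
    share the same value for the same kind of attribute (hypothetical rule).
    """
    owners = defaultdict(set)
    for user_id, kind, value in records:
        owners[(kind, value)].add(user_id)

    relations = set()
    for (kind, _value), users in owners.items():
        # every pair of accounts behind one attribute value gets an edge
        for a, b in combinations(sorted(users), 2):
            relations.add((a, b, f"shared_{kind}"))
    return sorted(relations)

records = [
    ("u1", "ip", "10.0.0.5"),
    ("u2", "ip", "10.0.0.5"),
    ("u2", "device", "dev-A"),
    ("u3", "device", "dev-A"),
    ("u4", "ip", "10.0.0.9"),
]
edges = derive_relationships(records)
```

Here `u1` and `u2` are linked through a shared IP address and `u2` and `u3` through a shared device, while `u4` has no relationship to any other account.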
Step S102, an account contact map is constructed according to the plurality of account data and the plurality of relation data.
The plurality of accounts corresponding to the plurality of account data may be malicious accounts which have issued violation information.
In practice, some malicious users bypass the sensitive content recognition model or traditional security policies through low-cost means such as interspersed special symbols, homophone substitution, isolated words, visually similar character substitution, and radical splitting, so that the traditional sensitive content recognition model cannot intercept the content effectively and training texts for the model become harder to obtain. However, because their resources are limited, users of malicious accounts typically use a limited number of devices to publish a large volume of violating information within a limited geographic area, so malicious accounts tend to cluster heavily on shared IP addresses and devices. This embodiment can therefore construct the account contact map from the relationships among accounts and locate malicious accounts through it, so that malicious content is obtained from the information published by those accounts and training texts become easier to obtain. The account contact map may be a graph G = (V, E) composed of a plurality of triples defined by the account set V and the relationship set E.
In an exemplary embodiment, the step S102 may further include steps S200 to S204, where: step S200, defining each account data as an entity v to obtain an account set $V = \{v_1, v_2, \dots, v_n\}$ corresponding to the plurality of account data; step S202, defining each relationship data as an edge e to obtain a relationship set $E = \{e_1, e_2, \dots, e_m\}$ corresponding to the plurality of relationship data; and step S204, defining the account contact map according to each entity v in the account set and each edge e in the relationship set. In this embodiment, each account data is defined as an entity v, each relationship data is defined as an edge e, and the account contact map G = (V, E) is constructed from the entities and edges, so that the intrinsic relationships among the account data are recorded in the account contact map and malicious accounts can be screened more efficiently.
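The construction of G = (V, E) in steps S200 to S204 can be sketched as a weighted adjacency mapping. This is only an illustration: the function name `build_contact_graph`, the choice of a dictionary representation, and the rule that repeated relationships between the same pair add up to a heavier edge are assumptions, not details given in this disclosure.

```python
from collections import defaultdict

def build_contact_graph(accounts, relations):
    """Build G = (V, E) as {entity: {neighbor: weight}}.

    accounts: iterable of account IDs (the entities v).
    relations: iterable of (a, b) pairs (the edges e). Repeated pairs
    add up, so accounts linked by several relations get a heavier edge
    (an assumption made for this sketch).
    """
    adj = {v: defaultdict(float) for v in accounts}
    for a, b in relations:
        if a == b or a not in adj or b not in adj:
            continue  # skip self-loops and unknown accounts
        adj[a][b] += 1.0
        adj[b][a] += 1.0
    return {v: dict(nbrs) for v, nbrs in adj.items()}

V = ["u1", "u2", "u3", "u4"]
E = [("u1", "u2"), ("u1", "u2"), ("u2", "u3")]
G = build_contact_graph(V, E)
```

An account with no relationships, such as `u4` here, remains in the graph as an isolated entity with an empty neighbor mapping.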
Step S104, clustering the account data based on the account contact map to obtain a plurality of user sets.
In order to improve accuracy in locating malicious accounts, the embodiment can cluster each account data based on the account contact map so as to obtain a plurality of user sets.
In an exemplary embodiment, the step S104 may further include steps S300 to S302, where: step S300, performing a graph partitioning operation on the account contact map based on a community partitioning algorithm to obtain a plurality of target graph communities; and step S302, generating a user set from the user accounts in each target graph community to obtain the plurality of user sets. The community partitioning algorithm (e.g., the Fast Unfolding algorithm) is an iterative, modularity-based algorithm for partitioning a graph network. The Fast Unfolding algorithm repeatedly partitions the account contact map and strengthens the resulting community structure by increasing the modularity of the whole network; when the strength of the community structure no longer changes, the community structure has stabilized and a plurality of structurally stable target graph communities is obtained. By using an iterative algorithm, this embodiment improves the stability of the graph communities and, in turn, the accuracy of locating malicious accounts.
In an exemplary embodiment, the step S300 may further include steps S400 to S410, where: step S400, initializing the account contact map to divide the entities of the account contact map into a plurality of initial graph communities; step S402, performing the i-th partitioning operation: assigning each entity in each (i-1)-th partitioned graph community to a graph community adjacent to the entity to generate a plurality of i-th partitioned graph communities, where i is a positive integer, the (i-1)-th partitioned graph communities are the initial graph communities when i is 1, and are the graph communities obtained by the (i-1)-th partitioning operation when i is greater than 1; step S404, performing the i-th construction operation: constructing a plurality of i-th community networks based on the plurality of i-th partitioned graph communities, wherein each i-th partitioned graph community corresponds to one i-th community network; step S406, determining whether the network structure of each i-th community network is the same as that of the corresponding (i-1)-th community network; step S408, if the network structures differ, performing the (i+1)-th partitioning operation and the (i+1)-th construction operation; and step S410, if the network structures are the same, skipping the (i+1)-th partitioning operation and the (i+1)-th construction operation and taking the i-th partitioned graph communities as the target graph communities.
In this embodiment, performing the i-th partitioning operation and the i-th construction operation improves the accuracy of partitioning the account contact map and thereby further improves the stability of the target graph communities.
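The alternation of partitioning and construction operations in steps S400 to S410 resembles the Fast Unfolding algorithm, which is also widely known as the Louvain method. The following is a compact sketch under several assumptions: the function names are hypothetical, each entity starts in its own community, and "the network structure is the same" is approximated by the number of communities no longer shrinking. It is an illustration rather than the implementation used in this embodiment.

```python
from collections import defaultdict

def local_moving(adj):
    """Partitioning operation: greedily move each entity to the adjacent
    community with the best modularity gain. Returns {entity: community}."""
    two_m = sum(sum(nbrs.values()) for nbrs in adj.values())
    comm = {u: u for u in adj}              # one initial community per entity
    if two_m == 0:
        return comm
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    tot = dict(degree)                      # total degree of each community
    moved = True
    while moved:
        moved = False
        for u in sorted(adj):
            old = comm[u]
            tot[old] -= degree[u]           # take u out of its community
            links = defaultdict(float)      # weight from u to each adjacent community
            for v, w in adj[u].items():
                if v != u:
                    links[comm[v]] += w
            best, best_score = old, links[old] - tot[old] * degree[u] / two_m
            for c in sorted(links):
                score = links[c] - tot[c] * degree[u] / two_m
                if score > best_score + 1e-12:
                    best, best_score = c, score
            comm[u] = best
            tot[best] += degree[u]
            moved = moved or best != old
    return comm

def build_community_network(adj, comm):
    """Construction operation: collapse each community into a single node."""
    net = defaultdict(lambda: defaultdict(float))
    for u, nbrs in adj.items():
        net[comm[u]]                        # keep communities without edges
        for v, w in nbrs.items():
            net[comm[u]][comm[v]] += w
    return {c: dict(nbrs) for c, nbrs in net.items()}

def partition_graph(adj):
    """Repeat partitioning and construction until the network stops changing."""
    membership = {u: u for u in adj}        # original entity -> community
    network = adj
    while True:
        comm = local_moving(network)
        new_network = build_community_network(network, comm)
        if len(new_network) == len(network):    # nothing merged: stop
            break
        membership = {u: comm[c] for u, c in membership.items()}
        network = new_network
    groups = defaultdict(set)
    for u, c in membership.items():
        groups[c].add(u)
    return sorted(sorted(g) for g in groups.values())

# Two triangles joined by a single edge between entities 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = defaultdict(dict)
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0
communities = partition_graph(dict(adj))
```

On this small graph the sketch separates the two triangles into two communities, each of which would become one user set in step S302.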
In an exemplary embodiment, the step S402 may further include steps S500 to S506, where: step S500, calculating a first modularity of a target entity of each (i-1)-th partitioned graph community, wherein the first modularity is the modularity of the target entity before it is assigned to an adjacent graph community, the modularity characterizes the stability of the entity in the corresponding graph community, and the target entity is any entity in each (i-1)-th partitioned graph community; step S502, calculating a second modularity of the target entity, wherein the second modularity is the modularity of the target entity after it is assigned to the adjacent graph community; step S504, determining whether the first modularity of the target entity is less than the second modularity; and step S506, if the first modularity of the target entity is not less than the second modularity, generating an i-th partitioned graph community based on the target entity. The modularity (also called the modularity metric) measures the structural strength of the network communities and is calculated as follows:
$$Q = \frac{1}{2m} \sum_{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)$$
where $A_{ij}$ is the weight of the edge between entities $i$ and $j$, $k_i = \sum_j A_{ij}$ is the sum of the weights of all edges connected to $i$, $m = \frac{1}{2} \sum_{i,j} A_{ij}$ is the total edge weight of the graph, $c_i$ is the community to which entity $i$ belongs, and $\delta(\cdot)$ is an indicator function that equals 1 if the two entities belong to the same community and 0 otherwise.
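To make the modularity concrete, the sketch below evaluates it directly from the standard definition, in which each pair of entities in the same community contributes the edge weight between them minus the product of their weighted degrees divided by twice the total edge weight. The function name `modularity` and the adjacency-mapping representation are assumptions made for this example.

```python
def modularity(adj, community):
    """Modularity Q of a partition.

    adj: {entity: {neighbor: weight}} for an undirected weighted graph.
    community: {entity: community label}.
    """
    two_m = sum(sum(nbrs.values()) for nbrs in adj.values())   # 2m
    degree = {u: sum(nbrs.values()) for u, nbrs in adj.items()}
    q = 0.0
    for i in adj:
        for j in adj:
            if community[i] == community[j]:    # indicator function equals 1
                a_ij = adj[i].get(j, 0.0)
                q += a_ij - degree[i] * degree[j] / two_m
    return q / two_m

# Two triangles joined by a single edge between entities 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {u: {} for u in range(6)}
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0

split = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
merged = {u: "A" for u in range(6)}
q_split = modularity(adj, split)     # equals 5/14
q_merged = modularity(adj, merged)   # equals 0
```

Splitting the graph into its two triangles gives a modularity of 5/14, whereas placing every entity in one community gives 0, which is why a modularity-based partition prefers the split.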
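In some embodiments, the difference between the first modularity and the second modularity may be expressed as a modularity gain, and whether the target entity generates an i-th partitioned graph community is determined by checking whether the modularity gain is positive. The modularity gain is calculated as follows:
$$\Delta Q = \left[ \frac{\Sigma_{in} + 2k_{i,in}}{2m} - \left( \frac{\Sigma_{tot} + k_i}{2m} \right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left( \frac{\Sigma_{tot}}{2m} \right)^2 - \left( \frac{k_i}{2m} \right)^2 \right]$$
where $\Sigma_{in}$ is the sum of the weights within community $C$ (each internal edge counted once in each direction), $\Sigma_{tot}$ is the sum of the weights of the edges connected to the entities in community $C$, including edges inside the community and edges leading outside it, $k_{i,in}$ is the sum of the weights of the edges between entity $i$ and the entities in $C$, $k_i$ is the sum of the weights of all edges connected to $i$, and $m$ is the total edge weight of the graph.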
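The gain can be illustrated with a short sketch that evaluates the change in modularity when an isolated entity joins a community. The function name `modularity_gain` is hypothetical, and the convention that the weight within a community counts each internal edge once in each direction is an assumption chosen so that the gain equals the exact difference in modularity.

```python
def modularity_gain(adj, members, node):
    """Gain in modularity when the isolated entity `node` joins community C.

    adj: {entity: {neighbor: weight}} for an undirected weighted graph.
    members: set of entities already in C (must not contain `node`).
    """
    two_m = sum(sum(nbrs.values()) for nbrs in adj.values())            # 2m
    k_i = sum(adj[node].values())                                       # degree of node
    k_i_in = sum(w for nbr, w in adj[node].items() if nbr in members)   # node -> C
    # weight inside C, each internal edge counted in both directions
    sigma_in = sum(w for u in members for v, w in adj[u].items() if v in members)
    # weight of all edges attached to entities of C
    sigma_tot = sum(sum(adj[u].values()) for u in members)

    after = (sigma_in + 2 * k_i_in) / two_m - ((sigma_tot + k_i) / two_m) ** 2
    before = sigma_in / two_m - (sigma_tot / two_m) ** 2 - (k_i / two_m) ** 2
    return after - before

# Two triangles joined by a single edge between entities 2 and 3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = {u: {} for u in range(6)}
for a, b in edges:
    adj[a][b] = adj[b][a] = 1.0

gain_other_triangle = modularity_gain(adj, {0, 1, 2}, 3)   # equals -1/14
gain_own_triangle = modularity_gain(adj, {4, 5}, 3)        # equals 8/49
```

For entity 3, joining the far triangle lowers the modularity while joining entities 4 and 5 raises it, so a gain-based rule places entity 3 with its own triangle.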
Step S106, selecting a sensitive account number set from the plurality of user sets, wherein the sensitive account number set comprises a plurality of sensitive users.
After the plurality of user sets is obtained, one or more sensitive account sets can be screened out based on a preset blacklist. In this embodiment, the features of each blacklisted group, such as the distribution of age and gender and IP address information, may be collected in advance; each group is then labeled with the features found for blacklisted groups, and the sensitive account set is identified through classification by a supervised model.
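A simple stand-in for the blacklist-based screening is sketched below: a user set is treated as sensitive when the share of its accounts that appear on the blacklist reaches a threshold. The threshold rule, the parameter `min_ratio`, and the function name `select_sensitive_sets` are assumptions for illustration; the supervised model mentioned above is not reproduced here.

```python
def select_sensitive_sets(user_sets, blacklist, min_ratio=0.5):
    """Return the user sets whose share of blacklisted accounts is at
    least `min_ratio` (a hypothetical rule used only for this sketch)."""
    sensitive = []
    for users in user_sets:
        if not users:
            continue
        hits = len(set(users) & set(blacklist))
        if hits / len(users) >= min_ratio:
            sensitive.append(sorted(users))
    return sensitive

user_sets = [{"u1", "u2", "u3"}, {"u4", "u5"}, {"u6"}]
blacklist = {"u1", "u2", "u6"}
sensitive_sets = select_sensitive_sets(user_sets, blacklist)
```

In this example the first and third user sets are selected, so `u3` is treated as a sensitive user because of the community it belongs to even though it is not on the blacklist itself.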
Step S108, collecting historical texts of each sensitive user in a preset time window to obtain a plurality of historical texts.
Since the message content of the sensitive users in a certain time slice (a preset time window) generally has similarity, the embodiment can perform text clustering on text information sent by the sensitive accounts corresponding to all the sensitive users in the certain time slice, and screen out a plurality of historical texts, wherein the historical texts are texts sent by each sensitive user in the time slice.
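Collecting the historical texts for a preset time window can be sketched as a simple filter over message records. The record layout `(user_id, sent_at, text)`, the function name `collect_history`, and the 24-hour window in the example are assumptions; the disclosure does not fix the window length.

```python
from datetime import datetime, timedelta

def collect_history(messages, sensitive_users, window_end, window):
    """Return (user, text) pairs sent by sensitive users inside the window.

    messages: iterable of (user_id, sent_at, text) with datetime `sent_at`.
    The window covers [window_end - window, window_end].
    """
    window_start = window_end - window
    return [
        (user, text)
        for user, sent_at, text in messages
        if user in sensitive_users and window_start <= sent_at <= window_end
    ]

messages = [
    ("u1", datetime(2021, 6, 1, 9, 0), "msg a"),
    ("u1", datetime(2021, 5, 20, 9, 0), "msg b"),   # outside the window
    ("u9", datetime(2021, 6, 1, 10, 0), "msg c"),   # not a sensitive user
    ("u2", datetime(2021, 6, 1, 11, 0), "msg d"),
]
history = collect_history(
    messages, {"u1", "u2"}, datetime(2021, 6, 1, 12, 0), timedelta(hours=24)
)
```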
Step S110, screening the plurality of historical texts to obtain a plurality of training texts for training the sensitive content quality inspection model.
In order to further improve the effectiveness of the training texts, the embodiment may further perform a screening operation on the plurality of historical texts, so as to obtain a plurality of training texts for training the sensitive content quality inspection model from the plurality of historical texts.
In an exemplary embodiment, the step S110 may further include steps S600 to S604, where: step S600, clustering the plurality of historical texts through preset sensitive words to obtain a plurality of clustered text sets; step S602, screening the plurality of clustered text sets according to the preset sensitive words to obtain target clusters; and step S604, taking a plurality of texts in the target clusters as the plurality of training texts. The clustering may use, for example, the DBSCAN algorithm, a density-based clustering algorithm that groups regions of sufficiently high density into clusters, can identify clusters of arbitrary shape in a noisy spatial database, and is highly robust. The DBSCAN algorithm classifies data points into three categories: core points, which have more than MinPts points within radius Eps; boundary points, which have fewer than MinPts points within radius Eps but fall within the neighborhood of a core point; and noise points, which are neither core points nor boundary points. The training texts screened out in this embodiment have the following advantages: they are more representative and can be traced back to similar malicious accounts; and they are easier to subdivide later, because the clustering already separates different types of sensitive content to some extent, which makes subsequent processing of the texts more convenient.
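Steps S600 to S604 can be sketched with a small pure-Python DBSCAN. Several choices here are assumptions rather than details from this disclosure: texts are compared by the Jaccard distance between their character bigrams, a point counts as a core point when at least `min_pts` texts (including itself) lie within `eps`, and a cluster becomes a target cluster when any of its texts contains a preset sensitive word. All function names are hypothetical.

```python
def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def jaccard_distance(a, b):
    sa, sb = bigrams(a), bigrams(b)
    union = sa | sb
    return 1.0 - len(sa & sb) / len(union) if union else 0.0

def dbscan(texts, eps, min_pts):
    """Return one label per text: a cluster id 0, 1, ... or -1 for noise."""
    n = len(texts)
    neighbors = [
        [j for j in range(n) if jaccard_distance(texts[i], texts[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1                      # provisional noise point
            continue
        cluster += 1
        labels[i] = cluster                     # i is a core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster             # noise becomes a boundary point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:    # j is a core point: keep expanding
                queue.extend(neighbors[j])
    return labels

def select_training_texts(texts, sensitive_words, eps, min_pts):
    """Cluster the texts, then keep clusters that hit a sensitive word."""
    clusters = {}
    for text, label in zip(texts, dbscan(texts, eps, min_pts)):
        if label != -1:
            clusters.setdefault(label, []).append(text)
    return [
        text
        for members in clusters.values()
        if any(word in m for m in members for word in sensitive_words)
        for text in members
    ]

texts = [
    "win big prize now click link",
    "win b1g prize now click link",
    "win big pr1ze now click link",
    "lunch at noon tomorrow?",
    "see you at the meeting",
]
training_texts = select_training_texts(texts, ["big prize"], eps=0.5, min_pts=2)
```

In this toy example the two obfuscated variants do not contain the sensitive phrase themselves, but they fall into the same cluster as the plain variant and are therefore kept as training texts, while the two unrelated messages are treated as noise.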
After the plurality of training texts is obtained, this embodiment can train the sensitive content recognition model on them so as to optimize the model. In this embodiment, the account contact map is constructed and, based on the selected sensitive account set, a plurality of training texts is obtained from the historical texts of the sensitive accounts, which reduces manual screening, lowers the cost and difficulty of acquiring training texts, and improves both their accuracy and the efficiency of acquiring them.
Illustratively, the training text collection method for the sensitive content quality inspection model further comprises: the plurality of training texts is uploaded to a blockchain.
Illustratively, uploading the plurality of training texts to the blockchain can ensure their security as well as fairness and transparency. The blockchain referred to in this example is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains a batch of network transaction information used to verify the validity of that information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
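The idea that each data block is cryptographically linked to the previous one can be illustrated with a local hash chain. This sketch is not a client for any particular blockchain platform; the block layout, the all-zero genesis hash, and the function names `make_block` and `verify_chain` are assumptions made for illustration.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # hypothetical marker for the start of the chain

def make_block(prev_hash, texts):
    """Create a block whose hash covers its texts and the previous hash."""
    payload = json.dumps(
        {"prev": prev_hash, "texts": texts}, sort_keys=True, ensure_ascii=False
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"prev": prev_hash, "texts": texts, "hash": digest}

def verify_chain(chain):
    """Check that every block links to its predecessor and is unmodified."""
    prev = GENESIS_HASH
    for block in chain:
        expected = make_block(block["prev"], block["texts"])["hash"]
        if block["prev"] != prev or block["hash"] != expected:
            return False
        prev = block["hash"]
    return True

chain = []
prev = GENESIS_HASH
for batch in (["training text a", "training text b"], ["training text c"]):
    block = make_block(prev, batch)
    chain.append(block)
    prev = block["hash"]

chain_is_valid = verify_chain(chain)
```

Changing any stored training text alters the recomputed hash of its block, so `verify_chain` reports the tampering.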
Embodiment Two
Fig. 2 is a schematic program module diagram of a training text acquisition system for a sensitive content quality inspection model according to a second embodiment of the present invention. The training text acquisition system 20 for a sensitive content quality inspection model may include, or be partitioned into, one or more program modules that are stored in a storage medium and executed by one or more processors to carry out the present invention and implement the training text acquisition method for a sensitive content quality inspection model described above. A program module in the embodiments of the present invention refers to a series of computer program instruction segments capable of performing a particular function, and is better suited than the program itself to describing how the training text acquisition system 20 for a sensitive content quality inspection model executes in the storage medium. The functions of each program module of the present embodiment are described below:
The obtaining module 200 is configured to obtain account data of a plurality of users and relationship data between the users, so as to obtain a plurality of account data and a plurality of relationship data.
The construction module 202 is configured to construct an account contact map according to the plurality of account data and the plurality of relationship data.
The clustering module 204 is configured to cluster each account data based on the account contact map, so as to obtain a plurality of user sets.
The selecting module 206 is configured to select a sensitive account set from the plurality of user sets, where the sensitive account set includes a plurality of sensitive users.
The collection module 208 is configured to collect historical texts of each sensitive user in a preset time window, so as to obtain a plurality of historical texts.
The screening module 210 is configured to perform a screening operation on the plurality of historical texts to obtain a plurality of training texts for training the sensitive content quality inspection model.
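The collection step performed within a preset time window can be sketched as a simple filter. The sketch below is an assumption-laden illustration, not the module's actual code: `collect_history` is a hypothetical name, and posts are assumed to be available as `(user, timestamp, text)` tuples.

```python
from datetime import datetime, timedelta


def collect_history(posts, sensitive_users, window_end, window_days):
    """Return the texts posted by sensitive users inside the time window.

    posts: iterable of (user, timestamp, text) tuples (an assumed layout).
    The window covers [window_end - window_days, window_end], both ends included.
    """
    window_start = window_end - timedelta(days=window_days)
    return [
        text
        for user, timestamp, text in posts
        if user in sensitive_users and window_start <= timestamp <= window_end
    ]
```

For example, with `window_end` set to the current time and `window_days=7`, only the texts that the sensitive users posted during the last seven days are passed on to the screening step.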
Illustratively, the construction module 202 is further configured to: define each account data as an entity v to obtain an account set $V = \{v_1, v_2, \dots, v_n\}$ corresponding to the plurality of account data; define each relationship data as an edge e to obtain a relationship set $E = \{e_1, e_2, \dots, e_m\}$ corresponding to the plurality of relationship data; and define the account contact map according to each entity v in the account set and each edge e in the relationship set.
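A minimal sketch of building the account contact map from the account set and the relationship set is given below. The name `build_contact_map` is hypothetical, and the sketch assumes that relationships are undirected pairs of account identifiers; the source does not state whether edges are directed or weighted.

```python
def build_contact_map(accounts, relations):
    """Build an adjacency map from entities (accounts) and edges (relations).

    accounts: iterable of account identifiers (the entities v).
    relations: iterable of (account, account) pairs (the edges e), assumed undirected.
    Edges that mention an unknown account, and self-loops, are ignored.
    """
    adjacency = {v: set() for v in accounts}
    for u, v in relations:
        if u in adjacency and v in adjacency and u != v:
            adjacency[u].add(v)
            adjacency[v].add(u)
    return adjacency
```

An adjacency map of this kind is one convenient input format for a community partitioning algorithm, since the neighbors of each entity can be looked up directly.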
Illustratively, the clustering module 204 is further configured to: perform a map partitioning operation on the account contact map based on a community partitioning algorithm to obtain a plurality of target map communities; and generate a user set from the user accounts in each target map community, so as to obtain the plurality of user sets.
Illustratively, the clustering module 204 is further configured to: initialize the account contact map to divide the entities of the account contact map into a plurality of initial map communities; perform the ith partitioning operation: dividing each entity in each (i-1)th divided map community into a map community adjacent to the entity, so as to generate a plurality of ith divided map communities, where i is a positive integer, the (i-1)th divided map communities are the initial map communities when i is 1, and the (i-1)th divided map communities are the map communities obtained by the (i-1)th partitioning operation when i is greater than 1; perform the ith construction operation: constructing a plurality of ith constructed community networks based on the plurality of ith divided map communities, where each ith divided map community corresponds to one ith constructed community network; judge whether the network structure of each ith constructed community network is the same as that of the corresponding (i-1)th constructed community network; if the network structure of each ith constructed community network is different from that of the corresponding (i-1)th constructed community network, execute an (i+1)th partitioning operation and an (i+1)th construction operation; and if the network structure of each ith constructed community network is the same as that of the corresponding (i-1)th constructed community network, do not execute the (i+1)th partitioning operation or the (i+1)th construction operation, and take the ith divided map communities as the target map communities.
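The alternation of partitioning and construction operations until the community network stops changing can be sketched as follows. All names are hypothetical. The partitioning rule is passed in as a function, and `merge_into_smallest_neighbour` is a deliberately simple stand-in used only to make the loop runnable; it is not the rule of this embodiment, which would compare modularity values instead.

```python
from collections import Counter


def build_community_network(adjacency, community_of):
    """Construction operation: collapse every community into a single node.

    Edge weights count the links between two different communities.
    Community labels are assumed to be mutually comparable (e.g. all strings).
    """
    edges = Counter()
    for v, neighbours in adjacency.items():
        for u in neighbours:
            a, b = community_of[v], community_of[u]
            if a < b:  # each undirected inter-community link is counted once
                edges[(a, b)] += 1
    return {"nodes": set(community_of.values()), "edges": dict(edges)}


def merge_into_smallest_neighbour(adjacency, community_of):
    """Stand-in partitioning step: adopt the smallest adjacent community label."""
    return {
        v: min([community_of[v]] + [community_of[u] for u in adjacency[v]])
        for v in adjacency
    }


def partition_until_stable(adjacency, partition_once):
    """Repeat partitioning and construction until the community network is unchanged.

    Termination depends on the supplied partition_once rule settling down.
    """
    community_of = {v: v for v in adjacency}  # initial communities: one per entity
    previous = build_community_network(adjacency, community_of)
    while True:
        community_of = partition_once(adjacency, community_of)
        current = build_community_network(adjacency, community_of)
        if current == previous:
            return community_of  # same structure as last round: target communities
        previous = current
```

With the stand-in rule, the loop simply converges to the connected components of the map; replacing `partition_once` with a modularity-based rule changes which communities are found but leaves the stopping test the same.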
Illustratively, the clustering module 204 is further configured to: calculate a first modularity of a target entity of each (i-1)th divided map community, where the first modularity is the modularity of the target entity before it is divided into an adjacent map community, the modularity represents the stability of an entity within its map community, and the target entity is any entity in each (i-1)th divided map community; calculate a second modularity of the target entity, where the second modularity is the modularity of the target entity after it is divided into the adjacent map community; judge whether the first modularity of the target entity is smaller than the second modularity; and if the first modularity of the target entity is not smaller than the second modularity, generate an ith divided map community based on the target entity.
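The comparison of the modularity before and after a move can be sketched as below. This is a hedged illustration with hypothetical names: it uses Newman's whole-partition modularity on an undirected, unweighted map as a stand-in for the per-entity measure described above, and it assumes one reading of the rule, namely that the entity is moved only when the second modularity exceeds the first and otherwise stays in its current community.

```python
def modularity(adjacency, community_of):
    """Newman modularity Q of a partition of an undirected, unweighted graph."""
    two_m = sum(len(nb) for nb in adjacency.values())  # twice the number of edges
    if two_m == 0:
        return 0.0
    internal = {}  # community -> count of edge endpoints that stay inside it
    degree = {}    # community -> sum of the degrees of its entities
    for v, neighbours in adjacency.items():
        c = community_of[v]
        degree[c] = degree.get(c, 0) + len(neighbours)
        internal[c] = internal.get(c, 0) + sum(
            1 for u in neighbours if community_of[u] == c
        )
    return sum(internal[c] / two_m - (degree[c] / two_m) ** 2 for c in degree)


def try_move(adjacency, community_of, entity):
    """Move entity to the adjacent community with the highest modularity, if any improves.

    Returns (updated partition, first modularity, best modularity reached).
    """
    first = modularity(adjacency, community_of)  # modularity before any move
    best_q, best_c = first, community_of[entity]
    adjacent = {community_of[u] for u in adjacency[entity]} - {community_of[entity]}
    for c in sorted(adjacent):
        trial = dict(community_of)
        trial[entity] = c
        second = modularity(adjacency, trial)  # modularity after the trial move
        if second > best_q:
            best_q, best_c = second, c
    updated = dict(community_of)
    updated[entity] = best_c
    return updated, first, best_q
```

Recomputing the whole-partition modularity for every trial move keeps the sketch short; practical implementations usually compute only the change in modularity caused by the move.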
Illustratively, the screening module 210 is further configured to: cluster the plurality of historical texts by preset sensitive words to obtain a plurality of clustered text sets; screen the plurality of clustered text sets according to the preset sensitive words to obtain target clusters; and take a plurality of texts in the target clusters as the training texts.
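The screening of clustered text sets can be sketched as follows. The source does not specify the screening criterion, so the sketch assumes one: a cluster is kept as a target cluster when the share of its texts containing at least one preset sensitive word reaches a threshold. The name `select_target_texts` and the parameter `min_share` are hypothetical.

```python
def select_target_texts(clusters, sensitive_words, min_share=0.5):
    """Keep the texts of clusters in which enough texts contain a sensitive word.

    clusters: list of clusters, each a list of texts (output of any clustering step).
    min_share: assumed threshold on the fraction of texts with a sensitive-word hit.
    """
    training_texts = []
    for cluster in clusters:
        if not cluster:
            continue  # an empty cluster cannot be a target cluster
        hits = sum(
            1 for text in cluster if any(word in text for word in sensitive_words)
        )
        if hits / len(cluster) >= min_share:
            training_texts.extend(cluster)
    return training_texts
```

Keeping whole clusters rather than only the matching texts means that texts without an exact sensitive-word hit, but clustered together with texts that have one, also enter the training set.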
Illustratively, the training text acquisition system 20 for the sensitive content quality inspection model further includes an upload module for uploading the plurality of training texts into a blockchain.
Embodiment Three
Referring to fig. 3, a hardware architecture diagram of a computer device according to a third embodiment of the present invention is shown. In the present embodiment, the computer device 2 is a device capable of automatically performing numerical calculation and/or information processing in accordance with instructions set or stored in advance. The computer device 2 may be a rack server, a blade server, or a tower server (including a stand-alone server, or a server cluster made up of multiple servers), or the like. As shown in fig. 3, the computer device 2 includes, but is not limited to, a memory 21, a processor 22, a network interface 23, and a training text acquisition system 20 for a sensitive content quality inspection model, which are communicatively coupled to each other via a system bus.
In this embodiment, the memory 21 includes at least one type of computer-readable storage medium including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 21 may be an internal storage unit of the computer device 2, such as a hard disk or a memory of the computer device 2. In other embodiments, the memory 21 may also be an external storage device of the computer device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 2. Of course, the memory 21 may also include both internal storage units of the computer device 2 and external storage devices. In this embodiment, the memory 21 is typically used to store an operating system and various application software installed on the computer device 2, such as program code of the training text acquisition system 20 for the sensitive content quality inspection model of the second embodiment. Further, the memory 21 may be used to temporarily store various types of data that have been output or are to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 22 is typically used to control the overall operation of the computer device 2. In this embodiment, the processor 22 is configured to run the program code or process the data stored in the memory 21, for example, to run the training text acquisition system 20 for the sensitive content quality inspection model, so as to implement the training text acquisition method for the sensitive content quality inspection model of the first embodiment.
The network interface 23 may comprise a wireless network interface or a wired network interface, and is typically used for establishing a communication connection between the computer device 2 and other electronic devices. For example, the network interface 23 is used to connect the computer device 2 to an external terminal through a network, and to establish a data transmission channel and a communication connection between the computer device 2 and the external terminal. The network may be an intranet, the Internet, a Global System for Mobile Communications (GSM) network, a Wideband Code Division Multiple Access (WCDMA) network, a 4G network, a 5G network, Bluetooth, Wi-Fi, or another wireless or wired network.
It is noted that fig. 3 only shows a computer device 2 having components 20-23, but it is understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented.
In this embodiment, the training text acquisition system 20 for the sensitive content quality inspection model stored in the memory 21 may also be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (the processor 22 in this embodiment) to complete the present invention.
For example, fig. 2 shows a schematic program module diagram of the training text acquisition system 20 for a sensitive content quality inspection model according to the second embodiment of the present invention, where the training text acquisition system 20 for a sensitive content quality inspection model may be divided into an obtaining module 200, a construction module 202, a clustering module 204, a selecting module 206, a collection module 208, and a screening module 210. A program module in the present invention refers to a series of computer program instruction segments capable of performing a specific function, and is better suited than a program to describing how the training text acquisition system 20 for a sensitive content quality inspection model executes in the computer device 2. The specific functions of the program modules 200-210 are described in detail in the second embodiment and are not repeated here.
Embodiment Four
The present embodiment also provides a computer-readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored that performs the corresponding functions when executed by a processor. The computer-readable storage medium of the present embodiment is used to store the training text acquisition system 20 for the sensitive content quality inspection model, which, when executed by a processor, implements the training text acquisition method for the sensitive content quality inspection model of the first embodiment.
The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the scope of the invention. Any equivalent structure or equivalent process derived from the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims (8)

1. A training text collection method for a sensitive content quality inspection model, the method comprising:
acquiring account data of a plurality of users and relationship data among the users to obtain the account data and the relationship data;
constructing an account contact map according to the plurality of account data and the plurality of relationship data;
clustering the account data based on the account contact map to obtain a plurality of user sets;
selecting a sensitive account number set from the plurality of user sets, wherein the sensitive account number set comprises a plurality of sensitive users;
collecting historical texts of each sensitive user in a preset time window to obtain a plurality of historical texts; and
screening the plurality of historical texts to obtain a plurality of training texts for training the sensitive content quality inspection model;
the step of clustering the account data based on the account contact map to obtain a plurality of user sets includes:
performing map partitioning operation on the account contact map based on a community partitioning algorithm to obtain a plurality of target map communities; and
generating a user set according to the user accounts in each target map community to obtain the plurality of user sets;
the step of performing map partitioning operation on the account contact map based on the community partitioning algorithm to obtain a plurality of target map communities comprises the following steps:
initializing the account contact map to divide each entity of the account contact map into a plurality of initial map communities;
performing the ith partitioning operation: dividing each entity in each (i-1)th divided map community into a map community adjacent to the entity to generate a plurality of ith divided map communities; wherein i is a positive integer; when i is 1, the (i-1)th divided map communities are the initial map communities; and when i is greater than 1, the (i-1)th divided map communities are the map communities obtained by the (i-1)th partitioning operation;
performing the ith construction operation: constructing a plurality of ith constructed community networks based on the plurality of ith divided map communities, wherein each ith divided map community corresponds to one ith constructed community network;
judging whether the network structure of each ith constructed community network is the same as that of the corresponding (i-1)th constructed community network;
if the network structure of each ith constructed community network is different from that of the corresponding (i-1)th constructed community network, executing an (i+1)th partitioning operation and an (i+1)th construction operation; and
if the network structure of each ith constructed community network is the same as that of the corresponding (i-1)th constructed community network, not executing the (i+1)th partitioning operation or the (i+1)th construction operation, and taking the ith divided map communities as the target map communities.
2. The training text collection method for a sensitive content quality inspection model according to claim 1, wherein the step of constructing an account contact map from the plurality of account data and the plurality of relationship data comprises:
defining each account data as an entity v to obtain an account set $V = \{v_1, v_2, \dots, v_n\}$ corresponding to the plurality of account data;
defining each relationship data as an edge e to obtain a relationship set $E = \{e_1, e_2, \dots, e_m\}$ corresponding to the plurality of relationship data; and
defining the account contact map according to each entity v in the account set and each edge e in the relationship set.
3. The training text collection method for a sensitive content quality inspection model according to claim 1, wherein the step of dividing each entity in each (i-1)th divided map community into a map community adjacent to the entity to generate a plurality of ith divided map communities comprises:
calculating a first modularity of a target entity of each (i-1)th divided map community, wherein the first modularity is the modularity of the target entity before being divided into an adjacent map community, the modularity is used for representing the stability of the entity in the corresponding map community, and the target entity is any entity in each (i-1)th divided map community;
calculating a second modularity of the target entity, wherein the second modularity is the modularity of the target entity after being divided into the adjacent map community;
judging whether the first modularity of the target entity is smaller than the second modularity; and
if the first modularity of the target entity is not smaller than the second modularity, generating an ith divided map community based on the target entity.
4. A training text collection method for a sensitive content quality inspection model according to any one of claims 1 to 3, wherein the step of performing a screening operation on the plurality of historical texts to obtain a plurality of training texts for training the sensitive content quality inspection model comprises:
clustering the plurality of historical texts through preset sensitive words to obtain a plurality of clustered text sets;
screening the plurality of clustered text sets according to the preset sensitive words to obtain target clusters; and
taking a plurality of texts in the target clusters as the plurality of training texts.
5. A training text acquisition method for a sensitive content quality inspection model according to any one of claims 1 to 3, further comprising: uploading the plurality of training texts to a blockchain.
6. A training text acquisition system for a sensitive content quality inspection model, comprising:
the obtaining module is used for obtaining account data of a plurality of users and relationship data among the users so as to obtain a plurality of account data and a plurality of relationship data;
the construction module is used for constructing an account contact map according to the plurality of account data and the plurality of relationship data;
the clustering module is used for clustering the account data based on the account contact map so as to obtain a plurality of user sets; the clustering module is further configured to: perform map partitioning operation on the account contact map based on a community partitioning algorithm to obtain a plurality of target map communities; and generate a user set according to the user accounts in each target map community to obtain the plurality of user sets; the clustering module is further configured to: initialize the account contact map to divide each entity of the account contact map into a plurality of initial map communities; perform the ith partitioning operation: dividing each entity in each (i-1)th divided map community into a map community adjacent to the entity to generate a plurality of ith divided map communities, wherein i is a positive integer, the (i-1)th divided map communities are the initial map communities when i is 1, and the (i-1)th divided map communities are the map communities obtained by the (i-1)th partitioning operation when i is greater than 1; and perform the ith construction operation: constructing a plurality of ith constructed community networks based on the plurality of ith divided map communities, wherein each ith divided map community corresponds to one ith constructed community network;
the selection module is used for selecting a sensitive account set from the plurality of user sets, wherein the sensitive account set comprises a plurality of sensitive users;
the collection module is used for collecting historical texts of each sensitive user in a preset time window so as to obtain a plurality of historical texts; and
the screening module is used for carrying out screening operation on the plurality of historical texts so as to obtain a plurality of training texts for training the sensitive content quality inspection model.
7. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the training text acquisition method for a sensitive content quality inspection model according to any one of claims 1 to 5.
8. A computer-readable storage medium, in which a computer program is stored, the computer program being executable by at least one processor to cause the at least one processor to perform the steps of the training text acquisition method for a sensitive content quality inspection model as claimed in any one of claims 1 to 5.
CN202110691208.2A 2021-06-22 2021-06-22 Training text acquisition method, system and equipment for sensitive content quality inspection model Active CN113420148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110691208.2A CN113420148B (en) 2021-06-22 2021-06-22 Training text acquisition method, system and equipment for sensitive content quality inspection model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110691208.2A CN113420148B (en) 2021-06-22 2021-06-22 Training text acquisition method, system and equipment for sensitive content quality inspection model

Publications (2)

Publication Number Publication Date
CN113420148A CN113420148A (en) 2021-09-21
CN113420148B true CN113420148B (en) 2024-02-09

Family

ID=77789814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110691208.2A Active CN113420148B (en) 2021-06-22 2021-06-22 Training text acquisition method, system and equipment for sensitive content quality inspection model

Country Status (1)

Country Link
CN (1) CN113420148B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110336838A (en) * 2019-08-07 2019-10-15 腾讯科技(武汉)有限公司 Account method for detecting abnormality, device, terminal and storage medium
CN111651741A (en) * 2020-06-05 2020-09-11 腾讯科技(深圳)有限公司 User identity recognition method and device, computer equipment and storage medium
CN111738628A (en) * 2020-08-14 2020-10-02 支付宝(杭州)信息技术有限公司 Risk group identification method and device
CN111767472A (en) * 2020-07-08 2020-10-13 吉林大学 Method and system for detecting abnormal account of social network

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10984316B2 (en) * 2017-06-19 2021-04-20 International Business Machines Corporation Context aware sensitive information detection

Also Published As

Publication number Publication date
CN113420148A (en) 2021-09-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant