CN113704414A - Data processing method, system, storage medium and electronic equipment - Google Patents

Info

Publication number
CN113704414A
Authority
CN
China
Prior art keywords
preset
result
identification
content
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111027813.6A
Other languages
Chinese (zh)
Inventor
梁志勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202111027813.6A
Publication of CN113704414A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a data processing method, a data processing system, a storage medium and electronic equipment. Acquired content to be audited is preprocessed into content segments in a preset format, which are stored in a preset distributed message queue. The content segments in the queue are sent to a preset word bank retrieval model and a preset AI identification model for identification, yielding a first identification result and a second identification result; these results are audited, and the audit result is output. Preprocessing gives the content segments a uniform format, and the highly available preset distributed message queue delivers them quickly to the preset word bank retrieval model and the preset AI identification model, which facilitates fast retrieval by the word bank retrieval model and improves the efficiency of auditing content information. Intelligent extension and self-learning of the preset word bank retrieval model, combined with the preset AI identification model, form a dual identification mechanism that improves the recognition rate of content auditing.

Description

Data processing method, system, storage medium and electronic equipment
Technical Field
The present application relates to the field of data content auditing technologies, and in particular, to a data processing method, a data processing system, a storage medium, and an electronic device.
Background
In the interactive scenarios of an internet service platform, such as user chat, e-commerce reviews, posts and messages, content information is generated, including advertisements, news and the like. Internet content is subject to laws and regulations, and if content violates them, the website or app may be ordered to rectify the problem or be taken down. To comply with these regulations and maintain a healthy content ecosystem, the content information generated on the internet service platform needs to be audited in real time.
In the prior art, content is audited by combining machine review and human review: the machine performs a preliminary review and labels suspected content, and the audit result is issued after manual review. Machine review mainly filters the target text against a word bank; if the target text matches a keyword in the word bank, the target text is judged to be a violating text.
Because the word bank covers many major categories, such as advertisements and news, and each major category contains many subcategories, keyword matching against the word bank is slow. In addition, when keywords appear in variant forms, such as homophones and visually similar words, the recognition rate of keyword-based auditing is low.
Therefore, in the prior art, auditing content information is inefficient and has a low recognition rate.
Disclosure of Invention
In view of this, the present application discloses a data processing method, system, storage medium and electronic device, which aim to improve the efficiency and recognition rate of auditing content information.
In order to achieve the purpose, the technical scheme is as follows:
A first aspect of the present application discloses a data processing method, including:
acquiring content to be audited;
preprocessing the content to be audited to obtain a content fragment with a preset format, and storing the content fragment to a preset distributed message queue;
respectively sending the content segments in the preset distributed message queue to a preset word bank retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word bank retrieval model is used for identifying sensitive words and corresponding risk types thereof; the preset AI identification model is used for identifying violation types;
and auditing the first identification result and the second identification result to obtain and output an auditing result.
Preferably, the preprocessing the content to be audited to obtain a content segment in a preset format includes:
if preset characters exist in the content to be audited, removing the preset characters to obtain the content to be audited without the preset characters;
processing the content to be audited without the preset characters through a preset semantic algorithm to obtain an original content fragment;
and carrying out syntax conversion on the original content fragment to obtain a content fragment with a preset format.
Preferably, the sending the content segments in the preset distributed message queue to a preset word bank retrieval model and a preset AI identification model respectively for identification to obtain a first identification result and a second identification result includes:
identifying cluster nodes corresponding to the preset word bank retrieval model in an idle state and cluster nodes corresponding to the preset AI identification model in the idle state;
respectively sending the content segments in the preset distributed message queue to cluster nodes corresponding to the preset word bank retrieval model in the idle state and cluster nodes corresponding to the preset AI identification model in the idle state for identification to obtain a first identification result and a second identification result; the first identification result is obtained by identifying cluster nodes corresponding to the preset word bank retrieval model in the idle state; and the second identification result is obtained by identifying the cluster node corresponding to the preset AI identification model in the idle state.
Preferably, the auditing the first identification result and the second identification result to obtain and output an auditing result includes:
determining a first result type corresponding to the first recognition result and a second result type corresponding to the second recognition result;
judging a first result type corresponding to the first recognition result and a second result type corresponding to the second recognition result;
and/or if the first result type is a risky type and the second result type is a preset violation type, labeling the obtained audit result with a violation label, and outputting the audit result labeled with the violation label;
and/or if the first result type is the risky type and the second result type is a preset suspected violation type, labeling the obtained audit result with a violation label, and outputting the audit result labeled with the violation label;
and/or if the first result type is the risky type and the second result type is a preset compliance type, labeling the obtained audit result with a violation label, and outputting the audit result labeled with the violation label;
and/or if the first result type is a risk-free type and the second result type is the preset violation type, labeling the obtained audit result with a violation label, and outputting the audit result labeled with the violation label;
and/or if the first result type is the risk-free type and the second result type is the preset suspected violation type, labeling the obtained audit result with a suspected violation label, and outputting the audit result labeled with the suspected violation label;
and/or if the first result type is the risk-free type and the second result type is the preset compliance type, labeling the obtained audit result with a compliance label, and outputting the audit result labeled with the compliance label.
Preferably, the method further comprises the following steps:
and if the number of queues in the preset distributed message queue is greater than the preset number, performing dynamic capacity expansion operation on the preset distributed message queue.
A second aspect of the present application discloses a data processing system, the system comprising:
the acquisition unit is used for acquiring the content to be audited;
the preprocessing unit is used for preprocessing the content to be audited to obtain a content fragment with a preset format and storing the content fragment to a preset distributed message queue;
the identification unit is used for respectively sending the content segments in the preset distributed message queue to a preset word bank retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word bank retrieval model is used for identifying sensitive words and corresponding risk types thereof; the preset AI identification model is used for identifying violation types;
and the auditing unit is used for auditing the first identification result and the second identification result to obtain and output an auditing result.
Preferably, the preprocessing unit that preprocesses the content to be audited to obtain the content segment with the preset format includes:
the removing module is used for removing the preset characters in the content to be audited to obtain the content to be audited without the preset characters if the preset characters are monitored to exist in the content to be audited;
the computing module is used for computing the content to be audited without the preset characters through a preset semantic algorithm to obtain an original content fragment;
and the conversion module is used for carrying out syntax conversion on the original content fragment to obtain the content fragment with the preset format.
Preferably, the identification unit includes:
the identification module is used for identifying cluster nodes corresponding to the preset word bank retrieval model in the idle state and cluster nodes corresponding to the preset AI identification model in the idle state;
the sending module is used for respectively sending the content segments in the preset distributed message queue to the cluster nodes corresponding to the preset word bank retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state for identification to obtain a first identification result and a second identification result; the first identification result is obtained by identifying cluster nodes corresponding to the preset word bank retrieval model in the idle state; and the second identification result is obtained by identifying the cluster node corresponding to the preset AI identification model in the idle state.
A third aspect of the present application discloses a storage medium, which includes stored instructions, and when the instructions are executed, the storage medium controls a device in which the storage medium is located to execute the data processing method according to any one of the first aspect.
A fourth aspect of the present application discloses an electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the data processing method according to any one of the first aspect.
According to the above technical scheme, the application discloses a data processing method, a system, a storage medium and electronic equipment. Content to be audited is acquired and preprocessed into content segments in a preset format, and the content segments are stored in a preset distributed message queue. The content segments in the queue are sent to a preset word bank retrieval model and a preset AI identification model respectively for identification, yielding a first identification result and a second identification result. The preset word bank retrieval model is used for identifying sensitive words and their corresponding risk types, and the preset AI identification model is used for identifying violation types. The first identification result and the second identification result are audited, and the audit result is obtained and output. Preprocessing gives the content segments a uniform format, and the highly available preset distributed message queue delivers them quickly to the preset word bank retrieval model and the preset AI identification model, which facilitates fast retrieval by the word bank retrieval model and improves the efficiency of auditing content information. Intelligent extension and self-learning of the preset word bank retrieval model, combined with the preset AI identification model, form a dual identification mechanism that improves the recognition rate of content auditing.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a data processing method disclosed in an embodiment of the present application;
Fig. 2 is a schematic flowchart of preprocessing the content to be audited to obtain a content segment in a preset format, disclosed in an embodiment of the present application;
Fig. 3 is a schematic flowchart of obtaining a first identification result and a second identification result, disclosed in an embodiment of the present application;
Fig. 4 is a block diagram of a data processing system disclosed in an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
As can be seen from the background art, the existing auditing of content information is inefficient and has a low recognition rate.
To solve the above problems, an embodiment of the present application discloses a data processing method, a system, a storage medium and an electronic device. The content to be audited is preprocessed into content segments in a uniform format, and a highly available preset distributed message queue delivers the content segments quickly to a preset word bank retrieval model and a preset AI identification model, which facilitates fast retrieval by the word bank retrieval model and improves the efficiency of auditing content information. Intelligent extension and self-learning of the preset word bank retrieval model, combined with the preset AI identification model, form a dual identification mechanism that improves the recognition rate of content auditing. The specific implementation is illustrated by the following embodiments.
Referring to Fig. 1, a schematic flowchart of a data processing method disclosed in an embodiment of the present application is shown. The data processing method mainly includes the following steps:
S101: Acquire the content to be audited.
In S101, the content to be audited is content information entered by users and collected from the internet service platform.
In the interactive scenarios of an internet service platform, such as user chat, e-commerce reviews, posts, messages and bullet comments, content information is generated. This is the content to be audited, and it includes advertisement content, news content, chat content and the like.
S102: Preprocess the content to be audited to obtain content segments in a preset format, and store the content segments in a preset distributed message queue.
In S102, the content to be audited is preprocessed into content segments in the preset (uniform) format, so that the content segments can subsequently be retrieved quickly.
Because the preset distributed message queue offers high performance and high availability, it is used to receive and cache the content segments in the preset format in real time.
Specifically, the process of preprocessing the content to be audited to obtain a content segment in the preset format is as follows:
First, if preset characters exist in the content to be audited, the preset characters are removed to obtain the content to be audited without the preset characters.
The preset characters include punctuation marks, special characters, web page tags, stop words and the like.
Then, the content to be audited without the preset characters is processed through a preset semantic algorithm to obtain original content segments.
That is, the preset semantic algorithm performs semantic word segmentation on the content to be audited without the preset characters to obtain the original content segments.
Processing the content through the preset semantic algorithm helps avoid false positives and mistaken interception of the content to be audited.
The preset semantic algorithm may be a natural semantic algorithm or another semantic algorithm; the specific algorithm is chosen by a technician according to actual conditions and is not specifically limited in the present application.
Finally, syntax conversion is performed on the original content segments to obtain the content segments in the preset format.
Specifically, the original content segments undergo Simplified/Traditional Chinese conversion and English upper/lower case conversion, and are uniformly converted into Simplified Chinese and English lowercase, which is the preset format.
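The three preprocessing steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the stop-word list, the two-entry Traditional-to-Simplified table and the whitespace tokenizer (standing in for the unspecified preset semantic algorithm) are all hypothetical placeholders.

```python
import re

# Hypothetical tables; a real system would load far larger resources.
STOP_WORDS = {"the", "a", "of"}
TRAD_TO_SIMP = {"廣": "广", "優": "优"}

HTML_TAG = re.compile(r"<[^>]+>")   # web page tags
NON_WORD = re.compile(r"[^\w\s]")   # punctuation and special characters


def remove_preset_characters(text):
    """Step 1: strip web page tags, punctuation and special characters."""
    text = HTML_TAG.sub(" ", text)
    return NON_WORD.sub(" ", text)


def segment(text):
    """Step 2: whitespace split as a stand-in for semantic word segmentation.

    Stop words are dropped here because they can only be recognized
    once the text has been split into tokens.
    """
    return [tok for tok in text.split() if tok.lower() not in STOP_WORDS]


def normalize(fragment):
    """Step 3: convert to Simplified Chinese and English lowercase."""
    simplified = "".join(TRAD_TO_SIMP.get(ch, ch) for ch in fragment)
    return simplified.lower()


def preprocess(content):
    cleaned = remove_preset_characters(content)
    return [normalize(tok) for tok in segment(cleaned)]


fragments = preprocess("<b>BUY</b> the 廣告, NOW!")
```

Because every segment leaves this function in one canonical form, the downstream word bank lookup can use exact matching instead of case-insensitive or script-insensitive comparison.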
S103: respectively sending the content segments in the preset distributed message queue to a preset word bank retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word bank retrieval model is used for identifying the sensitive words and the corresponding risk types thereof; the preset AI identification model is used for identifying violation types.
In S103, the content segments in the preset distributed message queue are intelligently scheduled according to the cluster performance of the preset word bank retrieval model and the cluster performance of the preset artificial intelligence (AI) identification model, and are synchronized to the preset word bank retrieval model and the preset AI identification model respectively.
Intelligent scheduling according to cluster performance means that the content segments in the preset distributed message queue are identified by the cluster nodes of the preset word bank retrieval model that are in an idle state and by the cluster nodes of the preset AI identification model that are in an idle state.
The preset word bank retrieval model performs self-learning to build a comprehensive word bank. The self-learning mechanisms include regulatory document self-learning, internet new-word monitoring, variant word bank self-learning, business staff feedback, distributed retrieval and real-time monitoring.
Regulatory document self-learning: the model learns from industry requirements, laws and regulations. For example, words prohibited by the advertising law and banned words published by the cyberspace administration office are followed up promptly and the word bank is revised. Operators can adjust and modify the self-learning results before they are entered into the word bank.
Internet new-word monitoring: new words are mined with a word similarity algorithm, labeled and confirmed by operators, and entered into the word bank.
Variant word bank self-learning: homophones, interference words and visually similar words are expanded synchronously from the basic word bank to form a new word bank that is independent of the standard word bank.
Business staff feedback: the word bank is revised in a targeted manner according to the false-positive and false-negative cases reported during business use.
Distributed retrieval: matching is performed across several different word banks, and the result is then synchronized to the decision system. With these mechanisms updating the word banks in real time, the latest compliance requirements and the latest variant words can be identified accurately.
Real-time monitoring: policy documents, violation events and new internet words are monitored in real time, and the word bank is then expanded through self-learning.
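As a rough illustration of the variant word bank and of matching across several word banks, the sketch below derives a variant lexicon from a base lexicon by character substitution and then looks each content segment up in every lexicon. The example words, risk types and substitution table are invented for illustration; the source does not specify how variants are generated.

```python
from itertools import product

# Hypothetical base word bank: sensitive word -> risk type.
BASE_LEXICON = {"casino": "gambling", "loan": "finance_ad"}

# Hypothetical look-alike substitutions used to generate variant words.
SUBSTITUTIONS = {"o": ["0"], "a": ["@"], "i": ["1"]}


def expand_variants(word):
    """Return every spelling reachable by substitution, excluding the word itself."""
    options = [[ch] + SUBSTITUTIONS.get(ch, []) for ch in word]
    return {"".join(combo) for combo in product(*options)} - {word}


def build_variant_lexicon(base):
    """Build a separate word bank of variants, leaving the base bank untouched."""
    return {
        variant: risk
        for word, risk in base.items()
        for variant in expand_variants(word)
    }


def search(fragments, lexicons):
    """Match each fragment against the word banks in order; keep the first hit."""
    hits = []
    for fragment in fragments:
        for lexicon in lexicons:
            if fragment in lexicon:
                hits.append((fragment, lexicon[fragment]))
                break
    return hits


VARIANT_LEXICON = build_variant_lexicon(BASE_LEXICON)
hits = search(["hello", "cas1n0", "loan"], [BASE_LEXICON, VARIANT_LEXICON])
```

Keeping the variants in their own dictionary mirrors the text's point that the variant word bank is independent of the standard one, so either can be revised without rebuilding the other by hand.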
The labeling, training and calling processes of the preset AI identification model are as follows:
Sample labeling: according to the business requirements for identifying content violations, the common violation categories of text, such as advertisement and negative content, are defined first, and business data belonging to these categories is labeled.
Model training: the labeled business data, supplemented with publicly available open-source labeled data sets from the industry, is used for training. Features are extracted from the labeled data and converted into high-dimensional vector representations by a neural network embedding layer; after operations such as convolution, pooling and full connection, the model produces a classification output that indicates the violation category of the data.
Model calling: the content segments in the preset distributed message queue are passed to the trained model for identification.
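The layer sequence named in the model training step (embedding, convolution, pooling, full connection) can be illustrated with a tiny forward pass in plain Python. This is only a sketch of the data flow: the weights are random and untrained, the dimensions and the softmax output are assumptions, and no training loop is shown.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weights standing in for parameters a real model would learn."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def forward(token_ids, embedding, conv_filters, dense, window=2):
    # Embedding layer: each token id becomes a dense vector.
    vectors = [embedding[t] for t in token_ids]

    pooled = []
    for filt in conv_filters:
        # Convolution: slide the filter over `window` consecutive vectors,
        # followed by a ReLU activation.
        activations = []
        for start in range(len(vectors) - window + 1):
            patch = [x for vec in vectors[start:start + window] for x in vec]
            activations.append(max(0.0, sum(w * x for w, x in zip(filt, patch))))
        # Max pooling: keep the strongest response of this filter.
        pooled.append(max(activations))

    # Fully connected layer: one logit per violation category.
    logits = [sum(w * p for w, p in zip(row, pooled)) for row in dense]

    # Softmax turns the logits into category probabilities.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
VOCAB, EMBED_DIM, FILTERS, CLASSES, WINDOW = 10, 4, 3, 3, 2
embedding = make_matrix(VOCAB, EMBED_DIM, rng)
conv_filters = make_matrix(FILTERS, WINDOW * EMBED_DIM, rng)
dense = make_matrix(CLASSES, FILTERS, rng)

probs = forward([1, 5, 2, 7], embedding, conv_filters, dense, window=WINDOW)
```

The input must contain at least `window` tokens; a production model would pad shorter inputs and would typically be built with a deep learning framework rather than nested lists.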
Specifically, the process of sending the content segments in the preset distributed message queue to the preset word bank retrieval model and the preset AI identification model for identification, and obtaining the first identification result and the second identification result, is as follows:
First, the idle cluster nodes of the preset word bank retrieval model and the idle cluster nodes of the preset AI identification model are identified.
Then, the content segments in the preset distributed message queue are sent to the idle cluster nodes of the preset word bank retrieval model and to the idle cluster nodes of the preset AI identification model for identification, yielding the first identification result and the second identification result. The first identification result is produced by the idle cluster nodes of the preset word bank retrieval model, and the second identification result is produced by the idle cluster nodes of the preset AI identification model.
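The idle-node scheduling described above might look like the following sketch. The `Node` class, the `busy` flag and the rule that a segment waits until both clusters have an idle node are assumptions made for illustration; the source does not describe how idleness is detected.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical cluster node with a simple busy/idle flag."""
    name: str
    busy: bool = False


def pick_idle(nodes):
    """Return the first idle node, or None if every node is busy."""
    return next((node for node in nodes if not node.busy), None)


def dispatch(queue, lexicon_nodes, ai_nodes):
    """Send each queued segment to one idle node of each cluster.

    Segments stay in the queue when either cluster has no idle node.
    """
    assignments = []
    while queue:
        lexicon_node = pick_idle(lexicon_nodes)
        ai_node = pick_idle(ai_nodes)
        if lexicon_node is None or ai_node is None:
            break
        fragment = queue.popleft()
        lexicon_node.busy = ai_node.busy = True
        assignments.append((fragment, lexicon_node.name, ai_node.name))
    return assignments


pending = deque(["f1", "f2"])
lexicon_cluster = [Node("lex-1", busy=True), Node("lex-2")]
ai_cluster = [Node("ai-1")]
assignments = dispatch(pending, lexicon_cluster, ai_cluster)
```

In this run the first segment goes to `lex-2` and `ai-1`, after which both clusters are fully busy and the second segment remains queued until a node is released.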
Because the preset distributed message queue offers high performance and high availability, using it to schedule the content segments in real time and send them to the preset word bank retrieval model and the preset AI identification model improves the processing efficiency of the content segments.
The content segments in the preset format are synchronized to the preset distributed message queue, and if the number of queues in the preset distributed message queue is greater than a preset number, a dynamic capacity expansion operation is performed on the preset distributed message queue.
When the performance of the preset distributed message queue degrades, or while it is recovering to a stable level, capacity can be expanded dynamically according to the volume of concurrent content data.
The preset number may be 50, 100 and so on; the specific value is set by a technician according to actual conditions and is not specifically limited in the present application.
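One possible reading of the capacity expansion rule is sketched below. It interprets "the number of queues" as the backlog of queued segments and expands by adding a partition whenever the backlog exceeds the preset number per partition; both the interpretation and the `ExpandableQueue` class are assumptions, since the source does not specify the mechanism.

```python
from collections import deque


class ExpandableQueue:
    """In-memory stand-in for a distributed queue that can add partitions."""

    def __init__(self, preset_number=50):
        self.preset_number = preset_number
        self.partitions = [deque()]
        self._next = 0

    def backlog(self):
        """Total number of segments waiting across all partitions."""
        return sum(len(partition) for partition in self.partitions)

    def put(self, fragment):
        # Round-robin placement across the current partitions.
        index = self._next % len(self.partitions)
        self.partitions[index].append(fragment)
        self._next += 1
        # Assumed expansion rule: add one partition whenever the backlog
        # exceeds the preset number multiplied by the partition count.
        if self.backlog() > self.preset_number * len(self.partitions):
            self.partitions.append(deque())


queue = ExpandableQueue(preset_number=2)
for i in range(5):
    queue.put(f"fragment-{i}")
```

Scaling the threshold with the partition count keeps the queue from adding a new partition on every single `put` once the first threshold has been crossed.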
S104: and auditing the first identification result and the second identification result to obtain and output an auditing result.
In S104, the first result type of the first identification result and the second result type of the second identification result are audited together: the two result types jointly determine the type of the audit result, the audit result is labeled with that type, and the labeled audit result is output.
The first result type is either a risky type or a risk-free type. The second result type is a preset violation type, a preset suspected violation type or a preset compliance type, where the preset compliance type indicates content that meets the regulations.
The labeled audit results include an audit result labeled with a violation label, an audit result labeled with a suspected violation label, and an audit result labeled with a compliance label.
The process of auditing the first identification result and the second identification result to obtain and output the auditing result is as follows:
Determine a first result type corresponding to the first identification result and a second result type corresponding to the second identification result.
Judge the first result type and the second result type against the following cases.
And/or, if the first result type is the risky type and the second result type is the preset violation type, label the auditing result with the violation label and output the auditing result labeled with the violation label.
And/or, if the first result type is the risky type and the second result type is the preset suspected violation type, label the auditing result with the violation label and output the auditing result labeled with the violation label.
And/or, if the first result type is the risky type and the second result type is the preset compliance type, label the auditing result with the violation label and output the auditing result labeled with the violation label.
And/or, if the first result type is the risk-free type and the second result type is the preset violation type, label the auditing result with the violation label and output the auditing result labeled with the violation label.
And/or, if the first result type is the risk-free type and the second result type is the preset suspected violation type, label the auditing result with the suspected violation label and output the auditing result labeled with the suspected violation label.
And/or, if the first result type is the risk-free type and the second result type is the preset compliance type, label the auditing result with the compliance label and output the auditing result labeled with the compliance label.
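The six cases above amount to a lookup from the pair (first result type, second result type) to a label. The following is a minimal sketch of that lookup, not the implementation of the application; the enum and function names (FirstType, SecondType, Label, audit) are hypothetical and chosen only for illustration.

```python
from enum import Enum


class FirstType(Enum):
    """Result type from the word bank retrieval model (names are illustrative)."""
    RISKY = "risky"
    RISK_FREE = "risk-free"


class SecondType(Enum):
    """Result type from the AI identification model (names are illustrative)."""
    VIOLATION = "violation"
    SUSPECTED_VIOLATION = "suspected violation"
    COMPLIANCE = "compliance"


class Label(Enum):
    VIOLATION = "violation"
    SUSPECTED_VIOLATION = "suspected violation"
    COMPLIANCE = "compliance"


# One entry per case listed in the description.
AUDIT_TABLE = {
    (FirstType.RISKY, SecondType.VIOLATION): Label.VIOLATION,
    (FirstType.RISKY, SecondType.SUSPECTED_VIOLATION): Label.VIOLATION,
    (FirstType.RISKY, SecondType.COMPLIANCE): Label.VIOLATION,
    (FirstType.RISK_FREE, SecondType.VIOLATION): Label.VIOLATION,
    (FirstType.RISK_FREE, SecondType.SUSPECTED_VIOLATION): Label.SUSPECTED_VIOLATION,
    (FirstType.RISK_FREE, SecondType.COMPLIANCE): Label.COMPLIANCE,
}


def audit(first: FirstType, second: SecondType) -> Label:
    """Return the label for one pair of result types."""
    return AUDIT_TABLE[(first, second)]
```

Read as a table, the cases collapse to a simple rule: any risky first result yields the violation label, and a risk-free first result yields the label that matches the second result type.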
In the embodiment of the application, the content to be audited is preprocessed to obtain content fragments in a uniform format, and a highly available preset distributed message queue is used to deliver the content fragments quickly to the preset word bank retrieval model and the preset AI identification model, which facilitates fast retrieval by the preset word bank retrieval model and improves the efficiency of auditing content information. The intelligent extension and self-learning of the preset word bank retrieval model, combined with the preset AI identification model, form a dual identification mechanism that improves the identification rate when auditing content information.
Referring to fig. 2, the process in S102 of preprocessing the content to be audited to obtain the content fragment in the preset format mainly includes the following steps:
S201: If preset characters exist in the content to be audited, remove the preset characters to obtain the content to be audited without the preset characters.
S202: Process the content to be audited without the preset characters through a preset semantic algorithm to obtain an original content fragment.
S203: Perform syntax conversion on the original content fragment to obtain the content fragment in the preset format.
The execution principle of S201-S203 is consistent with that of S102; refer to the description of S102, which is not repeated here.
In the embodiment of the application, the preset characters are removed from the content to be audited, the resulting content is processed through a preset semantic algorithm to obtain the original content fragment, and the original content fragment is syntax-converted to obtain the content fragment in the preset format. Because of the syntax conversion, the content fragment in the preset format contains no content in other complex formats, so it can be retrieved quickly in subsequent processing.
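Steps S201-S203 can be sketched as three small functions. The description does not specify the preset characters, the preset semantic algorithm or the preset format, so everything concrete below is an assumption made for illustration: the preset characters are taken to be a few noise symbols, sentence splitting stands in for the semantic algorithm, and JSON stands in for the preset format.

```python
import json
import re

# Assumption: the "preset characters" are noise symbols inserted to evade matching.
PRESET_CHARACTERS = re.compile(r"[*#@~^_\u200b]")
# Assumption: sentence splitting stands in for the "preset semantic algorithm".
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def remove_preset_characters(content: str) -> str:
    """S201: strip the preset characters, if any are present."""
    return PRESET_CHARACTERS.sub("", content)


def split_into_fragments(content: str) -> list[str]:
    """S202: cut the cleaned content into original content fragments."""
    return [m.strip() for m in SENTENCE.findall(content) if m.strip()]


def to_preset_format(fragments: list[str]) -> list[str]:
    """S203: convert each fragment to one uniform format (JSON is an assumption)."""
    return [
        json.dumps({"index": i, "text": text}, ensure_ascii=False)
        for i, text in enumerate(fragments)
    ]


def preprocess(content: str) -> list[str]:
    """Run S201, S202 and S203 in order."""
    return to_preset_format(split_into_fragments(remove_preset_characters(content)))
```

For example, preprocess("F*r*e*e m#oney! Click h@ere.") yields two JSON strings whose text fields are "Free money!" and "Click here.", which could then be stored in the message queue.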
Referring to fig. 3, the process in S103 of sending the content segments in the preset distributed message queue to the preset word bank retrieval model and the preset AI identification model respectively for identification, to obtain the first identification result and the second identification result, mainly includes the following steps:
S301: Identify the cluster nodes corresponding to the preset word bank retrieval model that are in an idle state and the cluster nodes corresponding to the preset AI identification model that are in an idle state.
S302: Send the content segments in the preset distributed message queue to the idle cluster nodes corresponding to the preset word bank retrieval model and to the idle cluster nodes corresponding to the preset AI identification model respectively for identification, to obtain a first identification result and a second identification result; the first identification result is produced by the idle cluster nodes corresponding to the preset word bank retrieval model, and the second identification result is produced by the idle cluster nodes corresponding to the preset AI identification model.
The execution principle of S301-S302 is consistent with that of S103; refer to the description of S103, which is not repeated here.
In the embodiment of the application, the cluster node corresponding to the preset word bank retrieval model in the idle state and the cluster node corresponding to the preset AI identification model in the idle state are identified, and the content segments in the preset distributed message queue are respectively sent to the cluster node corresponding to the preset word bank retrieval model in the idle state and the cluster node corresponding to the preset AI identification model in the idle state for identification, so that the purpose of obtaining the first identification result and the second identification result is achieved.
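A minimal sketch of S301-S302 follows, assuming an in-memory list of nodes and a local deque in place of the distributed message queue. The Node class, its fields and the dispatch function are hypothetical; how a real cluster reports idle state is not described in the application.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster node serving one of the two models (hypothetical structure)."""
    name: str
    model: str  # "lexicon" for the word bank model, "ai" for the AI model
    idle: bool = True
    received: list = field(default_factory=list)


def find_idle_node(nodes, model):
    """S301: return the first idle node serving the given model, or None."""
    for node in nodes:
        if node.model == model and node.idle:
            return node
    return None


def dispatch(queue, nodes):
    """S302: send each queued segment to one idle node of each model."""
    assignments = []
    while queue:
        lexicon_node = find_idle_node(nodes, "lexicon")
        ai_node = find_idle_node(nodes, "ai")
        if lexicon_node is None or ai_node is None:
            break  # no idle pair available; remaining segments stay queued
        segment = queue.popleft()
        for node in (lexicon_node, ai_node):
            node.idle = False  # the node is busy until it reports a result
            node.received.append(segment)
        assignments.append((segment, lexicon_node.name, ai_node.name))
    return assignments
```

In this sketch a node stays busy after receiving a segment; a fuller version would set idle back to True once the node returns its identification result.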
Based on the data processing method disclosed in fig. 1 in the foregoing embodiment, an embodiment of the present application further discloses a schematic structural diagram of a data processing system, and as shown in fig. 4, the data processing system includes:
The obtaining unit 401 is configured to obtain content to be audited.
The preprocessing unit 402 is configured to preprocess the content to be checked, obtain a content segment in a preset format, and store the content segment in a preset distributed message queue.
The identifying unit 403 is configured to send content segments in the preset distributed message queue to the preset lexicon retrieval model and the preset AI identification model respectively for identification, so as to obtain a first identification result and a second identification result; the preset word bank retrieval model is used for identifying the sensitive words and the corresponding risk types thereof; the preset AI identification model is used for identifying violation types.
The auditing unit 404 is configured to perform auditing processing on the first identification result and the second identification result, obtain an auditing result, and output the auditing result.
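The way the four units hand data to one another can be sketched as a single function. This is only an illustration under stated assumptions: a local queue.Queue stands in for the preset distributed message queue, and the two model functions are trivial stand-ins (a one-word lexicon and a punctuation check) with hypothetical names, not the models of the application.

```python
from queue import Queue

# Hypothetical lexicon: sensitive word -> risk type.
SENSITIVE_WORDS = {"jackpot": "fraud"}


def lexicon_model(segment):
    """Stand-in for the preset word bank retrieval model."""
    lowered = segment.lower()
    return {word: risk for word, risk in SENSITIVE_WORDS.items() if word in lowered}


def ai_model(segment):
    """Stand-in for the preset AI identification model."""
    return "suspected violation" if "!!!" in segment else "compliance"


def run_pipeline(contents):
    message_queue = Queue()
    for content in contents:                 # obtaining unit 401
        message_queue.put(content.strip())   # preprocessing unit 402 (trivial here)
    results = []
    while not message_queue.empty():         # identifying unit 403
        segment = message_queue.get()
        first = lexicon_model(segment)       # first identification result
        second = ai_model(segment)           # second identification result
        # Auditing unit 404: a risky first result gives the violation label,
        # otherwise the label follows the second result.
        label = "violation" if first else second
        results.append((segment, label))
    return results
```

Both models read the same segment from the queue, which is the dual identification mechanism the description refers to; the audit step then combines the two results into one label.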
Furthermore, the preprocessing unit 402, which preprocesses the content to be audited to obtain the content segment in the preset format, comprises a removing module, a calculating module and a converting module.
The removing module is used for removing the preset characters from the content to be audited if preset characters are detected in it, to obtain the content to be audited without the preset characters.
The calculating module is used for processing the content to be audited without the preset characters through a preset semantic algorithm to obtain the original content fragment.
The converting module is used for performing syntax conversion on the original content fragment to obtain the content fragment in the preset format.
Further, the identification unit 403 includes an identification module and a sending module.
And the identification module is used for identifying cluster nodes corresponding to the preset word bank retrieval model in the idle state and cluster nodes corresponding to the preset AI identification model in the idle state.
The sending module is used for respectively sending the content segments in the distributed message queue to the cluster nodes corresponding to the preset word bank retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state for identification to obtain a first identification result and a second identification result; the first identification result is obtained by identifying cluster nodes corresponding to the preset word bank retrieval model in an idle state; and the second identification result is obtained by identifying the cluster node corresponding to the preset AI identification model in the idle state.
Further, the auditing unit 404 includes a determining module and a judging module.
The determining module is used for determining a first result type corresponding to the first identification result and a second result type corresponding to the second identification result.
The judging module is used for judging the first result type and the second result type.
And/or, the auditing unit 404 includes a first labeling module.
The first labeling module is used for labeling the auditing result with the violation label and outputting the auditing result labeled with the violation label if the first result type is the risky type and the second result type is the preset violation type.
And/or, the auditing unit 404 includes a second labeling module.
The second labeling module is used for labeling the auditing result with the violation label and outputting the auditing result labeled with the violation label if the first result type is the risky type and the second result type is the preset suspected violation type.
And/or, the auditing unit 404 includes a third labeling module.
The third labeling module is used for labeling the auditing result with the violation label and outputting the auditing result labeled with the violation label if the first result type is the risky type and the second result type is the preset compliance type.
And/or, the auditing unit 404 includes a fourth labeling module.
The fourth labeling module is used for labeling the auditing result with the violation label and outputting the auditing result labeled with the violation label if the first result type is the risk-free type and the second result type is the preset violation type.
And/or, the auditing unit 404 includes a fifth labeling module.
The fifth labeling module is used for labeling the auditing result with the suspected violation label and outputting the auditing result labeled with the suspected violation label if the first result type is the risk-free type and the second result type is the preset suspected violation type.
And/or, the auditing unit 404 includes a sixth labeling module.
The sixth labeling module is used for labeling the auditing result with the compliance label and outputting the auditing result labeled with the compliance label if the first result type is the risk-free type and the second result type is the preset compliance type.
Furthermore, the system also comprises a capacity expansion unit.
And the capacity expansion unit is used for carrying out dynamic capacity expansion operation on the preset distributed message queue if the number of queues in the preset distributed message queue is greater than the preset number.
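The capacity expansion rule can be sketched as follows. The description does not say what is counted or how the queue is expanded, so this sketch makes two labeled assumptions: the count is the number of queued segments per partition, and expansion means adding one partition. The class name ExpandableQueue and the parameter preset_number are hypothetical.

```python
from collections import deque


class ExpandableQueue:
    """Toy partitioned queue that grows when its backlog passes a threshold."""

    def __init__(self, partitions=1, preset_number=100):
        self.partitions = [deque() for _ in range(partitions)]
        self.preset_number = preset_number  # assumed: allowed backlog per partition
        self._next = 0

    def backlog(self):
        """Total number of queued segments across all partitions."""
        return sum(len(p) for p in self.partitions)

    def put(self, segment):
        # Round-robin placement across the current partitions.
        self.partitions[self._next % len(self.partitions)].append(segment)
        self._next += 1
        # Expand when the backlog exceeds the preset number per partition.
        if self.backlog() > self.preset_number * len(self.partitions):
            self.expand()

    def expand(self):
        """Dynamic capacity expansion: add one empty partition (an assumption)."""
        self.partitions.append(deque())
```

With preset_number=2 and one starting partition, the third queued segment triggers the first expansion and the fifth triggers the second, so capacity grows with the backlog rather than on every write.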
In the embodiment of the application, the content to be audited is preprocessed to obtain content fragments in a uniform format, and a highly available preset distributed message queue is used to deliver the content fragments quickly to the preset word bank retrieval model and the preset AI identification model, which facilitates fast retrieval by the preset word bank retrieval model and improves the efficiency of auditing content information. The intelligent extension and self-learning of the preset word bank retrieval model, combined with the preset AI identification model, form a dual identification mechanism that improves the identification rate when auditing content information.
The embodiment of the application also provides a storage medium, which comprises stored instructions, wherein when the instructions are executed, the device where the storage medium is located is controlled to execute the data processing method.
The embodiment of the present application further provides an electronic device, which has a schematic structural diagram as shown in fig. 5, and specifically includes a memory 501 and one or more instructions 502, where the one or more instructions 502 are stored in the memory 501, and are configured to be executed by one or more processors 503 to perform the data processing method.
The specific implementation procedures and derivatives thereof of the above embodiments are within the scope of the present application.
The embodiments in the present specification are described in a progressive manner; for the same or similar parts, the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the system embodiments are substantially similar to the method embodiments and are therefore described more briefly; for related points, refer to the corresponding descriptions of the method embodiments. The above-described system embodiments are only illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present application and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the present application, and these improvements and modifications should also be considered as the protection scope of the present application.

Claims (10)

1. A method of data processing, the method comprising:
acquiring content to be audited;
preprocessing the content to be audited to obtain a content fragment with a preset format, and storing the content fragment to a preset distributed message queue;
respectively sending the content segments in the preset distributed message queue to a preset word bank retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word bank retrieval model is used for identifying sensitive words and corresponding risk types thereof; the preset AI identification model is used for identifying violation types;
and auditing the first identification result and the second identification result to obtain and output an auditing result.
2. The method according to claim 1, wherein the preprocessing the content to be audited to obtain a content segment with a preset format includes:
if the preset characters exist in the content to be audited, removing the preset characters in the content to be audited to obtain the content to be audited without the preset characters;
calculating the content to be checked without the preset characters through a preset semantic algorithm to obtain an original content fragment;
and carrying out syntax conversion on the original content fragment to obtain a content fragment with a preset format.
3. The method according to claim 1, wherein the sending the content segments in the preset distributed message queue to a preset word bank retrieval model and a preset AI identification model for identification respectively to obtain a first identification result and a second identification result comprises:
identifying cluster nodes corresponding to the preset word bank retrieval model in an idle state and cluster nodes corresponding to the preset AI identification model in the idle state;
respectively sending the content segments in the preset distributed message queue to cluster nodes corresponding to the preset word bank retrieval model in the idle state and cluster nodes corresponding to the preset AI identification model in the idle state for identification to obtain a first identification result and a second identification result; the first identification result is obtained by identifying cluster nodes corresponding to the preset word bank retrieval model in the idle state; and the second identification result is obtained by identifying the cluster node corresponding to the preset AI identification model in the idle state.
4. The method according to claim 1, wherein the auditing the first identification result and the second identification result to obtain and output an auditing result includes:
determining a first result type corresponding to the first identification result and a second result type corresponding to the second identification result;
judging the first result type corresponding to the first identification result and the second result type corresponding to the second identification result;
and/or if the first result type is a risky type and the second result type is a preset violation type, labeling a label corresponding to the obtained auditing result as a violation label, and outputting the auditing result labeled with the violation label;
and/or if the first result type is the risky type and the second result type is a preset suspected violation type, labeling a label corresponding to the obtained auditing result as a violation label, and outputting the auditing result labeled with the violation label;
and/or if the first result type is the risky type and the second result type is a preset compliance type, labeling a label corresponding to the obtained auditing result as a violation label, and outputting the auditing result labeled with the violation label;
and/or if the first result type is a risk-free type and the second result type is the preset violation type, labeling a label corresponding to the obtained auditing result as a violation label, and outputting the auditing result labeled with the violation label;
and/or if the first result type is the risk-free type and the second result type is the preset suspected violation type, labeling a label corresponding to the obtained auditing result as a suspected violation label, and outputting the auditing result labeled with the suspected violation label;
and/or if the first result type is the risk-free type and the second result type is the preset compliance type, labeling a label corresponding to the obtained auditing result as a compliance label, and outputting the auditing result labeled with the compliance label.
5. The method of claim 1, further comprising:
and if the number of queues in the preset distributed message queue is greater than the preset number, performing dynamic capacity expansion operation on the preset distributed message queue.
6. A data processing system, characterized in that the system comprises:
the acquisition unit is used for acquiring the content to be audited;
the preprocessing unit is used for preprocessing the content to be audited to obtain a content fragment with a preset format and storing the content fragment to a preset distributed message queue;
the identification unit is used for respectively sending the content segments in the preset distributed message queue to a preset word bank retrieval model and a preset AI identification model for identification to obtain a first identification result and a second identification result; the preset word bank retrieval model is used for identifying sensitive words and corresponding risk types thereof; the preset AI identification model is used for identifying violation types;
and the auditing unit is used for auditing the first identification result and the second identification result to obtain and output an auditing result.
7. The system according to claim 6, wherein the preprocessing unit that preprocesses the content to be audited to obtain a content segment with a preset format comprises:
the removing module is used for removing the preset characters in the content to be audited to obtain the content to be audited without the preset characters if the preset characters are monitored to exist in the content to be audited;
the computing module is used for computing the content to be audited without the preset characters through a preset semantic algorithm to obtain an original content fragment;
and the conversion module is used for carrying out syntax conversion on the original content fragment to obtain the content fragment with the preset format.
8. The system of claim 6, wherein the identification unit comprises:
the identification module is used for identifying cluster nodes corresponding to the preset word bank retrieval model in the idle state and cluster nodes corresponding to the preset AI identification model in the idle state;
the sending module is used for respectively sending the content segments in the preset distributed message queue to the cluster nodes corresponding to the preset word bank retrieval model in the idle state and the cluster nodes corresponding to the preset AI identification model in the idle state for identification to obtain a first identification result and a second identification result; the first identification result is obtained by identifying cluster nodes corresponding to the preset word bank retrieval model in the idle state; and the second identification result is obtained by identifying the cluster node corresponding to the preset AI identification model in the idle state.
9. A storage medium comprising stored instructions, wherein the instructions, when executed, control a device on which the storage medium resides to perform a data processing method according to any one of claims 1 to 5.
10. An electronic device comprising a memory and one or more instructions, wherein the one or more instructions are stored in the memory and configured to be executed by one or more processors to perform the data processing method of any one of claims 1 to 5.
CN202111027813.6A 2021-09-02 2021-09-02 Data processing method, system, storage medium and electronic equipment Pending CN113704414A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111027813.6A CN113704414A (en) 2021-09-02 2021-09-02 Data processing method, system, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111027813.6A CN113704414A (en) 2021-09-02 2021-09-02 Data processing method, system, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN113704414A true CN113704414A (en) 2021-11-26

Family

ID=78658905

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111027813.6A Pending CN113704414A (en) 2021-09-02 2021-09-02 Data processing method, system, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN113704414A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090249419A1 (en) * 2008-03-25 2009-10-01 Kahn Brian E Method and System of Queued Management of Multimedia Storage
US20150296229A1 (en) * 2014-04-10 2015-10-15 Telibrahma Convergent Communications Private Limited Method and system for auditing multimedia content
CN108647309A (en) * 2018-05-09 2018-10-12 达而观信息科技(上海)有限公司 Chat content checking method based on sensitive word and system
CN109831751A (en) * 2019-01-04 2019-05-31 上海创蓝文化传播有限公司 A kind of short message content air control system and method based on natural language processing
CN110674255A (en) * 2019-09-24 2020-01-10 湖南快乐阳光互动娱乐传媒有限公司 Text content auditing method and device
US10810373B1 (en) * 2018-10-30 2020-10-20 Oath Inc. Systems and methods for unsupervised neologism normalization of electronic content using embedding space mapping
CN112507936A (en) * 2020-12-16 2021-03-16 平安银行股份有限公司 Image information auditing method and device, electronic equipment and readable storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116306619A (en) * 2023-05-17 2023-06-23 北京拓普丰联信息科技股份有限公司 Document detection method and device, electronic equipment and storage medium
CN116306619B (en) * 2023-05-17 2023-08-25 北京拓普丰联信息科技股份有限公司 Document detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination