CN114329112A - Content auditing method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114329112A
Authority
CN
China
Prior art keywords
word
text
level
sensitive
result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111599306.XA
Other languages
Chinese (zh)
Inventor
吴俊清
刘芳彤
汪一鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinao Xinzhi Technology Co ltd
Original Assignee
Xinao Xinzhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinao Xinzhi Technology Co ltd filed Critical Xinao Xinzhi Technology Co ltd
Priority to CN202111599306.XA priority Critical patent/CN114329112A/en
Publication of CN114329112A publication Critical patent/CN114329112A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of natural language processing, and in particular to a content auditing method and device, electronic equipment, and a storage medium. The method comprises the following steps: performing search matching on the text to be audited at character-level granularity to obtain a search matching result; dividing the text to be audited at the semantic level by a word segmentation method to obtain a word-level text, and comparing the word-level text with the search matching result for sensitive words to obtain a comparison result; and fusing based on the comparison result, calculating the confidence of each fused sensitive word, and obtaining the content auditing result from the confidences of the sensitive words. This addresses the problems in the related art of low auditing efficiency, large errors, and high auditing cost caused by auditing text content manually.

Description

Content auditing method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a content auditing method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of social life, the volume of data on network platforms and social media platforms has grown explosively. How to better supervise and clean up the online world, and so provide people with a healthy and civil online life, has become an urgent problem.
In the related art, text content is usually audited manually. However, manual auditing greatly increases the time required, lowers auditing efficiency, and suffers from large errors and high cost.
Disclosure of Invention
The application provides a content auditing method and device based on character-level and word-level fusion, electronic equipment and a storage medium, which are used for solving the problems of low auditing efficiency, large error, increased auditing cost and the like caused by adopting a manual auditing mode to audit text content in the related technology.
The embodiment of the first aspect of the application provides a content auditing method based on character-level and word-level fusion, which comprises the following steps: searching and matching the text to be audited according to the granularity of the character level to obtain a search matching result; performing semantic level division on the text to be audited by using a word segmentation method to obtain a divided word-level text, and performing sensitive word comparison on the divided word-level text and the search matching result to obtain a comparison result; and fusing based on the comparison result, calculating the confidence of the fused sensitive words, and obtaining a content verification result according to the confidence of the sensitive words.
Further, the method also includes: performing text preprocessing on the initial text to obtain the text to be audited that meets the auditing conditions.
Further, performing search matching on the text to be audited at character-level granularity to obtain a search matching result includes: constructing a search tree for search matching; and extracting the common prefixes of the character strings of the text to be audited, and obtaining the search matching result from the search tree based on the common prefixes.
Further, fusing based on the comparison result and calculating the confidence of the fused sensitive word includes: when the comparison shows that the sensitive word in the search matching result is identical to a segmented word of the word-level text and is hit, the confidence of the sensitive word is 1; when the comparison shows that the sensitive word in the search matching result is shorter than the segmented word of the word-level text and is hit, the confidence of the sensitive word is 0.2; and when the comparison shows that the segmented words of the word-level text are proper subsets of the sensitive word in the search matching result, and the adjacent segmented words, once merged, have the same length as the sensitive word, the confidence of the sensitive word is 1.
Further, the method also includes: acquiring the business requirements of a user; and obtaining the corresponding personalized sensitive words according to the business requirements, and adding the personalized sensitive words to the sensitive word bank used for search matching.
The embodiment of the second aspect of the present application provides a content auditing device based on character-level and word-level fusion, including: the matching module is used for searching and matching the text to be audited according to the granularity of the character level to obtain a search matching result; the division module is used for performing semantic level division on the text to be audited by utilizing a word segmentation method to obtain a divided word-level text, and performing sensitive word comparison on the divided word-level text and the search matching result to obtain a comparison result; and the fusion module is used for fusing based on the comparison result, calculating the confidence coefficient of the fused sensitive words and obtaining a content verification result according to the confidence coefficient of the sensitive words.
Further, the device also includes: a preprocessing module, used for performing text preprocessing on the initial text to obtain the text to be audited that meets the auditing conditions; and a management module, used for acquiring the business requirements of a user, obtaining the corresponding personalized sensitive words according to the business requirements, and adding the personalized sensitive words to the sensitive word bank used for search matching.
Further, the matching module is used for constructing a search tree for search matching; and for extracting the common prefixes of the character strings of the text to be audited, and obtaining the search matching result from the search tree based on the common prefixes.
Further, the fusion module is configured so that: when the comparison shows that the sensitive word in the search matching result is identical to a segmented word of the word-level text and is hit, the confidence of the sensitive word is 1; when the comparison shows that the sensitive word in the search matching result is shorter than the segmented word of the word-level text and is hit, the confidence of the sensitive word is 0.2; and when the comparison shows that the segmented words of the word-level text are proper subsets of the sensitive word in the search matching result, and the adjacent segmented words, once merged, have the same length as the sensitive word, the confidence of the sensitive word is 1.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the content auditing method based on character-level and word-level fusion described in the above embodiments.
A fourth aspect of the present application provides a computer-readable storage medium storing computer instructions for causing a computer to execute a content auditing method based on character-level and word-level fusion as described in the above embodiments.
Therefore, the application has at least the following beneficial effects:
Text content can be audited automatically based on character-level and word-level fusion over a dynamic word bank, which reduces the time and effort consumed by manual auditing, improves auditing efficiency, and lowers auditing cost. The fused result preserves matching accuracy while adding word-level semantic information, which makes the auditing result more reasonable and greatly reduces auditing errors. This addresses the problems in the related art of low auditing efficiency, large errors, and high auditing cost caused by auditing text content manually.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a schematic flow chart of a content auditing method based on character-level and word-level fusion according to an embodiment of the present application;
Fig. 2 is a schematic flow chart of a content auditing method based on character-level and word-level fusion according to a specific embodiment of the present application;
Fig. 3 is an example diagram of a content auditing apparatus based on character-level and word-level fusion according to an embodiment of the present application;
Fig. 4 is an example diagram of an electronic device provided according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
In the related art, the text content auditing method comprises the following modes:
(1) Sensitive word matching is performed directly on the text to be audited; that is, any word in the text that is identical to a word in the sensitive word bank counts as a hit. However, this matching mode is limited by the size of the sensitive word bank and ignores semantic-level information, so the word bank is inflexible for content auditing; because only identical words are matched and the original semantic content is ignored, the auditing result can be distorted; and when a large sensitive word bank has to be compared against, the response is often too slow.
(2) The text to be audited is converted into a vector, the similarity between the mapped semantic vector and the mapped sensitive words is computed, and the text is judged non-compliant when a specified threshold is reached. However, similarity-based calculation depends on a large amount of training data and an accurate trained model, the auditing result carries a risk of missed and false reports, and it is difficult for the trained model to be highly robust.
In order to solve the problems, simplify a text auditing method, improve the validity and efficiency of the text auditing, and reduce the time and energy consumed by manual auditing, the embodiment of the application provides a content auditing method and device based on character level and word level fusion, an electronic device and a storage medium, so that auditing errors are reduced, and the manual auditing cost is reduced.
A content auditing method, apparatus, electronic device, and storage medium based on character-level and word-level fusion according to embodiments of the present application will be described below with reference to the accompanying drawings. In the method, text content can be audited automatically based on character-level and word-level fusion over a dynamic word bank, which reduces the time and effort consumed by manual auditing, improves auditing efficiency, and lowers auditing cost. The fused result preserves matching accuracy while adding word-level semantic information, which makes the auditing result more reasonable and greatly reduces auditing errors. This addresses the problems in the related art of low auditing efficiency, large errors, and high auditing cost caused by auditing text content manually.
Specifically, Fig. 1 is a schematic flowchart of a content auditing method based on character-level and word-level fusion according to an embodiment of the present application.
As shown in Fig. 1, the content auditing method based on character-level and word-level fusion includes the following steps:
In step S101, the text to be audited is searched and matched at character-level granularity to obtain a search matching result.
Wherein, the characters refer to font-like units or symbols, including letters, numbers, operation symbols, punctuation marks and other symbols, and some functional symbols. A character is a general term for letters, numbers and symbols in electronic computers or radio communications, and is the smallest unit of data access in a data structure, and usually a character is represented by 8 binary bits (one byte). Characters are the form of binary coding that is often used in computers, and are also the most common form of information used in computers.
Here, coarse granularity means macroscopic and general, while fine granularity means more microscopic and focused on detail.
In this embodiment, performing search matching on the text to be audited at character-level granularity to obtain a search matching result includes: constructing a search tree for search matching; and extracting the common prefixes of the character strings of the text to be audited, and obtaining the search matching result from the search tree based on the common prefixes.
It can be understood that search speed in the related art depends on the size of the sensitive word bank. If the word bank is stored directly in a database, space overhead increases, and once the number of entries grows, the response speed of audit matching drops directly. A search tree can therefore be constructed so that the common prefixes of character strings cut query time, reducing response time and shortening the auditing cycle.
In the embodiment of the present application, the search tree may be constructed from two word banks, a basic sensitive word bank and a custom sensitive word bank, so as to reduce space overhead. Specifically, the sensitive word bank is organized as a tree, finally forming a multi-way tree in which words share common prefixes; this form greatly improves the response speed of audit matching and suits scenarios that search large amounts of data. Exploiting the common-prefix property of the tree model in this way reduces query time, so the response time of text auditing is greatly shortened and text auditing becomes more effective and efficient.
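As an illustration only, a search tree with shared prefixes can be sketched as a trie. The names `TrieNode`, `build_trie`, and `search`, and the sample word lists, are hypothetical and not taken from the application; the sketch assumes every overlapping match should be reported.

```python
from typing import Dict, List, Tuple


class TrieNode:
    """One node of the search tree; children are keyed by a single character."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False  # True when the path from the root spells a sensitive word


def build_trie(*lexicons: List[str]) -> TrieNode:
    """Merge several word lists (e.g. basic + custom) into one prefix tree."""
    root = TrieNode()
    for lexicon in lexicons:
        for word in lexicon:
            node = root
            for ch in word:
                # Words with a common prefix reuse the same chain of nodes.
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True
    return root


def search(text: str, root: TrieNode) -> List[Tuple[int, int, str]]:
    """Return (start, end, word) for every sensitive word found in text.

    end is exclusive. All matches are reported, including overlapping ones.
    """
    hits = []
    for start in range(len(text)):
        node = root
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break  # no sensitive word continues with this character
            if node.is_word:
                hits.append((start, end + 1, text[start:end + 1]))
    return hits


basic = ["bad", "badge"]   # hypothetical basic sensitive word bank
custom = ["spam"]          # hypothetical custom sensitive word bank
trie = build_trie(basic, custom)
print(search("a badge of spam", trie))
# [(2, 5, 'bad'), (2, 7, 'badge'), (11, 15, 'spam')]
```

Because "bad" and "badge" share the path b-a-d, one walk from position 2 finds both, which is the query-time saving that the common prefix provides.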
In this embodiment, the method of the embodiment of the present application further includes: acquiring the business requirements of a user; and obtaining the corresponding personalized sensitive words according to the business requirements, and adding the personalized sensitive words to the sensitive word bank used for search matching.
It can be understood that the user can define a personalized sensitive word bank, i.e. the custom sensitive word bank, according to business requirements. This removes the limitation that direct matching is bound to a fixed word bank and makes text auditing more flexible: the user can dynamically add to and modify the custom word bank according to personalized business requirements, supplementing the general word bank, so that auditing adapts better to changing business needs. Add, delete, modify, and query operations on the custom sensitive word bank are automatically synchronized to the nodes of the tree, which preserves the search efficiency of the search tree and reduces search time.
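One possible way to keep a custom word bank and the tree in step is to apply every add, delete, modify, or query operation directly to the tree. The sketch below is an assumption about how that could look; `LexiconTrie`, its methods, and the `"$end"` marker are hypothetical names, not part of the application.

```python
from typing import Dict, List, Tuple


class LexiconTrie:
    """Prefix tree whose nodes are plain dicts; END marks a complete word."""

    # Hypothetical marker key; it is longer than one character, so it cannot
    # clash with a child keyed by a single character.
    END = "$end"

    def __init__(self) -> None:
        self.root: Dict = {}

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[self.END] = True

    def contains(self, word: str) -> bool:
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return self.END in node

    def remove(self, word: str) -> bool:
        """Delete a word and prune branches that no other word uses."""
        path: List[Tuple[Dict, str]] = []
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            path.append((node, ch))
            node = node[ch]
        if self.END not in node:
            return False
        del node[self.END]
        # Walk back towards the root, dropping nodes that became empty.
        for parent, ch in reversed(path):
            if parent[ch]:
                break
            del parent[ch]
        return True

    def update(self, old: str, new: str) -> None:
        """Modify an entry by removing the old spelling and adding the new one."""
        self.remove(old)
        self.add(new)


lex = LexiconTrie()
lex.add("spam")
lex.add("spammer")
lex.remove("spam")               # "spammer" keeps the shared s-p-a-m path alive
print(lex.contains("spam"), lex.contains("spammer"))  # False True
```

Pruning on removal keeps the tree from accumulating dead branches, so lookups stay proportional to word length however often the custom bank changes.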
In this embodiment, the method of the embodiment of the present application further includes: and performing text preprocessing on the initial text to obtain the text to be audited meeting the auditing conditions.
The preprocessing may include: converting English letters in the text to be audited to lower case, converting traditional Chinese characters in the text to simplified characters, deleting characters in the text, removing conventional stop words, and the like.
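A minimal sketch of such a preprocessing step follows. The conversion table and stop-word set are deliberately tiny, hypothetical stand-ins; the sketch also assumes that the characters to delete are punctuation and other non-word symbols, which the text does not specify, and that stop words can be found by splitting on whitespace.

```python
import re
from typing import Dict, Set

# Hypothetical, deliberately tiny tables; a real system would load full ones.
TRADITIONAL_TO_SIMPLIFIED: Dict[str, str] = {"體": "体", "語": "语", "審": "审"}
STOP_WORDS: Set[str] = {"the", "a", "of"}
# Assumption: "characters to delete" means anything that is neither a word
# character nor whitespace (punctuation, symbols).
NOISE = re.compile(r"[^\w\s]")


def preprocess(text: str) -> str:
    """Normalize raw text into the form handed to the auditing steps."""
    text = text.lower()                                                   # unify case
    text = "".join(TRADITIONAL_TO_SIMPLIFIED.get(ch, ch) for ch in text)  # unify script
    text = NOISE.sub("", text)                                            # delete noise characters
    tokens = [t for t in text.split() if t not in STOP_WORDS]             # drop stop words
    return " ".join(tokens)


print(preprocess("The BADGE, of  Honor!"))  # badge honor
print(preprocess("審核 語言"))               # 审核 语言
```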
The audit conditions may be specifically set according to actual audit requirements, which is not specifically limited.
In step S102, semantic level division is performed on the text to be audited by using a word segmentation method to obtain a divided word-level text, and sensitive word comparison is performed on the divided word-level text and the search matching result to obtain a comparison result.
Here, "word" is the collective term for words and phrases, covering words (both simple and compound words) and phrases; it is the smallest structural unit from which sentences and articles are built.
It can be understood that, in the embodiment of the application, the text to be audited can be divided at the semantic level by a word segmentation technique, and the divided text, with the word as the minimum granularity, is compared with the directly matched result to obtain the comparison result.
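The comparison can be pictured as lining up character offsets: each segmented word and each matched sensitive word occupies a span of the text. The sketch below uses forward maximum matching purely as a stand-in segmenter, since the application does not name a segmentation algorithm; `segment`, `overlapping_tokens`, and the vocabulary are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, text)


def segment(text: str, vocab: Set[str], max_len: int = 4) -> List[Span]:
    """Forward maximum matching: take the longest vocabulary word at each position.

    Characters that start no vocabulary word become single-character tokens.
    """
    spans, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                spans.append((i, i + size, piece))
                i += size
                break
    return spans


def overlapping_tokens(hit: Span, tokens: List[Span]) -> List[Span]:
    """Tokens whose character range intersects the matched sensitive word."""
    start, end, _ = hit
    return [t for t in tokens if t[0] < end and t[1] > start]


tokens = segment("今天天气很好", {"今天", "天气", "很好"})
print(tokens)
# [(0, 2, '今天'), (2, 4, '天气'), (4, 6, '很好')]

# A character-level hit on "天天" (offsets 1..3) straddles two segmented words.
print(overlapping_tokens((1, 3, "天天"), tokens))
# [(0, 2, '今天'), (2, 4, '天气')]
```

A hit that straddles two words, as in the last call, is a purely character-level coincidence; the word-level view is what allows such a hit to be scored differently from one that coincides with a whole word.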
In step S103, fusion is performed based on the comparison result, and a confidence of the fused sensitive word is calculated, so as to obtain a content review result.
It can be understood that, after the comparison result is obtained, the containment relationship between the two can be computed and used as the basis for fusing the auditing results, so that the fused result preserves matching accuracy while adding word-level semantic information, making the auditing result more reasonable; scores are then assigned according to business rules, and the fused auditing result is pushed to the user with its confidence as the basis for judgment.
In this embodiment, fusing based on the comparison result and calculating the confidence of the fused sensitive word includes: when the comparison shows that the sensitive word in the search matching result is identical to a segmented word of the word-level text and is hit, the confidence of the sensitive word is 1; when the comparison shows that the sensitive word in the search matching result is shorter than the segmented word of the word-level text and is hit, the confidence of the sensitive word is 0.2; and when the comparison shows that the segmented words of the word-level text are proper subsets of the sensitive word in the search matching result, and the adjacent segmented words, once merged, have the same length as the sensitive word, the confidence of the sensitive word is 1.
It can be understood that if the hit sensitive word coincides exactly with a word produced by word segmentation, the confidence of the sensitive word is 1; if the hit sensitive word is a proper subset of a segmented word, or is not completely contained in a segmented word, the confidence of the sensitive word is 0.2. The confidence of the final text is the sum of the confidences of all matched sensitive words. If the text contains no sensitive word, the returned result is pass; if the final confidence of the text is greater than 0.5, the returned result is rejection; if the final confidence is greater than 0 and less than 0.5, the returned result is manual review.
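The decision rule can be sketched as a small function over the per-word confidences. The name `decide` and the result strings are hypothetical. The text does not say what happens when the total is exactly 0.5; this sketch routes that case to manual review as an assumption.

```python
from typing import Iterable

REJECT_THRESHOLD = 0.5  # threshold stated in the description


def decide(confidences: Iterable[float]) -> str:
    """Map the confidences of all matched sensitive words to an auditing result."""
    scores = list(confidences)
    if not scores:
        return "pass"            # no sensitive word was matched
    total = sum(scores)          # text confidence = sum over matched words
    if total > REJECT_THRESHOLD:
        return "reject"
    # Assumption: a total of exactly 0.5 is not covered by the description;
    # it falls through to manual review here.
    return "manual review"


print(decide([]))               # pass
print(decide([1.0]))            # reject
print(decide([0.2, 0.2]))       # manual review
print(decide([0.2, 0.2, 0.2]))  # reject
```

Since the total is a sum, a single exact match is enough to reject, while partial matches only trigger rejection once three or more accumulate.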
According to the content auditing method based on character-level and word-level fusion provided by the embodiment of the application, text content can be audited automatically based on character-level and word-level fusion over a dynamic word bank. This reduces the time and effort consumed by manual auditing, improves auditing efficiency, and lowers auditing cost; the fused result preserves matching accuracy while adding word-level semantic information, which makes the auditing result more reasonable and greatly reduces auditing errors. In addition, building the search tree on the sensitive word bank exploits the common-prefix property of the tree model, which greatly shortens the response time of text auditing and improves its effectiveness and efficiency.
The content auditing method based on character-level and word-level fusion is explained below through a specific embodiment. As shown in Fig. 2, the method includes the following steps:
1. Text preprocessing
Convert English letters in the text to be audited to lower case; convert traditional Chinese characters in the text to simplified characters; delete characters in the text; remove conventional stop words.
2. Text word segmentation
Word segmentation is performed on the text to be audited to divide the text content at word-level granularity. The segmented text takes the semantic information of the original content into account and gives the subsequent auditing result semantic-level support.
3. Search mode
By constructing a search tree, the common prefixes of character strings are used to cut query time, reduce response time, and shorten the auditing cycle.
4. Sensitive word comparison and fusion
For each hit sensitive word, judge whether it is contained in a word produced by word segmentation:
(1) if the judgment is true and the sensitive word has the same length as the segmented word, the sensitive word and the segmentation result are identical and hit, and the fused confidence of the sensitive word is 1;
(2) if the judgment is true and the sensitive word is a proper subset of the segmented word, that is, the sensitive word is shorter than the segmented word and is hit, the fused confidence of the sensitive word is 0.2;
(3) if the judgment is false but adjacent segmented words, once merged, equal the sensitive word, that is, each segmented word is a proper subset of the sensitive word and the merged adjacent words have the same length as the hit sensitive word, the fused confidence of the sensitive word is 1. This resolves the case in which word segmentation splits a sensitive word apart.
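The three cases above can be sketched over character offsets. This is one reading of the rules, not a definitive implementation: `fuse` is a hypothetical name, token spans are assumed to be sorted and non-overlapping, and a hit that straddles token boundaries without lining up is given 0.2 on the assumption that it is "not completely contained" in a segmented word.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end_exclusive) character offsets


def fuse(hit: Span, tokens: List[Span]) -> float:
    """Confidence of one character-level hit given the word-level token spans.

    Assumes tokens are sorted by position and do not overlap.
    """
    h_start, h_end = hit
    for t_start, t_end in tokens:
        if t_start <= h_start and h_end <= t_end:
            # Case (1): identical span. Case (2): hit lies strictly inside a longer word.
            return 1.0 if (t_start, t_end) == hit else 0.2
    # Case (3): segmentation split the sensitive word; check whether adjacent
    # tokens tile the hit exactly, i.e. the merged tokens have the same extent.
    inside = [t for t in tokens if h_start <= t[0] and t[1] <= h_end]
    if inside and inside[0][0] == h_start and inside[-1][1] == h_end:
        if all(a[1] == b[0] for a, b in zip(inside, inside[1:])):
            return 1.0
    # Assumption: a hit that straddles word boundaries without lining up is
    # treated as "not completely contained" and scored 0.2.
    return 0.2


tokens = [(0, 2), (2, 4), (4, 6)]  # three two-character words
print(fuse((2, 4), tokens))  # 1.0  case (1): same span as a segmented word
print(fuse((2, 3), tokens))  # 0.2  case (2): shorter than the containing word
print(fuse((0, 4), tokens))  # 1.0  case (3): two adjacent words merge into the hit
print(fuse((1, 3), tokens))  # 0.2  straddles a boundary without lining up
```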
5. Dynamic management of sensitive lexicon
The user can customize a personalized sensitive word bank according to business requirements, which removes the limitation of direct matching to a fixed word bank and makes text auditing more flexible; add, delete, modify, and query operations on the custom word bank are all synchronized to the search tree to keep search matching efficient.
In summary, the content auditing method based on character-level and word-level fusion over a dynamic word bank makes auditing results more adaptable and completes text auditing simply and efficiently. It flexibly supports dynamic management of the sensitive word bank, which reduces the constraint the word bank places on text auditing results. It keeps the effectiveness of character-level auditing while adding the semantic information obtained from word segmentation, which lessens the influence of hard matching on the final result, improves the reliability of the auditing result, reduces manual operation and maintenance workload, and supports stable, high-performance operation of business applications. Because search matching is used, the text content does not need to be mapped into a semantic space, and the sensitive word bank is built into a common-prefix tree model with a multi-way tree structure, which shortens auditing response time.
Next, a content auditing apparatus based on character-level and word-level fusion proposed according to an embodiment of the present application will be described with reference to the drawings.
Fig. 3 is a block diagram of a content auditing apparatus based on character-level and word-level fusion according to an embodiment of the present application.
As shown in Fig. 3, the content auditing apparatus 10 based on character-level and word-level fusion includes: a matching module 100, a dividing module 200, and a fusion module 300.
The matching module 100 is configured to perform search matching on the text to be audited at character-level granularity to obtain a search matching result; the dividing module 200 is configured to divide the text to be audited at the semantic level by a word segmentation method to obtain a word-level text, and to compare the word-level text with the search matching result for sensitive words to obtain a comparison result; the fusion module 300 is configured to perform fusion based on the comparison result, calculate the confidence of the fused sensitive word, and obtain a content auditing result from the confidence of the sensitive word.
Further, the apparatus 10 of the embodiment of the present application further includes: a preprocessing module and a management module.
The preprocessing module is used for performing text preprocessing on the initial text to obtain the text to be audited that meets the auditing conditions; and the management module is used for acquiring the business requirements of the user, obtaining the corresponding personalized sensitive words according to the business requirements, and adding the personalized sensitive words to the sensitive word bank used for search matching.
Further, the matching module 100 is used to construct a search tree for search matching; and to extract the common prefixes of the character strings of the text to be audited and obtain the search matching result from the search tree based on the common prefixes.
Further, the fusion module 300 is configured so that: when the comparison shows that the sensitive word in the search matching result is identical to a segmented word of the word-level text and is hit, the confidence of the sensitive word is 1; when the comparison shows that the sensitive word in the search matching result is shorter than the segmented word of the word-level text and is hit, the confidence of the sensitive word is 0.2; and when the comparison shows that the segmented words of the word-level text are proper subsets of the sensitive word in the search matching result, and the adjacent segmented words, once merged, have the same length as the sensitive word, the confidence of the sensitive word is 1.
It should be noted that the foregoing explanation of the embodiment of the content auditing method based on character-level and word-level fusion is also applicable to the content auditing device based on character-level and word-level fusion in this embodiment, and details are not repeated here.
According to the content auditing device based on character-level and word-level fusion provided by the embodiment of the application, text content can be audited automatically based on character-level and word-level fusion over a dynamic word bank. This reduces the time and effort consumed by manual auditing, improves auditing efficiency, and lowers auditing cost; the fused result preserves matching accuracy while adding word-level semantic information, which makes the auditing result more reasonable and greatly reduces auditing errors. In addition, building the search tree on the sensitive word bank exploits the common-prefix property of the tree model, which greatly shortens the response time of text auditing and improves its effectiveness and efficiency.
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device may include:
a memory 401, a processor 402, and a computer program stored on the memory 401 and executable on the processor 402.
The processor 402, when executing the program, implements the character-level and word-level fusion-based content auditing method provided in the above-described embodiments.
Further, the electronic device further includes:
a communication interface 403 for communication between the memory 401 and the processor 402.
A memory 401 for storing computer programs executable on the processor 402.
The memory 401 may comprise high-speed RAM, and may also include non-volatile memory, such as at least one disk memory.
If the memory 401, the processor 402 and the communication interface 403 are implemented independently, the communication interface 403, the memory 401 and the processor 402 may be connected to each other through a bus and communicate with each other over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in Fig. 4, but this does not indicate that there is only one bus or one type of bus.
Optionally, in a specific implementation, if the memory 401, the processor 402, and the communication interface 403 are integrated on a chip, the memory 401, the processor 402, and the communication interface 403 may complete mutual communication through an internal interface.
The processor 402 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the content auditing method based on character-level and word-level fusion as described above.
In the description herein, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. In addition, various embodiments or examples, and features of different embodiments or examples, described in this specification can be combined by one skilled in the art provided they do not contradict each other.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method description in a flow chart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or N executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present application also includes alternate implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or N wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods in the above embodiments may be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A content auditing method based on character-level and word-level fusion is characterized by comprising the following steps:
searching and matching the text to be audited according to the granularity of the character level to obtain a search matching result;
performing semantic level division on the text to be audited by using a word segmentation method to obtain a divided word-level text, and performing sensitive word comparison on the divided word-level text and the search matching result to obtain a comparison result; and
fusing based on the comparison result, calculating the confidence of the fused sensitive words, and obtaining a content verification result according to the confidence of the sensitive words.
2. The method of claim 1, further comprising:
performing text preprocessing on the initial text to obtain the text to be audited meeting the auditing conditions.
3. The method of claim 1, wherein the searching and matching the text to be audited according to the granularity of the character level to obtain a search matching result comprises:
constructing a search tree for searching and matching;
and extracting a common prefix of the character string of the text to be audited, and obtaining the search matching result by utilizing the search tree based on the common prefix.
4. The method of claim 1, wherein the fusing based on the comparison result and calculating the confidence of the fused sensitive word comprise:
when the comparison results are the same and hit, the confidence of the sensitive word is 1;
when the comparison result is that the sensitive word of the search matching result is shorter than the segmented word of the word-level text, and the word-level text is hit, the confidence of the sensitive word is 0.2;
and when the comparison result is that the segmented word of the word-level text is a proper subset of the sensitive word of the search matching result, and the combined adjacent segmented words have the same length as the sensitive word, the confidence of the sensitive word is 1.
5. The method according to any one of claims 1-4, further comprising:
acquiring the service requirement of a user;
and obtaining corresponding individual sensitive words according to the service requirements, and adding the individual sensitive words into a sensitive word bank for searching and matching.
6. A content auditing device based on character-level and word-level fusion, comprising:
the matching module is used for searching and matching the text to be audited according to the granularity of the character level to obtain a search matching result;
the division module is used for performing semantic level division on the text to be audited by utilizing a word segmentation method to obtain a divided word-level text, and performing sensitive word comparison on the divided word-level text and the search matching result to obtain a comparison result; and
the fusion module is used for fusing based on the comparison result, calculating the confidence of the fused sensitive words and obtaining a content verification result according to the confidence of the sensitive words.
7. The apparatus of claim 6, further comprising:
the preprocessing module is used for performing text preprocessing on the initial text to obtain the text to be audited meeting the auditing conditions;
the management module is used for acquiring the service requirement of a user, obtaining the corresponding individual sensitive words according to the service requirement, and adding the individual sensitive words into a sensitive word bank for searching and matching.
8. The apparatus of claim 6,
the matching module is used for constructing a search tree for searching and matching; extracting a common prefix of the character string of the text to be audited, and obtaining the search matching result by utilizing the search tree based on the common prefix;
the fusion module is used for setting the confidence of the sensitive word to 1 when the comparison results are the same and hit; setting the confidence of the sensitive word to 0.2 when the comparison result is that the sensitive word of the search matching result is shorter than the segmented word of the word-level text, and the word-level text is hit; and setting the confidence of the sensitive word to 1 when the comparison result is that the segmented word of the word-level text is a proper subset of the sensitive word of the search matching result, and the combined adjacent segmented words have the same length as the sensitive word.
9. An electronic device, comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the character-level and word-level fusion based content auditing method according to any one of claims 1-5.
10. A computer-readable storage medium, on which a computer program is stored, the program being executed by a processor for implementing a method for content auditing based on character-level and word-level fusion according to any one of claims 1-5.
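The three confidence cases recited in claim 4 can be illustrated as follows. This is a minimal sketch under stated assumptions rather than the claimed implementation: the names `token_spans` and `fuse_confidence` are hypothetical, the segmented words are assumed to cover the text contiguously with no gaps, a character-level hit is given as a `(start, end)` span, and any case the claim does not address returns `None`.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # half-open character range [start, end)


def token_spans(tokens: List[str]) -> List[Span]:
    """Map segmented words to character spans, assuming they tile the text."""
    spans: List[Span] = []
    pos = 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def fuse_confidence(hit: Span, tokens: List[str]) -> Optional[float]:
    """Confidence of one character-level hit against the word-level text."""
    start, end = hit
    spans = token_spans(tokens)

    # Case 1: the hit coincides exactly with one segmented word.
    if (start, end) in spans:
        return 1.0

    # Case 2: the hit is shorter than, and lies inside, one segmented word.
    for s, e in spans:
        if s <= start and end <= e:
            return 0.2

    # Case 3: adjacent segmented words, combined, have exactly the hit's extent.
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    if start in starts and end in ends:
        return 1.0

    # Not covered by the three cases (e.g. the hit cuts through word boundaries).
    return None
```

For the placeholder segmentation `["ab", "cd", "efg"]`, a hit on span `(0, 2)` falls under case 1, a hit on `(4, 6)` lies inside "efg" and falls under case 2, and a hit on `(0, 4)` is formed by merging "ab" and "cd" and falls under case 3.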

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111599306.XA CN114329112A (en) 2021-12-24 2021-12-24 Content auditing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111599306.XA CN114329112A (en) 2021-12-24 2021-12-24 Content auditing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114329112A (en) 2022-04-12

Family

ID=81013603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111599306.XA Pending CN114329112A (en) 2021-12-24 2021-12-24 Content auditing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114329112A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115002508A (en) * 2022-06-07 2022-09-02 中国工商银行股份有限公司 Live data stream method and device, computer equipment and storage medium
CN115186657A (en) * 2022-07-28 2022-10-14 北京网景盛世技术开发中心 Error sensitive information detection method, device, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111460787B (en) Topic extraction method, topic extraction device, terminal equipment and storage medium
US11227118B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
CN110765770A (en) Automatic contract generation method and device
WO2022218186A1 (en) Method and apparatus for generating personalized knowledge graph, and computer device
US5784489A (en) Apparatus and method for syntactic signal analysis
CN108959559B (en) Question and answer pair generation method and device
WO2020259280A1 (en) Log management method and apparatus, network device and readable storage medium
CN114329112A (en) Content auditing method and device, electronic equipment and storage medium
CN111401065A (en) Entity identification method, device, equipment and storage medium
TWI844091B (en) Feature matching rule construction, feature matching method, device, equipment and medium
CN103678271A (en) Text correction method and user equipment
CN113961768A (en) Sensitive word detection method and device, computer equipment and storage medium
CN112733517B (en) Method for checking requirement template conformity, electronic equipment and storage medium
CN113434631A (en) Emotion analysis method and device based on event, computer equipment and storage medium
CN117216279A (en) Text extraction method, device and equipment of PDF (portable document format) file and storage medium
CN116756382A (en) Method, device, setting and storage medium for detecting sensitive character string
CN109902309B (en) Translation method, device, equipment and storage medium
CN109993190B (en) Ontology matching method and device and computer storage medium
WO2023087702A1 (en) Text recognition method for form certificate image file, and computing device
JP2004133565A (en) Postprocessing device for character recognition using internet
CN113255374B (en) Question and answer management method and system
CN113033193B (en) Mixed Chinese text word segmentation method based on C++ language
CN113779200A (en) Target industry word stock generation method, processor and device
CN111625579B (en) Information processing method, device and system
CN116306616B (en) Method and device for determining keywords of text

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination