CN111274782A

CN111274782A - Text auditing method and device, computer equipment and readable storage medium

Info

Publication number: CN111274782A
Application number: CN202010116229.7A
Authority: CN
Inventors: 张晶莹; 罗先贤
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2020-06-12
Anticipated expiration: 2040-02-25
Also published as: WO2021169208A1; CN111274782B

Abstract

The invention discloses a text auditing method, which comprises the following steps: receiving a text to be audited sent by a user terminal, and matching the text to be audited against text templates of a plurality of text types to determine the text type of the text to be audited; acquiring a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit fragments by using the classification model, and adding a corresponding theme label to each audit fragment; acquiring, according to the theme label of each audit fragment, the audit rule corresponding to each theme label from the rule base corresponding to the text type; and judging, according to the audit rule, whether risk element content exists in the corresponding audit fragment, and if so, sending the risk element content to the user terminal as a risk prompt. The invention can improve the accuracy and speed of text auditing.

Description

Text auditing method and device, computer equipment and readable storage medium

Technical Field

The invention relates to the technical field of internet, in particular to a text auditing method and device, computer equipment and a readable storage medium.

Background

With the continuous development of internet technology, more and more information is spread through the internet, and text is an important carrier of that information. A text may contain sensitive or harmful information, so to prevent sensitive information from leaking and harmful information from spreading, auditors need to manually audit texts for risk content. However, each text contains a large number of characters, its content is complicated and its expressions are varied, so manual auditing requires a large amount of labor, its efficiency is low, and its accuracy cannot be guaranteed. How to improve the efficiency and accuracy of text auditing has therefore become a technical problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to provide a text auditing method, a text auditing device, computer equipment and a readable storage medium, which can improve the accuracy and speed of auditing texts.

According to an aspect of the present invention, a text auditing method is provided, which specifically includes the following steps:

receiving a text to be audited sent by a user terminal, and matching the text to be audited with text templates of a plurality of text types to determine the text type of the text to be audited;

acquiring a classification model corresponding to the text type from a preset classification model library, dividing the text to be audited into a plurality of audit fragments by using the classification model, and adding a corresponding theme label to each audit fragment;

according to the theme label of each audit fragment, acquiring the audit rule corresponding to each theme label from the rule base corresponding to the text type;

and judging whether the risk element content exists in the corresponding audit segment or not according to the audit rule, if so, sending the risk element content to the user terminal for risk prompt.

Optionally, before the obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit fragments by using the classification model, and adding a corresponding theme tag to each audit fragment, the method further includes:

for one text type, acquiring a training sample set corresponding to the text type; wherein the training sample set comprises: a set number of historical texts, fragment information of each historical text, and a theme label of each fragment;

according to the topic labels contained in each historical text in the training sample set, determining the topic labels that are contained in every historical text as the necessary topic labels of the text type;

and training and learning a preset model according to the training sample set to obtain a classification model corresponding to the text type.

Optionally, the training and learning a preset model according to the training sample set to obtain a classification model corresponding to the text type specifically includes:

aiming at one topic label in the training sample set, acquiring a segment corresponding to the topic label in each historical text; performing word segmentation processing on each acquired segment, and extracting a noun of each segment; determining a set number of significant nouns for representing the topic tag from the nouns of all the segments, and calculating a significant coefficient of each significant noun to form a significant word set corresponding to the topic tag;

and converging the significant word sets of the topic labels in the training sample set to serve as the classification models corresponding to the text types.

Optionally, the splitting the text to be audited into a plurality of audit fragments by using the classification model, and adding a corresponding theme tag to each audit fragment, specifically including:

determining each title contained in the text to be audited, and splitting the text to be audited into a plurality of audit fragments according to the determined titles; wherein, each audit fragment comprises: a title portion and a body portion;

performing word segmentation processing on each audit fragment respectively, and extracting a noun of each audit fragment;

respectively determining target significant words from each significant word set aiming at one audit fragment, wherein the target significant words are nouns which appear in the significant word set and the audit fragment at the same time; calculating the sum of the significant coefficients of each significant word set according to the significant coefficients of the target significant words in each significant word set; and adding the theme label corresponding to the significant word set with the maximum significant coefficient sum to the audit fragment.

Optionally, the obtaining, according to the theme tag of each audit fragment, the audit rule corresponding to each theme tag from the rule base corresponding to the text type includes:

judging whether all the necessary subject labels of the text type are contained in all the subject labels of the text to be audited; if so, acquiring the auditing rule corresponding to each theme label from the rule base corresponding to the text type according to the theme label of each auditing segment; and if not, sending the information containing the missing necessary theme tags to the user terminal.

Optionally, the audit rule includes: auditing elements and auditing sub-rules, wherein one auditing element corresponds to one auditing sub-rule;

the method includes the steps of judging whether risk element content exists in corresponding audit fragments according to the audit rules, if so, sending the risk element content to the user terminal to prompt risks, and specifically includes the following steps:

according to each audit element in the audit rule, element content corresponding to each audit element is extracted from the audit fragment;

aiming at the element content of one auditing element, judging whether the element content meets an auditing sub-rule corresponding to the auditing element; and if not, sending the element content serving as risk element content to the user terminal.

Optionally, after determining whether risk element content exists in the corresponding audit segment according to the audit rule, if yes, sending the risk element content to the user terminal for risk prompt, the method further includes:

receiving audit result information sent by the user terminal, and judging whether the determined risk element content is correct or not according to the audit result information; if so, adding one to the accuracy value of the auditing rule corresponding to the risk element content; if not, subtracting one from the accuracy value of the auditing rule corresponding to the risk element content;

and sending the audit rule with the accuracy value smaller than the preset threshold value to the user terminal so that the user terminal can modify the audit rule.

According to another aspect of the present invention, there is also provided a text auditing apparatus, specifically including the following components:

the receiving module is used for receiving a text to be audited sent by a user terminal and matching the text to be audited with text templates of a plurality of text types to determine the text type of the text to be audited;

the splitting module is used for acquiring a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit fragments by using the classification model, and adding corresponding theme tags to each audit fragment;

the acquisition module is used for respectively acquiring the auditing rules corresponding to the subject labels from the rule base corresponding to the text types according to the subject label of each auditing segment;

and the judging module is used for judging whether the risk element content exists in the corresponding auditing segment or not according to the auditing rule, and if so, sending the risk element content to the user terminal so as to prompt the risk.

According to another aspect of the present invention, there is also provided a computer device, specifically including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the text auditing method when executing the program.

According to another aspect of the present invention, there is also provided a computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the above-mentioned text auditing method.

According to the text auditing method, the text auditing device, the computer equipment and the readable storage medium, the text to be audited is divided into a plurality of audit fragments, and corresponding audit rules are set for each audit fragment; and performing text audit on the corresponding audit segment according to each audit rule, so that risk check can be performed in a targeted manner, and the accuracy of the text audit is improved. In addition, in the invention, each audit fragment in the text to be audited can be audited in parallel, thereby improving the efficiency of auditing the text.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 is an optional flow diagram of the text auditing method according to the first embodiment;

Fig. 2 is a schematic diagram of optional program modules of the text auditing apparatus according to the second embodiment;

Fig. 3 is a schematic diagram of an optional hardware architecture of the computer device according to the third embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example one

The embodiment of the invention provides a text auditing method, which specifically comprises the following steps as shown in figure 1:

step S101: receiving a text to be audited sent by a user terminal, and matching the text to be audited with text templates of a plurality of text types to determine the text type of the text to be audited.

Preferably, the text in this embodiment may be a contract. A contract concerns the interests of a company or an individual, and in an actual business scenario the contract content needs to be audited in order to protect the rights and obligations of both contracting parties. Therefore, in step S101, when a contract to be audited is received, the contract structure of the contract to be audited is analyzed to determine the contract type of the contract to be audited.

In this embodiment, the contract to be audited is generated according to different types of contract templates, each type of contract template has a corresponding contract structure, and the type of the contract template used by the contract to be audited can be determined by analyzing the contract structure of the contract to be audited, so as to obtain the contract type of the contract to be audited.

Specifically, the contract types include: a purchase class, a sales class, a cooperation-intent class, and a confidentiality class.
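The embodiment does not state how the structure of a text is scored against each template. The following is a minimal sketch under the assumption that every template is represented by the set of its section titles and that the best match is the template with the highest Jaccard overlap. The template contents and the names `TEMPLATES`, `jaccard` and `detect_text_type` are hypothetical and do not come from the source.

```python
# Hypothetical template library: text type -> section titles of its template.
TEMPLATES = {
    "purchase": {"Contract subject", "Fees and payment", "Liability for breach"},
    "sales": {"Goods sold", "Price and delivery", "Liability for breach"},
}


def jaccard(a: set, b: set) -> float:
    """Overlap between two title sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def detect_text_type(titles: set, templates: dict = TEMPLATES) -> str:
    """Return the text type whose template structure best matches the titles."""
    return max(templates, key=lambda text_type: jaccard(titles, templates[text_type]))


# Titles found in a contract to be audited (illustrative values).
contract_titles = {"Contract subject", "Fees and payment", "Intellectual property"}
detected = detect_text_type(contract_titles)
print(detected)
```

Other similarity measures, such as comparing the order of titles, would fit the same interface; the choice of Jaccard overlap here is only for illustration.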

Step S102: and acquiring a classification model corresponding to the text type from a preset classification model library, dividing the text to be audited into a plurality of audit fragments by using the classification model, and adding a corresponding theme label to each audit fragment.

Specifically, before step S102, the method further includes:

Step A1: for one text type, acquiring a training sample set corresponding to the text type; wherein the training sample set comprises: a set number of historical texts, fragment information of each historical text, and a theme label of each fragment.

A contract typically includes multiple parts, each with a corresponding title and body. When contracts are audited manually, the auditor checks a contract part by part to determine whether each part complies with the applicable legal and other requirements. Following this auditing habit, each historical contract in the training sample set is divided into a plurality of segments according to its titles and body text, and a corresponding theme label is added to each segment according to its content.

For example, a contract of the purchase class is divided into segments with the following theme labels: rights and obligations of both parties, fees and payment, liability for breach and limitation of liability, third-party rights guarantees, independence and severability, amendment and termination of the agreement, contract subject matter and product/service standards, intellectual property, contract effectiveness and expiration, and best-effort treatment.

Step A2: according to the topic labels contained in each historical text in the training sample set, determining the topic labels that are contained in every historical text as the necessary topic labels of the text type.
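Step A2 can be read as a set intersection: a topic label is necessary for a text type when it occurs in every historical text of that type. The sketch below illustrates that reading; the function name `necessary_topic_labels` and the sample labels are hypothetical.

```python
def necessary_topic_labels(label_lists):
    """Return the topic labels present in every historical text."""
    if not label_lists:
        return set()
    return set.intersection(*(set(labels) for labels in label_lists))


# Hypothetical training data: the topic labels of each historical text.
historical_labels = [
    ["fees and payment", "intellectual property", "liability for breach"],
    ["fees and payment", "liability for breach"],
    ["fees and payment", "liability for breach", "third-party rights"],
]

necessary = necessary_topic_labels(historical_labels)
print(sorted(necessary))
```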

Step A3: and training and learning a preset model according to the training sample set to obtain a classification model corresponding to the text type.

Further, the training and learning a preset model according to the training sample set to obtain a classification model corresponding to the text type specifically includes:

step A31: aiming at one topic label in the training sample set, acquiring a segment corresponding to the topic label in each historical text;

step A32: performing word segmentation processing on each acquired segment, and extracting a noun of each segment;

step A33: determining a set number of significant nouns for representing the topic tag from the nouns of all the segments, and calculating a significant coefficient of each significant noun to form a significant word set corresponding to the topic tag;

step A34: and converging the significant word sets of the topic labels in the training sample set to serve as the classification models corresponding to the text types.

It should be noted that each significant noun in the significant word set has a corresponding significant coefficient; the larger the significant coefficient of a significant noun, the better that noun represents the corresponding topic label.

Preferably, in practical applications, in step A33, the nouns are sorted in descending order of their occurrence probability in the segments, a set number of top-ranked nouns are taken as the significant nouns, and the significant coefficient of each significant noun is calculated from its occurrence probability.
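Steps A31 to A34 can be sketched as follows. The sketch assumes that word segmentation and noun extraction have already been performed, so each segment arrives as a list of nouns, and it assumes that the significant coefficient equals the fraction of segments in which the noun occurs. The source only states that the coefficient is calculated from the occurrence probability, so this formula and the names `build_significant_word_set` and `build_classification_model` are illustrative assumptions.

```python
from collections import Counter


def build_significant_word_set(segment_nouns, top_n):
    """Build {noun: coefficient} for one topic label (steps A31 to A33).

    segment_nouns holds one list of nouns per historical segment with the label.
    The coefficient is assumed to be the share of segments containing the noun.
    """
    if not segment_nouns:
        return {}
    doc_freq = Counter()
    for nouns in segment_nouns:
        doc_freq.update(set(nouns))  # count each noun once per segment
    total = len(segment_nouns)
    # Sort by descending frequency, then alphabetically for a stable result.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return {noun: count / total for noun, count in ranked[:top_n]}


def build_classification_model(samples, top_n=3):
    """Aggregate the significant word sets of all topic labels (step A34)."""
    return {label: build_significant_word_set(segments, top_n)
            for label, segments in samples.items()}


# Hypothetical, already segmented training data for one text type.
samples = {
    "fees and payment": [
        ["fee", "invoice", "tax"],
        ["fee", "payment", "invoice"],
        ["fee", "tax", "deadline"],
        ["fee", "invoice"],
    ],
    "intellectual property": [
        ["patent", "license"],
        ["patent", "copyright"],
    ],
}
model = build_classification_model(samples, top_n=2)
print(model)
```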

In addition, in practical application, the preset model can also adopt a naive Bayes classification model, and the naive Bayes classification model is trained and learned according to the training sample set to obtain a classification model corresponding to the text type.

Further, step S102 includes:

step B1: determining each title contained in the text to be audited, and splitting the text to be audited into a plurality of audit fragments according to the determined titles; wherein, each audit fragment comprises: a title portion and a body portion;

step B2: performing word segmentation processing on each audit fragment respectively, and extracting a noun of each audit fragment;

step B3: respectively determining target significant words from each significant word set aiming at one audit fragment, wherein the target significant words are nouns which appear in the significant word set and the audit fragment at the same time; calculating the sum of the significant coefficients of each significant word set according to the significant coefficients of the target significant words in each significant word set; and adding the theme label corresponding to the significant word set with the maximum significant coefficient sum to the audit fragment.
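Step B3 can be illustrated with a short scoring routine. The sketch assumes the classification model is a mapping from topic label to a significant word set of the form `{noun: coefficient}` and that the nouns of the audit fragment have already been extracted in step B2. The name `label_fragment` and the sample data are hypothetical.

```python
def label_fragment(fragment_nouns, model):
    """Return the best topic label and the coefficient sum of every label.

    For each significant word set, the target significant words are the nouns
    that occur both in the set and in the fragment; their coefficients are summed.
    """
    nouns = set(fragment_nouns)
    scores = {
        label: sum(coef for noun, coef in word_set.items() if noun in nouns)
        for label, word_set in model.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical classification model and noun list of one audit fragment.
model = {
    "fees and payment": {"fee": 1.0, "invoice": 0.75, "tax": 0.5},
    "intellectual property": {"patent": 1.0, "copyright": 0.5},
}
label, scores = label_fragment(["fee", "tax", "party", "patent"], model)
print(label, scores)
```

The source does not say how ties are resolved; in this sketch a tie goes to the label that appears first in the model, which is an arbitrary choice.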

Step S103: and respectively acquiring the auditing rule corresponding to each theme label from the rule base corresponding to the text type according to the theme label of each auditing segment.

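Specifically, step S103 includes:

judging whether all the necessary theme labels of the text type are contained in the theme labels of the text to be audited; if so, acquiring the audit rule corresponding to each theme label from the rule base corresponding to the text type according to the theme label of each audit fragment; and if not, sending information containing the missing necessary theme labels to the user terminal.

In this embodiment, the integrity of the contract to be audited is audited first: whether the contract lacks necessary content is determined from the subject labels it contains, and a reminder is issued when a necessary subject label is missing.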
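The integrity check amounts to a set difference between the necessary labels of the text type and the labels actually found in the text. A minimal sketch follows; the function name, the sample labels and the wording of the reminder are hypothetical.

```python
def missing_necessary_labels(fragment_labels, necessary_labels):
    """Return the necessary topic labels absent from the text, sorted for display."""
    return sorted(set(necessary_labels) - set(fragment_labels))


# Hypothetical necessary labels of a text type and labels found in one contract.
necessary = {"fees and payment", "liability for breach", "intellectual property"}
found = ["fees and payment", "intellectual property", "force majeure"]

missing = missing_necessary_labels(found, necessary)
# An empty reminder means the contract is complete and rule lookup can proceed.
reminder = "Missing necessary sections: " + ", ".join(missing) if missing else ""
print(reminder)
```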

In this embodiment, corresponding rule bases are set for different types of contracts in advance; the rule base comprises audit rules corresponding to different subject labels, namely, each audit segment in the contract to be audited has a corresponding audit rule, and risk check is performed in a targeted manner according to the audit rule of each audit segment, so that the contract audit accuracy is improved.

Specifically, the audit rule includes: auditing elements and auditing sub-rules, wherein one auditing element corresponds to one auditing sub-rule; the audit element is the minimum audit unit of the text audit, and the audit sub-rule is a judgment rule used for performing risk audit on the audit element.

For example, when the contract type is the purchase class and the subject label of the audit fragment is fees and payment, the audit elements of the corresponding audit rule include: payment deadline, billing period, fee, and tax. For the audit element "fee", the audit sub-rule is: judge whether the content includes both an amount and the unit of the amount; if not, there is a risk.
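One possible representation of such a rule base is sketched below: each audit rule maps audit elements to sub-rules, and each sub-rule is a predicate over the element content. The class `AuditRule`, the regular expression and the list of currency units are assumptions made for illustration; the source does not prescribe a data structure or an amount format.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class AuditRule:
    """Audit rule of one topic label: audit element -> audit sub-rule."""
    topic_label: str
    sub_rules: Dict[str, Callable[[str], bool]]


# Assumed amount format: digits followed by one of a few currency units.
AMOUNT_WITH_UNIT = re.compile(r"\d[\d,]*(?:\.\d+)?\s*(?:yuan|RMB|USD)", re.IGNORECASE)


def fee_has_amount_and_unit(content: str) -> bool:
    """Sub-rule for the 'fee' element: an amount and its unit must both appear."""
    return AMOUNT_WITH_UNIT.search(content) is not None


# Hypothetical rule base of the purchase class, keyed by topic label.
purchase_rule_base = {
    "fees and payment": AuditRule(
        topic_label="fees and payment",
        sub_rules={"fee": fee_has_amount_and_unit},
    ),
}

rule = purchase_rule_base["fees and payment"]
ok = rule.sub_rules["fee"]("The total fee is 12,000 yuan.")
risky = not rule.sub_rules["fee"]("The total fee is twelve thousand.")
print(ok, risky)
```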

Step S104: and judging whether the risk element content exists in the corresponding audit segment or not according to the audit rule, if so, sending the risk element content to the user terminal for risk prompt.

Specifically, step S104 includes:

step C1: according to each audit element in the audit rule, element content corresponding to each audit element is extracted from the audit fragment;

step C2: aiming at the element content of one auditing element, judging whether the element content meets an auditing sub-rule corresponding to the auditing element; and if not, sending the element content serving as risk element content to the user terminal.

Further, the determining whether the element content meets an audit sub-rule corresponding to the audit element includes:

judging whether the element content contains preset keywords or not; or,

judging whether the element content is consistent with preset content or not; or,

and judging whether the capitalized form (in words) and the lowercase form (in figures) of the currency or amount contained in the element content are consistent.
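The three kinds of sub-rule check can be sketched as simple predicates. Several points are assumptions rather than statements from the source: the keyword check requires all preset keywords to be present, the preset-content check ignores surrounding whitespace, and the amount check compares an amount in Arabic figures with an integer amount in Chinese capital numerals (units 拾, 佰, 仟 and 万 only). All function names are hypothetical.

```python
import re

CAPITAL_DIGITS = {"零": 0, "壹": 1, "贰": 2, "叁": 3, "肆": 4,
                  "伍": 5, "陆": 6, "柒": 7, "捌": 8, "玖": 9}
CAPITAL_UNITS = {"拾": 10, "佰": 100, "仟": 1000}


def contains_keywords(content, keywords):
    """True when every preset keyword occurs in the element content."""
    return all(keyword in content for keyword in keywords)


def matches_preset(content, preset):
    """True when the element content equals the preset content."""
    return content.strip() == preset.strip()


def parse_capital_amount(text):
    """Convert an integer written in Chinese capital numerals to an int."""
    total = section = digit = 0
    for ch in text:
        if ch in CAPITAL_DIGITS:
            digit = CAPITAL_DIGITS[ch]
        elif ch in CAPITAL_UNITS:
            section += (digit or 1) * CAPITAL_UNITS[ch]
            digit = 0
        elif ch == "万":
            total += (section + digit) * 10000
            section = digit = 0
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return total + section + digit


def amounts_consistent(content):
    """True when the first figure amount equals the first capital-numeral amount."""
    figures = re.search(r"\d[\d,]*", content)
    capital = re.search(r"[零壹贰叁肆伍陆柒捌玖拾佰仟万]+", content)
    if figures is None or capital is None:
        return False
    return int(figures.group().replace(",", "")) == parse_capital_amount(capital.group())


print(amounts_consistent("RMB 12,000 (壹万贰仟元整)"))
print(amounts_consistent("RMB 12,000 (壹万叁仟元整)"))
```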

In this embodiment, the contract to be audited is divided into a plurality of audit fragments, and each audit fragment in the contract to be audited can be audited in parallel, so that the efficiency of auditing the contract is improved; in addition, the corresponding auditing rule is set for each auditing segment, so that contract auditing can be performed in a targeted manner, and the accuracy is higher.
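Because each audit fragment is checked only against its own audit rule, the fragments can be processed independently. The sketch below uses a thread pool from the Python standard library to illustrate this; it assumes the element contents have already been extracted (step C1) and that the rule base maps a topic label to `{audit element: sub-rule}`. The names `audit_fragment` and `audit_text` and the sample rules are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def audit_fragment(fragment, rule_base):
    """Return (element, content) pairs that fail their audit sub-rule (step C2)."""
    sub_rules = rule_base.get(fragment["label"], {})
    risks = []
    for element, check in sub_rules.items():
        content = fragment["elements"].get(element, "")
        if not check(content):
            risks.append((element, content))
    return risks


def audit_text(fragments, rule_base, max_workers=4):
    """Audit all fragments concurrently; results keep the fragment order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda f: audit_fragment(f, rule_base), fragments))


# Hypothetical rule base and fragments with already extracted element contents.
rule_base = {
    "fees and payment": {"fee": lambda c: any(ch.isdigit() for ch in c)},
    "liability for breach": {"penalty": lambda c: "penalty" in c.lower()},
}
fragments = [
    {"label": "fees and payment", "elements": {"fee": "to be agreed later"}},
    {"label": "liability for breach", "elements": {"penalty": "A penalty of 5% applies."}},
]
results = audit_text(fragments, rule_base)
print(results)
```

For CPU-heavy sub-rules a process pool may be a better fit than threads, in which case the sub-rules would need to be module-level functions rather than lambdas so that they can be pickled.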

Further, after step S104, the method further includes:

step D1: receiving audit result information sent by the user terminal, judging whether the determined risk element content is correct or not according to the audit result information, if so, adding one to the accuracy value of the audit rule corresponding to the risk element content, and if not, subtracting one from the accuracy value of the audit rule corresponding to the risk element content;

In this embodiment, an accuracy value is set for each audit rule, and all audit rules are initialized with the same accuracy value. After the risk element content is sent to the user terminal, the user reviews and corrects it on the basis of professional knowledge and feeds back audit result information, and the accuracy value of each audit rule is adjusted according to that information.

Step D2: sending the audit rule with the accuracy value smaller than a preset threshold value to the user terminal so that the user terminal can modify the audit rule;

In this embodiment, the audit result information is used to revise the audit rules continuously, so that the audit rules keep improving.
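Steps D1 and D2 describe a simple feedback counter per audit rule. A minimal sketch follows; the class name `RuleAccuracyTracker`, the initial value of 0 and the threshold of -2 are assumptions, since the source only requires a common initial value and a preset threshold.

```python
class RuleAccuracyTracker:
    """Keep one accuracy value per audit rule and flag rules that need revision."""

    def __init__(self, rule_ids, initial=0, threshold=-2):
        # All rules start from the same accuracy value.
        self.values = {rule_id: initial for rule_id in rule_ids}
        self.threshold = threshold

    def record_feedback(self, rule_id, risk_was_correct):
        """Step D1: add one for a confirmed risk, subtract one otherwise."""
        self.values[rule_id] += 1 if risk_was_correct else -1

    def rules_to_revise(self):
        """Step D2: rules whose accuracy value is below the threshold."""
        return sorted(r for r, v in self.values.items() if v < self.threshold)


tracker = RuleAccuracyTracker(["fee", "tax"])
for _ in range(3):
    tracker.record_feedback("fee", risk_was_correct=False)
tracker.record_feedback("tax", risk_was_correct=True)
print(tracker.values, tracker.rules_to_revise())
```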

Example two

An embodiment of the present invention provides a text auditing apparatus, which specifically includes, as shown in fig. 2:

the receiving module 201 is configured to receive a text to be audited sent by a user terminal, and perform text structure matching between the text to be audited and text templates of multiple text types to determine the text type of the text to be audited;

the splitting module 202 is configured to obtain a classification model corresponding to the text type from a preset classification model library, split the text to be audited into multiple audit fragments by using the classification model, and add a corresponding theme tag to each audit fragment;

the obtaining module 203 is configured to obtain, according to the theme tag of each audit fragment, an audit rule corresponding to each theme tag from the rule base corresponding to the text type;

the judging module 204 is configured to judge whether risk element content exists in the corresponding audit segment according to the audit rule, and if yes, send the risk element content to the user terminal for risk prompt.

Specifically, the apparatus further comprises:

the training module is used for, before the classification model corresponding to the text type is acquired from the preset classification model library, the text to be audited is split into a plurality of audit fragments by using the classification model, and a corresponding theme label is added to each audit fragment: acquiring, for one text type, a training sample set corresponding to the text type; wherein the training sample set comprises: a set number of historical texts, fragment information of each historical text, and a theme label of each fragment; according to the topic labels contained in each historical text in the training sample set, determining the topic labels that are contained in every historical text as the necessary topic labels of the text type; and training a preset model according to the training sample set to obtain a classification model corresponding to the text type.

Further, the training module specifically includes, when implementing the function of training and learning a preset model according to the training sample set to obtain a classification model corresponding to the text type:

aiming at one topic label in the training sample set, acquiring a segment corresponding to the topic label in each historical text; performing word segmentation processing on each acquired segment, and extracting a noun of each segment; determining a set number of significant nouns for representing the topic tag from the nouns of all the segments, and calculating a significant coefficient of each significant noun to form a significant word set corresponding to the topic tag; and converging the significant word sets of the topic labels in the training sample set to serve as the classification models corresponding to the text types.

In addition, the splitting module 202 is specifically configured to:

determining each title contained in the text to be audited, and splitting the text to be audited into a plurality of audit fragments according to the determined titles; wherein, each audit fragment comprises: a title portion and a body portion; performing word segmentation processing on each audit fragment respectively, and extracting a noun of each audit fragment; respectively determining target significant words from each significant word set aiming at one audit fragment, wherein the target significant words are nouns which appear in the significant word set and the audit fragment at the same time; calculating the sum of the significant coefficients of each significant word set according to the significant coefficients of the target significant words in each significant word set; and adding the theme label corresponding to the significant word set with the maximum significant coefficient sum to the audit fragment.

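The obtaining module 203 is specifically configured to:

judge whether all the necessary theme tags of the text type are contained in the theme tags of the text to be audited; if so, obtain the audit rule corresponding to each theme tag from the rule base corresponding to the text type according to the theme tag of each audit fragment; and if not, send information containing the missing necessary theme tags to the user terminal.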

Further, the audit rule includes: auditing elements and auditing sub-rules, wherein one auditing element corresponds to one auditing sub-rule;

in addition, the determining module 204 is specifically configured to:

according to each audit element in the audit rule, element content corresponding to each audit element is extracted from the audit fragment; aiming at the element content of one auditing element, judging whether the element content meets an auditing sub-rule corresponding to the auditing element; and if not, sending the element content serving as risk element content to the user terminal.

Still further, the apparatus further comprises:

the correction module is used for, after whether risk element content exists in the corresponding audit segment has been judged according to the audit rule and, if so, the risk element content has been sent to the user terminal for risk prompt: receiving audit result information sent by the user terminal, and judging whether the determined risk element content is correct according to the audit result information; if so, adding one to the accuracy value of the audit rule corresponding to the risk element content; if not, subtracting one from the accuracy value of the audit rule corresponding to the risk element content; and sending any audit rule whose accuracy value is smaller than the preset threshold value to the user terminal so that the user terminal can modify the audit rule.

Example three

This embodiment also provides a computer device capable of executing programs, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including an independent server or a server cluster composed of a plurality of servers). As shown in fig. 3, the computer device 30 of this embodiment includes at least, but is not limited to, a memory 301 and a processor 302 that are communicatively coupled to each other via a system bus. It is noted that fig. 3 only shows the computer device 30 having components 301 and 302, but it is understood that not all of the shown components are required and that more or fewer components may be implemented instead.

In this embodiment, the memory 301 (i.e., the readable storage medium) includes a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 301 may be an internal storage unit of the computer device 30, such as a hard disk or a memory of the computer device 30. In other embodiments, the memory 301 may also be an external storage device of the computer device 30, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 30. Of course, the memory 301 may also include both internal and external storage devices of the computer device 30. In this embodiment, the memory 301 is generally used for storing an operating system installed in the computer device 30 and various application software, such as the program code of the text auditing apparatus according to the second embodiment. In addition, the memory 301 may also be used to temporarily store various types of data that have been output or are to be output.

Processor 302 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 302 generally serves to control the overall operation of the computer device 30.

Specifically, in this embodiment, the processor 302 is configured to execute a program of the text auditing method stored in the memory 301, and when executed, the program of the text auditing method implements the following steps:

The specific embodiment process of the above method steps can be referred to in the first embodiment, and the detailed description of this embodiment is not repeated here.

Example four

The present embodiments also provide a computer readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application mall, etc., having stored thereon a computer program that when executed by a processor implements the method steps of:

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner.

The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A text auditing method, characterized in that the method comprises:

2. The text auditing method of claim 1, before the obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into multiple audit fragments using the classification model, and adding a corresponding topic tag for each audit fragment, the method further comprising:

3. The text auditing method according to claim 2, wherein the training and learning of a preset model according to the training sample set to obtain a classification model corresponding to the text type specifically comprises:

4. The text auditing method according to claim 3, wherein the splitting of the text to be audited into multiple audit fragments using the classification model and the addition of a corresponding topic tag for each audit fragment specifically comprises:

5. The text auditing method according to claim 2, wherein the obtaining, according to the theme tag of each audit fragment, the audit rule corresponding to each theme tag from the rule base corresponding to the text type includes:

6. A text auditing method according to claim 1, where the auditing rules include: auditing elements and auditing sub-rules, wherein one auditing element corresponds to one auditing sub-rule;

7. The text auditing method according to claim 1, where after said determining, according to the auditing rules, whether there is risk element content in the corresponding audit segment, and if so, sending the risk element content to the user terminal for risk prompt, the method further comprises:

8. A text auditing apparatus, characterized in that the apparatus comprises:

9. A computer device, the computer device comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 7 are implemented when the processor executes the program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.