CN111274782B - Text auditing method and device, computer equipment and readable storage medium

Info

Publication number: CN111274782B
Application number: CN202010116229.7A
Authority: CN
Inventors: 张晶莹; 罗先贤
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2023-10-20
Anticipated expiration: 2040-02-25
Also published as: WO2021169208A1; CN111274782A

Abstract

The invention discloses a text auditing method, which comprises the following steps: receiving a text to be audited sent by a user terminal, and performing text-structure matching between the text to be audited and the text templates of a plurality of text types to determine the text type of the text to be audited; obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit segments by using the classification model, and adding a corresponding topic label to each audit segment; obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from a rule base corresponding to the text type; and judging, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, sending the risk element content to the user terminal as a risk prompt. The invention can improve the accuracy and speed of text auditing.

Description

Text auditing method and device, computer equipment and readable storage medium

Technical Field

The invention relates to the technical field of the Internet, and in particular to a text auditing method, a text auditing device, computer equipment and a readable storage medium.

Background

With the continuous development of Internet technology, more and more information is transmitted over the Internet, and text is one important carrier of that information. Because a text may contain sensitive or harmful information, auditors are required to manually audit the text for risk content in order to prevent sensitive information from being leaked and harmful information from being spread. However, because each text contains many characters, its content is complex, and its wording varies, manual auditing incurs high labor costs, is inefficient, and cannot guarantee accuracy. How to improve the efficiency and accuracy of text auditing has therefore become a technical problem to be solved.

Disclosure of Invention

The invention aims to provide a text auditing method, a text auditing device, computer equipment and a readable storage medium, which can improve the accuracy and speed of auditing texts.

According to one aspect of the invention, there is provided a text auditing method, comprising the steps of:

receiving a text to be audited sent by a user terminal, and performing text-structure matching between the text to be audited and the text templates of a plurality of text types to determine the text type of the text to be audited;

obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit segments by using the classification model, and adding a corresponding topic label to each audit segment;

obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from a rule base corresponding to the text type; and

judging, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, sending the risk element content to the user terminal as a risk prompt.

Optionally, before the obtaining of a classification model corresponding to the text type from a preset classification model library, the splitting of the text to be audited into a plurality of audit segments by using the classification model, and the adding of a corresponding topic label to each audit segment, the method further includes:

for one text type, acquiring a training sample set corresponding to the text type, wherein the training sample set comprises a set number of historical texts, the segment information of each historical text, and the topic label of each segment;

determining, according to the topic labels contained in each historical text in the training sample set, the topic labels that are contained in all of the historical texts as the required topic labels of the text type; and

training a preset model on the training sample set to obtain the classification model corresponding to the text type.

Optionally, training the preset model on the training sample set to obtain the classification model corresponding to the text type specifically includes:

for one topic label in the training sample set, obtaining the segments corresponding to that topic label in each historical text; performing word segmentation on each obtained segment and extracting the nouns of each segment; determining, from the nouns of all the segments, a set number of salient nouns that represent the topic label, and calculating a salient coefficient for each salient noun, to form a salient word set corresponding to the topic label; and

aggregating the salient word sets of all topic labels in the training sample set as the classification model corresponding to the text type.

Optionally, splitting the text to be audited into a plurality of audit segments by using the classification model and adding a corresponding topic label to each audit segment specifically includes:

determining each title contained in the text to be audited, and splitting the text to be audited into a plurality of audit segments according to the determined titles, wherein each audit segment includes a title portion and a body portion;

performing word segmentation on each audit segment and extracting the nouns of each audit segment; and

for one audit segment, determining target salient words from each salient word set, a target salient word being a noun that appears in both the salient word set and the audit segment; calculating, for each salient word set, the sum of the salient coefficients of its target salient words; and adding to the audit segment the topic label corresponding to the salient word set with the largest sum of salient coefficients.

Optionally, obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from the rule base corresponding to the text type specifically includes:

judging whether the topic labels of the text to be audited include all of the required topic labels of the text type; if so, obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from the rule base corresponding to the text type; if not, sending information identifying the missing required topic labels to the user terminal.

Optionally, the audit rule includes audit elements and audit sub-rules, each audit element corresponding to one audit sub-rule;

judging, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, sending the risk element content to the user terminal as a risk prompt specifically includes:

extracting, according to each audit element in the audit rule, the element content corresponding to that audit element from the audit segment; and

for the element content of one audit element, judging whether the element content satisfies the audit sub-rule corresponding to that audit element and, if not, sending the element content to the user terminal as risk element content.

Optionally, after judging, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, sending the risk element content to the user terminal as a risk prompt, the method further includes:

receiving audit result information sent by the user terminal, and judging, according to the audit result information, whether the determined risk element content is correct; if so, adding one to the accuracy value of the audit rule corresponding to the risk element content; if not, subtracting one from the accuracy value of the audit rule corresponding to the risk element content; and

sending any audit rule whose accuracy value is smaller than a preset threshold to the user terminal, so that the audit rule can be modified at the user terminal.

According to another aspect of the present invention, there is also provided a text auditing device, which specifically includes the following components:

a receiving module, configured to receive a text to be audited sent by a user terminal, and to perform text-structure matching between the text to be audited and the text templates of a plurality of text types to determine the text type of the text to be audited;

a splitting module, configured to obtain a classification model corresponding to the text type from a preset classification model library, split the text to be audited into a plurality of audit segments by using the classification model, and add a corresponding topic label to each audit segment;

an obtaining module, configured to obtain, according to the topic label of each audit segment, the audit rule corresponding to each topic label from the rule base corresponding to the text type; and

a judging module, configured to judge, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, to send the risk element content to the user terminal as a risk prompt.

According to another aspect of the present invention, there is also provided a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the text auditing method described above when executing the program.

According to another aspect of the present invention, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the text auditing method described above.

With the text auditing method, the text auditing device, the computer equipment and the readable storage medium provided by the invention, the text to be audited is split into a plurality of audit segments, and a corresponding audit rule is set for each audit segment. Each audit segment is audited with its own audit rule, so that risk checking is targeted and the accuracy of text auditing is improved. In addition, the audit segments of the text to be audited can be audited in parallel, which improves the efficiency of text auditing.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 is a schematic flow chart of an alternative text auditing method according to the first embodiment;

FIG. 2 is a schematic diagram of alternative program modules of the text auditing device according to the second embodiment;

FIG. 3 is a schematic diagram of an alternative hardware architecture of a computer device according to the third embodiment.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Example 1

The embodiment of the invention provides a text auditing method which, as shown in FIG. 1, specifically comprises the following steps:

Step S101: receiving a text to be audited sent by a user terminal, and performing text-structure matching between the text to be audited and the text templates of a plurality of text types to determine the text type of the text to be audited.

Preferably, the text in this embodiment may be a contract. A contract concerns the interests of a company or an individual, and in actual business scenarios the contract content needs to be audited in order to safeguard the rights and obligations of both parties. Accordingly, in step S101, when a contract to be audited is received, its contract type is determined by analyzing its contract structure.

In this embodiment, each contract to be audited is generated from one of several types of contract templates, and each type of contract template has a corresponding contract structure. By analyzing the contract structure of the contract to be audited, the type of contract template it uses can be determined, which gives the contract type of the contract to be audited.

Specifically, the contract types include: purchase, sales, intended cooperation, and confidentiality.
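The text does not specify how the structure matching is computed. As a minimal sketch, assume each template is represented by its list of section titles and that the best match is the template with the highest Jaccard overlap of titles; the template contents, the `detect_text_type` name, and the `min_score` cutoff are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical templates: each text type is described by the section titles
# of its template. These titles are illustrative, not taken from the source.
TEMPLATES: Dict[str, List[str]] = {
    "purchase": ["Rights and obligations", "Fees and payment", "Intellectual property"],
    "sales": ["Rights and obligations", "Delivery", "Fees and payment"],
    "confidentiality": ["Confidential information", "Term", "Liability for breach"],
}


def jaccard(a: List[str], b: List[str]) -> float:
    """Overlap of two title lists, ignoring case and order."""
    sa = {t.lower() for t in a}
    sb = {t.lower() for t in b}
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def detect_text_type(titles: List[str], min_score: float = 0.5) -> Optional[str]:
    """Return the best-matching text type, or None if no template matches well."""
    best_type: Optional[str] = None
    best_score = 0.0
    for text_type, template_titles in TEMPLATES.items():
        score = jaccard(titles, template_titles)
        if score > best_score:
            best_type, best_score = text_type, score
    return best_type if best_score >= min_score else None
```

A stricter matcher could also compare the order or nesting of titles; set overlap is used here only because it is the simplest measure that reflects "the same structure".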

Step S102: obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit segments by using the classification model, and adding a corresponding topic label to each audit segment.

Specifically, before step S102, the method further includes:

Step A1: for one text type, acquiring a training sample set corresponding to the text type, wherein the training sample set comprises a set number of historical texts, the segment information of each historical text, and the topic label of each segment.

A contract typically includes a plurality of parts, each with a corresponding title and body. When auditing a contract manually, auditors review it part by part to confirm whether each part complies with the relevant legal provisions. Following this auditing habit, each historical contract in the training sample set is divided into a plurality of segments according to its titles and body text, and a corresponding topic label is added to each segment according to its content.

For example, a purchase-type contract is divided into the following segments: rights and obligations of both parties; fees and payment; liability for breach and limitation of liability; third-party rights warranties; independence and severability; amendment and termination of the agreement; contractual and product/service standards; intellectual property; contract effectiveness and term; and best-at treatments.

Step A2: determining, according to the topic labels contained in each historical text in the training sample set, the topic labels that are contained in all of the historical texts as the required topic labels of the text type.
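Step A2 amounts to a set intersection over the label sets of the historical texts. A minimal sketch follows; the function name and the representation of each historical text as a list of label strings are assumptions made for illustration.

```python
from typing import Iterable, List, Set


def required_topic_labels(history_labels: Iterable[Iterable[str]]) -> Set[str]:
    """Return the topic labels that appear in every historical text."""
    label_sets: List[Set[str]] = [set(labels) for labels in history_labels]
    if not label_sets:
        # No historical texts: nothing can be declared required.
        return set()
    return set.intersection(*label_sets)
```

For example, if three historical contracts carry the labels {fees, ip, term}, {fees, term} and {term, fees, breach}, only "fees" and "term" are required.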

Step A3: training a preset model on the training sample set to obtain the classification model corresponding to the text type.

Further, training the preset model on the training sample set to obtain the classification model corresponding to the text type specifically includes:

Step A31: for one topic label in the training sample set, obtaining the segments corresponding to that topic label in each historical text;

Step A32: performing word segmentation on each obtained segment and extracting the nouns of each segment;

Step A33: determining, from the nouns of all the segments, a set number of salient nouns that represent the topic label, and calculating a salient coefficient for each salient noun, to form a salient word set corresponding to the topic label;

Step A34: aggregating the salient word sets of all topic labels in the training sample set as the classification model corresponding to the text type.

It should be noted that each salient noun in a salient word set has a corresponding salient coefficient; the larger the salient coefficient of a salient noun, the better that noun represents the corresponding topic label.

Preferably, in practical applications, in step A33 the nouns are sorted in descending order of their occurrence probability in the segments, the first preset number of nouns are taken as the salient nouns, and the salient coefficient of each salient noun is calculated from its occurrence probability.
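Steps A31 to A34 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: word segmentation and noun extraction are assumed to have been done already (each segment arrives as a list of nouns), the occurrence probability is taken to be the noun's relative frequency across all segments of the label, and the salient coefficient is simply set equal to that probability. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def build_salient_word_set(segment_nouns: List[List[str]], top_n: int) -> Dict[str, float]:
    """Map the top_n most frequent nouns of one topic label to a coefficient.

    Assumption: coefficient = relative frequency of the noun among all nouns
    of the segments carrying this label.
    """
    counts = Counter(noun for nouns in segment_nouns for noun in nouns)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {noun: count / total for noun, count in counts.most_common(top_n)}


def build_classification_model(
    samples: Dict[str, List[List[str]]], top_n: int = 20
) -> Dict[str, Dict[str, float]]:
    """Aggregate the salient word sets of all topic labels (step A34).

    `samples` maps a topic label to the noun lists of its historical segments.
    """
    return {
        label: build_salient_word_set(segments, top_n)
        for label, segments in samples.items()
    }
```

The resulting model is just a dictionary of dictionaries, which keeps the later labeling step a matter of summing coefficients.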

In addition, in practical applications, the preset model may instead be a naive Bayes classification model, which is trained on the training sample set to obtain the classification model corresponding to the text type.
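The text names naive Bayes but gives no details. The sketch below assumes a multinomial naive Bayes over the extracted nouns with add-one (Laplace) smoothing; the class name, the input format (a list of noun-list and label pairs), and the handling of unseen words are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


class NaiveBayesTopicClassifier:
    """Multinomial naive Bayes over noun lists, with add-one smoothing."""

    def fit(self, samples: List[Tuple[List[str], str]]) -> "NaiveBayesTopicClassifier":
        self.label_counts: Counter = Counter()
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        self.vocab: Set[str] = set()
        for nouns, label in samples:
            self.label_counts[label] += 1
            self.word_counts[label].update(nouns)
            self.vocab.update(nouns)
        return self

    def predict(self, nouns: List[str]) -> str:
        if not self.label_counts:
            raise ValueError("classifier has not been fitted")
        total_segments = sum(self.label_counts.values())
        best_label, best_score = "", -math.inf
        for label, segment_count in self.label_counts.items():
            # Log prior of the label plus smoothed log likelihood of each noun.
            score = math.log(segment_count / total_segments)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for noun in nouns:
                if noun in self.vocab:  # nouns unseen in training are skipped
                    score += math.log((self.word_counts[label][noun] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Compared with the salient-word-set model, this variant uses every training noun rather than only the top few, at the cost of a less interpretable model.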

Further, step S102 includes:

Step B1: determining each title contained in the text to be audited, and splitting the text to be audited into a plurality of audit segments according to the determined titles, wherein each audit segment includes a title portion and a body portion;

Step B2: performing word segmentation on each audit segment and extracting the nouns of each audit segment;

Step B3: for one audit segment, determining target salient words from each salient word set, a target salient word being a noun that appears in both the salient word set and the audit segment; calculating, for each salient word set, the sum of the salient coefficients of its target salient words; and adding to the audit segment the topic label corresponding to the salient word set with the largest sum of salient coefficients.
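Step B3 can be sketched as a sum over the intersection of the segment's nouns with each salient word set. The function name and the model layout (topic label mapped to a dictionary of noun and coefficient) are assumptions; returning `None` when no salient word matches is also an assumption, since the text does not say what happens in that case.

```python
from typing import Dict, Iterable, Optional


def assign_topic_label(
    segment_nouns: Iterable[str], model: Dict[str, Dict[str, float]]
) -> Optional[str]:
    """Pick the topic label whose salient word set scores highest for the segment."""
    nouns = set(segment_nouns)
    best_label: Optional[str] = None
    best_sum = 0.0
    for label, salient_words in model.items():
        # Target salient words: nouns present in both the set and the segment.
        coefficient_sum = sum(
            coef for word, coef in salient_words.items() if word in nouns
        )
        if coefficient_sum > best_sum:
            best_label, best_sum = label, coefficient_sum
    return best_label
```

With a model of {"fees and payment": {"fee": 0.6, "tax": 0.2}, "intellectual property": {"patent": 0.5, "fee": 0.1}}, a segment containing "fee" and "tax" scores 0.8 for the first label and 0.1 for the second, so it is labeled "fees and payment".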

Step S103: obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from the rule base corresponding to the text type.

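Specifically, step S103 includes:

judging whether the topic labels of the text to be audited include all of the required topic labels of the text type; if so, obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from the rule base corresponding to the text type; if not, sending information identifying the missing required topic labels to the user terminal.

In this embodiment, the completeness of the contract to be audited is checked first: whether the contract lacks required content is determined from the topic labels it contains, and a reminder is issued when a required topic label is missing.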

In this embodiment, corresponding rule bases are set for different types of contracts in advance, respectively; and the rule library comprises auditing rules corresponding to different topic labels, namely, each auditing segment in the contract to be audited has corresponding auditing rules, and risk inspection is carried out in a targeted manner through the auditing rules of each auditing segment, so that the accuracy of contract auditing is improved.
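The completeness check and the rule lookup of step S103 can be sketched together. The nested-dictionary layout of the rule base (text type, then topic label, then a list of rule identifiers) and the `fetch_rules` name are hypothetical; the source only states that each text type has its own rule base keyed by topic label.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: text type -> topic label -> list of rule identifiers.
RuleBase = Dict[str, Dict[str, List[str]]]


def fetch_rules(
    text_type: str,
    segment_labels: List[str],
    required_labels: Iterable[str],
    rule_base: RuleBase,
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Return (rules per label, missing required labels).

    When required labels are missing, no rules are returned, so the caller
    can send a reminder to the user terminal instead of auditing.
    """
    missing = sorted(set(required_labels) - set(segment_labels))
    if missing:
        return {}, missing
    rules_for_type = rule_base.get(text_type, {})
    rules = {label: rules_for_type.get(label, []) for label in segment_labels}
    return rules, []
```

Returning the missing labels alongside the rules keeps the two outcomes of the check (proceed or remind) in a single call.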

Specifically, the auditing rule includes: an audit element and audit sub-rules, and one audit element corresponds to one audit sub-rule; the auditing element is a minimum auditing unit for text auditing, and the auditing sub rule is a judging rule for risk auditing of the auditing element.

For example, when the contract type is purchase and the topic label of the audit segment is "fees and payment", the audit elements of the corresponding audit rule include: payment period, accounting period, fee, and tax. For the audit element "fee", the audit sub-rule is: judge whether the element content contains both an amount and a unit for the amount; if not, a risk exists.
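The rule structure in this example can be sketched as a small data model. The `AuditElement` class, the `RULE_BASE` layout, and the regular expression (a number followed by one of a few currency units) are assumptions chosen for illustration; a real deployment would need patterns suited to the contract language.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AuditElement:
    name: str                        # minimum audit unit, e.g. "fee"
    sub_rule: Callable[[str], bool]  # returns True when the content passes


# Assumed pattern: digits (with optional thousands separators and decimals)
# followed by a currency unit. The unit list is illustrative only.
AMOUNT_WITH_UNIT = re.compile(
    r"\d[\d,]*(?:\.\d+)?\s*(?:yuan|RMB|USD|dollars)", re.IGNORECASE
)


def fee_has_amount_and_unit(content: str) -> bool:
    """Sub-rule for the "fee" element: an amount and its unit must both appear."""
    return AMOUNT_WITH_UNIT.search(content) is not None


# Hypothetical rule base: text type -> topic label -> audit elements.
RULE_BASE: Dict[str, Dict[str, List[AuditElement]]] = {
    "purchase": {
        "fees and payment": [AuditElement("fee", fee_has_amount_and_unit)],
    }
}
```

Pairing each element with a callable mirrors the one-to-one correspondence between audit elements and audit sub-rules described above.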

Step S104: judging, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, sending the risk element content to the user terminal as a risk prompt.

Specifically, step S104 includes:

Step C1: extracting, according to each audit element in the audit rule, the element content corresponding to that audit element from the audit segment;

Step C2: for the element content of one audit element, judging whether the element content satisfies the audit sub-rule corresponding to that audit element and, if not, sending the element content to the user terminal as risk element content.

Further, judging whether the element content satisfies the audit sub-rule corresponding to the audit element includes:

judging whether the element content contains preset keywords; or

judging whether the element content is consistent with preset content; or

judging whether the currency or the amount of money contained in the element content is consistent.
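The three kinds of sub-rule check can be sketched as simple predicates. Several choices here are assumptions rather than statements from the text: keyword matching requires all keywords and ignores case, "consistent with preset content" is read as equality after whitespace and case normalization, and only currency consistency (not amount consistency) is shown, using a small illustrative list of currency codes.

```python
import re
from typing import List


def contains_keywords(content: str, keywords: List[str]) -> bool:
    """True when every preset keyword occurs in the content (case-insensitive)."""
    lowered = content.lower()
    return all(keyword.lower() in lowered for keyword in keywords)


def matches_preset(content: str, preset: str) -> bool:
    """True when content equals the preset text, ignoring case and extra spaces."""
    def normalize(text: str) -> str:
        return " ".join(text.split()).lower()
    return normalize(content) == normalize(preset)


# Illustrative currency codes only; a real rule would cover more notations.
CURRENCY = re.compile(r"\b(RMB|USD|EUR)\b")


def currencies_consistent(content: str) -> bool:
    """True when at most one distinct currency code appears in the content."""
    return len(set(CURRENCY.findall(content))) <= 1
```

Each predicate returns `True` when the content passes, so element content is treated as risk element content whenever its predicate returns `False`.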

In this embodiment, the contract to be audited is split into a plurality of audit segments, and the audit segments can be audited in parallel, which improves the efficiency of contract auditing. In addition, a corresponding audit rule is set for each audit segment, so that contract auditing is targeted and more accurate.
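Because each audit segment is checked only against the rules for its own topic label, the segments are independent and can be dispatched to a worker pool. The sketch below uses a thread pool from the Python standard library; the segment representation, the rule callables, and the choice to report the whole segment text as the risk content are simplifying assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Segment = Tuple[str, str]                 # (topic label, segment text)
Rules = Dict[str, List[Callable[[str], bool]]]  # label -> sub-rule predicates


def audit_segment(segment: Segment, rules: Rules) -> List[str]:
    """Return the segment text as risk content if any of its rules fails."""
    label, text = segment
    failed = any(not rule(text) for rule in rules.get(label, []))
    return [text] if failed else []


def audit_in_parallel(
    segments: List[Segment], rules: Rules, max_workers: int = 4
) -> List[str]:
    """Audit all segments concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda seg: audit_segment(seg, rules), segments)
        return [risk for per_segment in results for risk in per_segment]
```

For CPU-bound rule checks a process pool may be a better fit than threads; the structure of the code stays the same.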

Still further, after step S104, the method further includes:

Step D1: receiving audit result information sent by the user terminal, and judging, according to the audit result information, whether the determined risk element content is correct; if so, adding one to the accuracy value of the audit rule corresponding to the risk element content; if not, subtracting one from the accuracy value of the audit rule corresponding to the risk element content.

In this embodiment, an accuracy value is set for each audit rule, and all audit rules start from the same initial accuracy value. After the risk element content is sent to the user terminal, the user manually corrects it based on professional knowledge and feeds back the audit result information; the accuracy value of each audit rule is then adjusted according to that information.

Step D2: sending any audit rule whose accuracy value is smaller than a preset threshold to the user terminal, so that the audit rule can be modified at the user terminal.

In this embodiment, the audit rules are continuously revised by using the audit result information, so that they are continuously improved.
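Steps D1 and D2 can be sketched as a small counter per rule. The class name, the initial value of 0, and the default threshold are hypothetical; the text only requires that all rules start from the same value and that rules falling below a preset threshold are sent back for revision.

```python
from typing import Dict, Iterable, List


class RuleAccuracyTracker:
    """Tracks an accuracy value per audit rule from user feedback."""

    def __init__(self, rule_ids: Iterable[str], initial: int = 0, threshold: int = -3):
        # All rules start from the same accuracy value.
        self.values: Dict[str, int] = {rule_id: initial for rule_id in rule_ids}
        self.threshold = threshold

    def record_feedback(self, rule_id: str, flagged_correctly: bool) -> None:
        """Step D1: add one for a confirmed risk, subtract one for a false alarm."""
        self.values[rule_id] += 1 if flagged_correctly else -1

    def rules_to_revise(self) -> List[str]:
        """Step D2: rules whose accuracy value is below the threshold."""
        return sorted(r for r, v in self.values.items() if v < self.threshold)
```

For instance, with a threshold of -1, a "fee" rule that receives two false-alarm reports drops to -2 and is returned by `rules_to_revise()`, while a "tax" rule confirmed once stays at 1 and is not.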

Example 2

The embodiment of the invention provides a text auditing device which, as shown in FIG. 2, specifically comprises the following components:

a receiving module 201, configured to receive a text to be audited sent by a user terminal, and to perform text-structure matching between the text to be audited and the text templates of a plurality of text types to determine the text type of the text to be audited;

a splitting module 202, configured to obtain a classification model corresponding to the text type from a preset classification model library, split the text to be audited into a plurality of audit segments by using the classification model, and add a corresponding topic label to each audit segment;

an obtaining module 203, configured to obtain, according to the topic label of each audit segment, the audit rule corresponding to each topic label from a rule base corresponding to the text type; and

a judging module 204, configured to judge, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, to send the risk element content to the user terminal as a risk prompt.

Specifically, the device further comprises:

a training module, configured to, before the classification model corresponding to the text type is obtained from the preset classification model library, the text to be audited is split into a plurality of audit segments by using the classification model, and a corresponding topic label is added to each audit segment: acquire, for one text type, a training sample set corresponding to the text type, wherein the training sample set comprises a set number of historical texts, the segment information of each historical text, and the topic label of each segment; determine, according to the topic labels contained in each historical text in the training sample set, the topic labels that are contained in all of the historical texts as the required topic labels of the text type; and train a preset model on the training sample set to obtain the classification model corresponding to the text type.

Further, when training the preset model on the training sample set to obtain the classification model corresponding to the text type, the training module is specifically configured to:

for one topic label in the training sample set, obtain the segments corresponding to that topic label in each historical text; perform word segmentation on each obtained segment and extract the nouns of each segment; determine, from the nouns of all the segments, a set number of salient nouns that represent the topic label, and calculate a salient coefficient for each salient noun, to form a salient word set corresponding to the topic label; and aggregate the salient word sets of all topic labels in the training sample set as the classification model corresponding to the text type.

In addition, the splitting module 202 is specifically configured to:

determine each title contained in the text to be audited, and split the text to be audited into a plurality of audit segments according to the determined titles, wherein each audit segment includes a title portion and a body portion; perform word segmentation on each audit segment and extract the nouns of each audit segment; for one audit segment, determine target salient words from each salient word set, a target salient word being a noun that appears in both the salient word set and the audit segment; calculate, for each salient word set, the sum of the salient coefficients of its target salient words; and add to the audit segment the topic label corresponding to the salient word set with the largest sum of salient coefficients.

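The obtaining module 203 is specifically configured to:

judge whether the topic labels of the text to be audited include all of the required topic labels of the text type; if so, obtain, according to the topic label of each audit segment, the audit rule corresponding to each topic label from the rule base corresponding to the text type; if not, send information identifying the missing required topic labels to the user terminal.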

Further, the audit rule includes audit elements and audit sub-rules, each audit element corresponding to one audit sub-rule.

In addition, the judging module 204 is specifically configured to:

extract, according to each audit element in the audit rule, the element content corresponding to that audit element from the audit segment; and, for the element content of one audit element, judge whether the element content satisfies the audit sub-rule corresponding to that audit element and, if not, send the element content to the user terminal as risk element content.

Still further, the apparatus further comprises:

a correction module, configured to, after it has been judged according to the audit rule whether risk element content exists in the corresponding audit segment and any risk element content has been sent to the user terminal as a risk prompt: receive audit result information sent by the user terminal, and judge, according to the audit result information, whether the determined risk element content is correct; if so, add one to the accuracy value of the audit rule corresponding to the risk element content; if not, subtract one from the accuracy value of the audit rule corresponding to the risk element content; and send any audit rule whose accuracy value is smaller than a preset threshold to the user terminal, so that the audit rule can be modified at the user terminal.

Example 3

The present embodiment also provides a computer device capable of executing a program, such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (which may be an independent server or a server cluster formed by a plurality of servers). As shown in FIG. 3, the computer device 30 of the present embodiment includes at least, but is not limited to, a memory 301 and a processor 302, which may be communicatively connected to each other via a system bus. It is noted that FIG. 3 only shows a computer device 30 having components 301-302, but it should be understood that not all of the illustrated components are required to be implemented, and that more or fewer components may alternatively be implemented.

In this embodiment, the memory 301 (i.e., readable storage medium) includes flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments, the memory 301 may be an internal storage unit of the computer device 30, such as a hard disk or memory of the computer device 30. In other embodiments, the memory 301 may also be an external storage device of the computer device 30, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the computer device 30. Of course, the memory 301 may also include both internal storage units of the computer device 30 and external storage devices. In this embodiment, the memory 301 is typically used to store an operating system and various types of application software installed on the computer device 30, such as program codes of the text auditing apparatus of the second embodiment. In addition, the memory 301 can also be used to temporarily store various types of data that have been output or are to be output.

The processor 302 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 302 is generally used to control the overall operation of the computer device 30.

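Specifically, in the present embodiment, the processor 302 is configured to execute a program of the text auditing method stored in the memory 301, and the program, when executed, implements the following steps:

receiving a text to be audited sent by a user terminal, and performing text-structure matching between the text to be audited and the text templates of a plurality of text types to determine the text type of the text to be audited;

obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit segments by using the classification model, and adding a corresponding topic label to each audit segment;

obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from a rule base corresponding to the text type; and

judging, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, sending the risk element content to the user terminal as a risk prompt.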

The specific embodiment of the above method steps may refer to the first embodiment, and this embodiment is not repeated here.

Example 4

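The present embodiment also provides a computer readable storage medium, such as a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., having stored thereon a computer program that, when executed by a processor, performs the following method steps:

receiving a text to be audited sent by a user terminal, and performing text-structure matching between the text to be audited and the text templates of a plurality of text types to determine the text type of the text to be audited;

obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be audited into a plurality of audit segments by using the classification model, and adding a corresponding topic label to each audit segment;

obtaining, according to the topic label of each audit segment, the audit rule corresponding to each topic label from a rule base corresponding to the text type; and

judging, according to the audit rule, whether risk element content exists in the corresponding audit segment and, if so, sending the risk element content to the user terminal as a risk prompt.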

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The foregoing embodiment numbers of the present invention are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.

From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by means of hardware, although in many cases the former is the preferred implementation.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the scope of the invention. Any equivalent structure or equivalent process derived from the contents of this specification and the drawings, whether applied directly or indirectly in other related technical fields, likewise falls within the scope of protection of the present invention.

Claims

1. A text auditing method, the method comprising:

judging whether risk element contents exist in the corresponding audit fragments according to the audit rules, if so, sending the risk element contents to the user terminal to carry out risk prompt;

the method comprises the steps of obtaining a classification model corresponding to the text type from a preset classification model library, splitting the text to be checked into a plurality of checking fragments by using the classification model, and adding a corresponding theme label for each checking fragment, wherein the method comprises the following steps:

aggregating the salient word sets of all topic labels in the training sample set as the classification model corresponding to the text type;

2. The text auditing method according to claim 1, wherein the obtaining, according to the topic label of each auditing segment, auditing rules corresponding to the topic labels from a rule base corresponding to the text type includes:

3. The text auditing method of claim 1, wherein the auditing rules include: an audit element and audit sub-rules, and one audit element corresponds to one audit sub-rule;

4. The text auditing method according to claim 1, wherein after the judging whether there is risk element content in the corresponding auditing segment according to the auditing rule, if so, the risk element content is sent to the user terminal to perform risk prompting, the method further includes:

5. A text auditing device, the device comprising:

the judging module is used for judging whether the risk element content exists in the corresponding auditing fragment according to the auditing rule, if so, the risk element content is sent to the user terminal to carry out risk prompt;

the splitting module is further configured to:

6. A computer device, the computer device comprising: memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 4 when the program is executed by the processor.

7. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 4.