CN116186263A

CN116186263A - Document detection method, device, computer equipment and computer readable storage medium

Info

Publication number: CN116186263A
Application number: CN202310189001.4A
Authority: CN
Inventors: 罗欢; 颜杰; 邬雨可; 余相君
Original assignee: Shanghai Himalaya Technology Co ltd
Current assignee: Shanghai Himalaya Technology Co ltd
Priority date: 2023-03-01
Filing date: 2023-03-01
Publication date: 2023-05-30

Abstract

The invention provides a document detection method, a device, a computer device and a computer readable storage medium, comprising: acquiring a document to be detected, and carrying out segmentation processing on the document to be detected to obtain a plurality of text segments; inputting a plurality of text segments into a pre-trained document classification model to obtain the probability that each text segment is a copyright violation document, the predicted value of a number vector and the probability of the predicted value; determining a copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document; if the copyright violation evaluation value is smaller than or equal to a predetermined auditing threshold value, determining that the document to be detected is a non-copyright violation document; otherwise, based on the predicted value of the number vector and the probability of the predicted value, determining the similar document corresponding to each text segment, and calculating the segment overlapping degree of the text segment and the similar document.

Description

Document detection method, device, computer equipment and computer readable storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a document detection method, a document detection device, a computer device, and a computer readable storage medium.

Background

With the continuous development of network technology, people are increasingly used to using the internet to share and acquire resources, for example, users can upload recorded sounds to some audio sharing platforms so that other users can listen to the recorded sounds, and infringement risks can occur if the sounds shared by the users are copyrighted contents. Thus, it is necessary to identify whether or not the sound uploaded by the user has infringement.

The existing approach is similarity retrieval: speech recognition is first performed on the audio to obtain a document to be checked, and that document is then matched one by one against a pre-established copyright document library to determine whether it is a copyrighted document. Because the number of comparisons is large, recognition efficiency is low. Moreover, a pre-established copyright document library cannot cover all copyrighted resources, so missed judgments and misjudgments occur and recognition accuracy is low.

Disclosure of Invention

One object of the present invention is to provide a document detection method, apparatus, computer device and computer readable storage medium, so as to improve the accuracy of document copyright detection results. Embodiments of the present invention may be implemented as follows:

In a first aspect, the present invention provides a document detection method, the method comprising: acquiring a document to be detected, and carrying out segmentation processing on the document to be detected to obtain a plurality of text segments; inputting the plurality of text segments into a pre-trained document classification model to obtain the probability that each text segment is a copyright violation document, the predicted value of a number vector and the probability of the predicted value; determining a copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document; if the copyright violation evaluation value is smaller than or equal to a predetermined audit threshold, determining that the document to be detected is a non-copyright violation document; otherwise, determining similar documents corresponding to each text segment based on the predicted value of the number vector and the probability of the predicted value, and calculating the segment overlapping degree of the text segment and the similar documents, wherein the audit threshold is determined by the document classification model; and determining whether the document to be detected is a non-copyright violation document based on the probability that each text segment is a copyright violation document and the segment overlapping degree.

In a second aspect, the present invention provides a document detection apparatus comprising: the device comprises an acquisition module, a prediction module, a determination module and a detection module; the acquisition module is used for acquiring a document to be detected, and carrying out segmentation processing on the document to be detected to obtain a plurality of text segments; the prediction module is used for inputting a plurality of text segments into a pre-trained document classification model to obtain the probability that each text segment is a copyright violation document, the predicted value of a number vector and the probability of the predicted value; the determining module is used for determining a copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document; the detection module is used for determining that the document to be detected is a non-copyright-violating document if the copyright violation evaluation value is smaller than or equal to a predetermined auditing threshold value; otherwise, determining similar documents corresponding to each text segment based on the predicted value of the number vector and the probability of the predicted value, and calculating the segment overlapping degree of the text segment and the similar documents; wherein the audit threshold is determined by the document classification model; and the detection module is used for determining whether the document to be detected is a non-copyright-violating document based on the probability that each text segment is a copyright-violating document and the segment overlapping degree.

In a third aspect, the present invention provides a computer device comprising a processor and a memory, the memory storing a computer program executable by the processor, and the processor executing the computer program to implement the document detection method of the first aspect.

In a fourth aspect, the present invention provides a readable storage medium having stored thereon a computer program which, when executed by a processor, implements the document detection method as described in the first aspect.

In the document detection method, the device, the computer equipment and the computer readable storage medium provided by the invention, the acquired document to be detected is first segmented to obtain a plurality of text segments. The document classification model provided by the embodiment of the invention then predicts the probability that each text segment is a copyright violation document, and these probabilities are used to determine a copyright violation evaluation value for the document to be detected. The document classification model is also used to determine an audit threshold, and comparing the copyright violation evaluation value with the audit threshold determines whether the document to be detected may be a copyright violation document. In addition, the document classification model predicts the number vector and probability corresponding to each text segment. After the document to be detected is determined to be a possible copyright violation document, secondary detection can be performed: similar documents and the segment overlapping degree between the text segments and those similar documents are determined based on the number vector and the probability, and the document is re-examined in combination with the segment overlapping degree, which improves recognition accuracy.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 shows a schematic flow chart of a training method of a document classification model provided by an embodiment of the invention;

FIG. 2 illustrates a schematic flow chart of determining an audit threshold using a document classification model provided by an embodiment of the present invention;

FIG. 3 shows a schematic flow chart of a document detection method provided by an embodiment of the invention;

FIG. 4 shows a schematic flow chart of step S35 provided by an embodiment of the present invention;

FIG. 5 shows a block diagram of a document detection apparatus provided by an embodiment of the present invention;

FIG. 6 shows a block diagram of a computer device provided by an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.

In the description of the present invention, it should be noted that, if the terms "upper", "lower", "inner", "outer", and the like indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, or the azimuth or the positional relationship in which the inventive product is conventionally put in use, it is merely for convenience of describing the present invention and simplifying the description, and it is not indicated or implied that the apparatus or element referred to must have a specific azimuth, be configured and operated in a specific azimuth, and thus it should not be construed as limiting the present invention.

Furthermore, the terms "first," "second," and the like, if any, are used merely for distinguishing between descriptions and not for indicating or implying a relative importance.

It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.

Users can often share resources through a platform, such as sharing their own recorded sound, video, documents, etc., and infringement risks can occur if the resources shared by the users are copyrighted resources. Thus, it is highly necessary to identify whether the resource uploaded by the user has infringement.

The existing recognition mode is a similar retrieval technology, namely, for some resources which can be converted into documents to carry out copyright recognition, the resources are firstly converted into documents to be checked, for example, audio resources can be firstly subjected to voice recognition to obtain the documents to be checked, and then the documents to be checked are subjected to one-to-one matching in a pre-established copyright document library, so that whether the documents belong to the copyright documents is determined.

This approach is inefficient because of the large number of comparisons. In addition, because the copyright document library is built in advance, it cannot cover all copyrighted resources, so missed judgments and misjudgments may occur and recognition accuracy is low.

To solve these technical problems, the invention provides a document detection method that improves the efficiency and accuracy of copyright resource identification. The method first obtains a document classification model through the model training method provided by the embodiment of the invention. The document classification model can predict the probability that each text segment of a document to be detected is a copyright violation document, and these probabilities are used to determine a copyright violation evaluation value for the document. The document classification model is also used to determine an audit threshold, and comparing the copyright violation evaluation value with the audit threshold determines whether the document to be detected may be a copyright violation document. Further, the document classification model can predict the number vector and probability corresponding to each text segment. After the document to be detected is determined to be a possible copyright violation document, secondary detection is performed: similar documents and the segment overlapping degree between the text segments and those similar documents are determined based on the number vector and the probability, and the document is re-examined in combination with the segment overlapping degree, which improves recognition accuracy.

First, the document classification model is obtained. Referring to FIG. 1, FIG. 1 is a schematic flowchart of a training method of the document classification model according to the embodiment of the present invention, which may include the following steps:

S11: Take a plurality of copyrighted documents and a plurality of non-copyrighted documents as training documents, configure a first numerical label for the copyrighted documents, and configure a second numerical label for the non-copyrighted documents.

In the embodiment of the invention, the copyrighted documents are collected manually or by a crawler, and the non-copyrighted documents may be documents that have already been audited and found non-infringing. A document may be uploaded directly by a user or converted from audio or video, for example audio and video transcripts, readings and the like.

The numerical label is used to distinguish a copyrighted document from a non-copyrighted document. For example, the first numerical label is 1 and the second numerical label is 0, which is not limited herein.

S12: Number each training document, and convert each document number into a vector based on a preset number to obtain the number vector of each training document.

The number vector is a training target of the model in the embodiment of the invention. Because there may be tens of thousands, hundreds of thousands or even more documents, a model that directly uses single-label multi-class classification over the document numbers cannot run. Therefore, a unique number vector is used as the target instead of the unique number itself: each position of the vector is a number, and the positions together form a globally unique number.

The conversion of a number into a vector may be performed as follows: divide the number by the preset number, take the quotient as the first position of the vector and the remainder as the second position. For example, the number vector may have 2 positions, each taking a value in the range 0 to 999, that is, each position counts in base 1000. When the number of a document is 254123, its unique number vector after conversion is [254, 123].

The first value of the number vector indicates the document number range corresponding to the number vector. As the conversion above shows, numbers 0 to 999 yield a first value of 0, and numbers 1000 to 1999 yield a first value of 1. Therefore, once the number vector of a document has been predicted, the corresponding document number range can be determined from the first value, and any similar-document retrieval can be carried out directly within that range, which is much more efficient than the retrieval of the prior art. For example, if the first value of the predicted number vector of a document to be detected is 1, the corresponding document number range is 1000 to 1999, and similar-document detection is performed directly within that range.
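The conversion described above can be illustrated with a minimal sketch. The function names `number_to_vector` and `number_range` and the default base of 1000 are assumptions chosen for illustration and do not appear in the source; the sketch only covers the two-position case given in the example.

```python
def number_to_vector(doc_number: int, base: int = 1000) -> list[int]:
    """Split a document number into [quotient, remainder] with respect to `base`."""
    if not 0 <= doc_number < base * base:
        raise ValueError("document number does not fit in a 2-position vector")
    quotient, remainder = divmod(doc_number, base)
    return [quotient, remainder]


def number_range(first_value: int, base: int = 1000) -> tuple[int, int]:
    """Inclusive range of document numbers that share the same first value."""
    start = first_value * base
    return start, start + base - 1


print(number_to_vector(254123))  # [254, 123]
print(number_range(1))           # (1000, 1999)
```

With a predicted first value of 1, a retrieval step would only need to consider document numbers 1000 to 1999.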

S13: Split each training document into a plurality of training text segments, encode each value of the number vector of the training document to which the training text segments belong, and take the encoded number vector as the number label of the training text segments.

In the embodiment of the invention, each training document is split by paragraph or chapter to obtain a plurality of training text segments. Each training text segment inherits the numerical label and the number of its training document. For example, if the training document is a copyrighted document, all of its training text segments have the first numerical label; if the training document is a non-copyrighted document, all of its training text segments have the second numerical label.

In an alternative embodiment, when splitting by chapter or paragraph, a fixed length constraint may also be applied. For example, the document may be split into runs of complete sentences whose text length is just over 300 words: sentences are added one by one until the length first exceeds 300, at which point the run is closed as one segment. The value 300 comes from practical auditing experience, in which duplication beyond this length may constitute a violation. Other values and other splitting manners may also be used, which is not limited herein.
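The length-constrained splitting can be sketched as follows. This is only an illustration under several assumptions not stated in the source: sentences are detected by a simple punctuation rule, length is counted in characters, whitespace between sentences is dropped, and a trailing run shorter than the limit is kept as its own segment. The name `split_into_segments` is hypothetical.

```python
import re


def split_into_segments(text: str, min_length: int = 300) -> list[str]:
    """Accumulate whole sentences until the run is just over `min_length` characters."""
    # Assumed sentence boundary: split right after Chinese or Western end punctuation.
    sentences = [s for s in re.split(r"(?<=[。！？.!?])\s*", text) if s]
    segments: list[str] = []
    current = ""
    for sentence in sentences:
        current += sentence
        if len(current) > min_length:
            segments.append(current)  # close the run as one segment
            current = ""
    if current:
        segments.append(current)  # assumed handling of the leftover tail
    return segments
```

For example, `split_into_segments("abcde. " * 5, min_length=10)` closes a segment after every second six-character sentence and keeps the last sentence as a short tail.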

For each training text segment, the embodiment of the application further encodes the corresponding number vector using one-hot encoding and uses the encoded number vector as the number label of the training text segment. Each training text segment thus has a number label y1 and a numerical label y2, where y2 is the first numerical label or the second numerical label.

For example, each position of the number vector has at most 1000 possible values. A number vector [254,0] is converted into [ [0 … 1,0 … 0], [0,0 … 0] ]: the encoding of the first position has 1000 entries, of which the 254th is 1 and the rest are 0. The second position is encoded in the same way, which is not repeated here.
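A minimal sketch of the one-hot step is given below. It assumes 0-based indexing, so that a value v sets entry v of a length-1000 row; the source example does not pin down the indexing convention, and the name `one_hot_number_vector` is hypothetical.

```python
def one_hot_number_vector(vector: list[int], base: int = 1000) -> list[list[int]]:
    """One-hot encode each position of a number vector into a row of `base` entries."""
    encoded = []
    for value in vector:
        if not 0 <= value < base:
            raise ValueError(f"value {value} is outside 0..{base - 1}")
        row = [0] * base
        row[value] = 1  # assumption: value v maps to index v (0-based)
        encoded.append(row)
    return encoded


label = one_hot_number_vector([254, 0])
print(len(label), len(label[0]))  # 2 1000
```

Each row contains exactly one 1, so a two-position number vector becomes a 2 x 1000 number label.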

S14: Input the plurality of training text segments corresponding to each training document, together with the number label and the first or second numerical label of each training text segment, into a multi-output classification model for training until a training end condition is reached, to obtain the document classification model.

In the embodiment of the invention, the loss function for model training is established based on the differences between the actual and predicted values of y1 and y2. The training end condition may be that the number of iterations reaches a preset number, that the loss function converges, or the like, which is not limited herein.

After the trained document classification model is obtained, it is first used to determine the audit threshold (which may also be understood as the audit-exemption threshold) of the embodiment of the present invention. Referring to FIG. 2, FIG. 2 is a schematic flowchart of determining the audit threshold using the document classification model provided in the embodiment of the present invention:

S21: Take the acquired copyright violation documents and non-copyright violation documents as original documents to form an original document data set.

In the embodiment of the invention, the copyright violation documents and non-copyright violation documents are document data that have undergone copyright review in an actual service scenario. For example, real audio is collected and transcribed by speech recognition, and the resulting documents are manually reviewed to determine whether each is a copyright violation document or a non-copyright violation document. These documents share the same collection scenario as the copyrighted and non-copyrighted documents collected for model training but are collected at a different time. This prevents the document data used for determining the audit threshold from having been leaked to the document classification model, which would make the model's predictions on them unreliable and undermine the credibility of the audit threshold.

S22: Obtain a copyright violation score corresponding to each original document by using the document classification model.

In the embodiment of the application, each original document obtained in step S21 is split in the same manner as in the training method to obtain its text segments. The document classification model is then used to obtain the probability corresponding to each text segment of each original document, and these probabilities are used to determine the copyright violation score of each original document.

In an alternative embodiment, the maximum of the probabilities corresponding to the text segments may be used as the copyright violation score. Alternatively, a plurality of target probabilities may be selected in descending order of probability, and the average of these target probabilities may be used as the copyright violation score.
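The two scoring options can be sketched as follows. The function name `violation_score` and the parameter `top_k` (the number of target probabilities) are assumptions for illustration; the source does not say how many target probabilities are selected.

```python
from typing import Optional


def violation_score(segment_probs: list[float], top_k: Optional[int] = None) -> float:
    """Score a document from the per-segment violation probabilities.

    top_k=None uses the maximum probability; otherwise the mean of the
    top_k largest probabilities is used.
    """
    if not segment_probs:
        raise ValueError("at least one segment probability is required")
    if top_k is None:
        return max(segment_probs)
    top = sorted(segment_probs, reverse=True)[:top_k]
    return sum(top) / len(top)


print(violation_score([0.1, 0.9, 0.5]))  # 0.9
```

Averaging the largest few probabilities is less sensitive to a single outlier segment than taking the maximum alone, at the cost of one extra parameter.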

S23: In ascending order of the copyright violation scores of the original documents, determine in turn whether each copyright violation score is smaller than a preset threshold, and take the last copyright violation score smaller than the preset threshold as the audit threshold.

It can be understood that, because copyright violation is a serious violation, violating documents must be covered as far as possible and must not be wrongly passed. The accuracy of documents passed without audit is therefore required to exceed a specified threshold, for example 99.99%. Specifically, documents are taken out in ascending order of copyright violation score until the accuracy of the taken-out portion falls below the specified threshold. The copyright violation score at that point is used as the pass probability threshold, that is, the last copyright violation score smaller than the preset threshold is taken as the audit threshold.

After the document classification model and the audit threshold are obtained, the document detection method provided by the embodiment of the present invention is described below. Referring to FIG. 3, FIG. 3 is a schematic flowchart of the document detection method provided by the embodiment of the present invention, which may include the following steps:
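One possible reading of this procedure is sketched below. It assumes that "accuracy of the taken-out portion" means the fraction of taken-out documents that are truly non-violating, and that the audit threshold is the last score at which this fraction still meets the requirement. The names `audit_threshold` and `min_accuracy` are hypothetical, and tied scores are not given special treatment.

```python
from typing import Optional


def audit_threshold(
    scores: list[float], is_violation: list[bool], min_accuracy: float = 0.9999
) -> Optional[float]:
    """Walk the reviewed documents in ascending score order and return the last
    score at which the documents taken out so far are still clean enough."""
    ranked = sorted(zip(scores, is_violation), key=lambda pair: pair[0])
    threshold = None
    clean = 0
    for taken, (score, violating) in enumerate(ranked, start=1):
        if not violating:
            clean += 1
        if clean / taken < min_accuracy:
            break  # passing this document too would violate the accuracy requirement
        threshold = score
    return threshold


print(audit_threshold([0.05, 0.1, 0.2, 0.6, 0.9],
                      [False, False, False, True, True], min_accuracy=0.99))  # 0.2
```

If even the lowest-scoring document is a violation, the sketch returns `None`, meaning no document would be exempted from audit.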

S31: Acquire a document to be detected, and carry out segmentation processing on the document to be detected to obtain a plurality of text segments.

In the embodiment of the invention, the document to be detected may be a document uploaded by a user, or may be obtained by speech recognition from audio uploaded by the user, which is not limited herein.

The implementation of the segmentation processing of the document to be detected is the same as the splitting way of the training document in the training process, and is not repeated here.

S32: Input the plurality of text segments into the pre-trained document classification model to obtain the probability that each text segment is a copyright violation document, the predicted value of the number vector and the probability of the predicted value.

S33: Determine a copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document.

In the embodiment of the present invention, the manner of determining the copyright violation evaluation value is the same as the manner of calculating the copyright violation score in the process of determining the auditing threshold value, and will not be described here again.

S34: If the copyright violation evaluation value is smaller than or equal to a predetermined audit threshold, determine that the document to be detected is a non-copyright violation document;

S35: Otherwise, determine similar documents corresponding to each text segment based on the predicted value of the number vector and the probability of the predicted value, and calculate the segment overlapping degree of the text segment and the similar documents, wherein the audit threshold is determined by the document classification model.

In the embodiment of the invention, if the copyright violation evaluation value is smaller than or equal to the predetermined audit threshold, the document to be detected is determined to be a non-copyright violation document. Otherwise, secondary detection is used to determine whether it is a non-copyright violation document. In the secondary detection, a plurality of similar documents of the document to be detected are determined using the number vectors and probabilities predicted by the document classification model, and the segment overlapping degree between the document to be detected and the similar documents is then calculated to assist the relevant personnel in auditing.

It can be understood that the similar documents are copyrighted documents in the copyright library, each of which has a number. For the document to be detected, the number vector and probability of each of its text segments can be predicted by the document classification model, so that the similar documents of the document to be detected can be determined.

S36: Determine whether the document to be detected is a non-copyright violation document based on the probability that each text segment is a copyright violation document and the segment overlapping degree.

In one embodiment, the copyright violation probability of each text segment and the segment overlapping degree between the text segment and a plurality of similar documents can be submitted to an auditor, so that the auditor can be assisted in combining the information to perform secondary detection on the document to be detected, and the detection accuracy is improved.

In another embodiment, the overlapping degree threshold value may be directly set, and then it is determined whether the overlapping degree of the segments corresponding to the text segments exceeds the overlapping degree threshold value, so as to determine whether the document to be detected belongs to the copyright violation document.
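The two-stage decision with a directly set overlapping degree threshold might look like the sketch below. The rule that a single text segment exceeding the threshold is enough to flag the document is an assumption, since the source does not state how the per-segment results are combined; the function and parameter names are hypothetical.

```python
def is_copyright_violation(
    evaluation_value: float,
    audit_threshold: float,
    segment_overlaps: list[float],
    overlap_threshold: float,
) -> bool:
    """Two-stage check: audit threshold first, then segment overlapping degree."""
    # First stage (S34): low evaluation values pass without further checks.
    if evaluation_value <= audit_threshold:
        return False
    # Second stage: assumed rule, any segment above the overlap threshold flags the document.
    return any(overlap > overlap_threshold for overlap in segment_overlaps)


print(is_copyright_violation(0.8, 0.2, [0.1, 0.7], 0.5))  # True
```

A stricter or looser aggregation, such as requiring several overlapping segments, would fit the same structure.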

In summary, in the document detection method provided by the invention, the acquired document to be detected is first segmented to obtain a plurality of text segments. The document classification model provided by the embodiment of the invention then predicts the probability that each text segment is a copyright violation document, and these probabilities are used to determine a copyright violation evaluation value for the document to be detected. The document classification model is also used to determine an audit threshold, and comparing the copyright violation evaluation value with the audit threshold determines whether the document to be detected may be a copyright violation document. Further, the document classification model predicts the number vector and probability corresponding to each text segment. After the document to be detected is determined to be a possible copyright violation document, similar documents and the segment overlapping degree between the text segments and those similar documents are determined based on the number vector and the probability, and the document is re-examined in combination with the segment overlapping degree, which improves recognition accuracy.

In an alternative implementation manner, the embodiment of the present invention provides an implementation of the above step S35. Referring to FIG. 4, FIG. 4 is a schematic flowchart of step S35 provided by the embodiment of the present invention, that is, step S35 may include the following steps:

S35-1: Combine the predicted values at each position of the number vector to obtain document number predicted values, and take the product of the probabilities corresponding to the combined predicted values as the probability corresponding to each document number predicted value.

In the embodiment of the invention, the probability of a final document number is obtained from the number vector prediction by multiplication. For example, suppose the number vector has two positions and each position has at most 1000 possible values, so that the prediction for each position also covers 1000 values. Suppose the first position is predicted as 256 with probability 0.8 and as 222 with probability 0.1, and the second position is predicted as 125 with probability 0.7 and as 23 with probability 0.2. The resulting document numbers and probabilities are: 256125 with probability 0.8 × 0.7; 256023 with probability 0.8 × 0.2; 222125 with probability 0.1 × 0.7; and 222023 with probability 0.1 × 0.2. In this way, a plurality of document numbers and the probability of each can be obtained.

S35-2: and determining a preset number of target document number predicted values according to the probability corresponding to all the document number predicted values from large to small, and taking the documents corresponding to the target document number predicted values as similar documents in a preset copyright document library.

In the embodiment of the present invention, a plurality of document numbers and their respective probabilities are obtained through step S35-1. The document numbers are then sorted by probability, a preset number of numbers are taken as target numbers in descending order of probability, and the documents with the target numbers in the copyright document library are determined as the similar documents.
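A minimal sketch of this selection is given below. The dictionary used as the copyright document library, the function name `select_similar_documents`, and the choice to skip a predicted number that does not exist in the library are all assumptions made for the example.

```python
def select_similar_documents(number_probs, copyright_library, preset_count=2):
    """Pick the documents whose predicted numbers have the highest probability.

    number_probs: (document_number, probability) pairs from step S35-1.
    copyright_library: hypothetical mapping from document number to document text.
    preset_count: the preset number of target document numbers to keep.
    """
    ranked = sorted(number_probs, key=lambda item: item[1], reverse=True)
    targets = ranked[:preset_count]
    # Assumption: a predicted number with no document in the library is skipped.
    return [
        (number, copyright_library[number])
        for number, _ in targets
        if number in copyright_library
    ]


number_probs = [("222125", 0.07), ("256125", 0.56), ("222023", 0.02), ("256023", 0.16)]
library = {"256125": "text of document 256125", "222125": "text of document 222125"}
print(select_similar_documents(number_probs, library, preset_count=2))
```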

S35-3: Take the ratio of the number of overlapping words between each text segment and each similar document to the total number of words of the text segment as the segment overlapping degree.

In the embodiment of the invention, the segment overlapping degree can be understood as the ratio of the number of overlapping words between the text segment and the similar document to the total number of words of the text segment. It can be determined by using the longest common subsequence method, which is not described in detail herein.
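As an illustration, the segment overlapping degree can be computed with a standard dynamic-programming longest common subsequence. The sketch below assumes that the texts have already been split into word tokens (for Chinese text, characters could be used instead); the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def segment_overlap(segment_tokens, similar_tokens):
    """Ratio of overlapping words to the total number of words in the segment."""
    if not segment_tokens:
        return 0.0
    return lcs_length(segment_tokens, similar_tokens) / len(segment_tokens)


segment = "the quick brown fox".split()
similar = "the very quick red fox jumps".split()
print(segment_overlap(segment, similar))  # 3 common words out of 4 -> 0.75
```

The ratio is normalized by the length of the text segment rather than the similar document, so a short segment copied entirely from a long document still receives an overlapping degree of 1.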

In this way, auxiliary auditing information such as the segment overlapping degree can be provided to auditors, which helps ensure the accuracy of the detection results.

Based on the same inventive concept, the embodiment of the present invention further provides a document detection apparatus, which is configured to execute the corresponding steps in the foregoing method embodiment and each possible implementation manner, and an implementation manner of the document detection apparatus is given below. Referring to fig. 5, fig. 5 is a schematic block diagram of a document detection apparatus according to an embodiment of the present invention. The document detection apparatus 500 includes: an acquisition module 510, a prediction module 520, a determination module 530, and a detection module 540;

the acquisition module 510 is configured to obtain a document to be detected, and perform segmentation processing on the document to be detected to obtain a plurality of text segments;

the prediction module 520 is configured to input a plurality of text segments into a pre-trained document classification model, to obtain a probability that each text segment is a copyright violation document, a predicted value of a numbering vector, and a probability of the predicted value;

a determining module 530, configured to determine a copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document;

a detection module 540, configured to determine that the document to be detected is a non-copyright-violation document if the copyright-violation evaluation value is less than or equal to a predetermined audit threshold; otherwise, determining similar documents corresponding to each text segment based on the predicted value of the number vector and the probability of the predicted value, and calculating the segment overlapping degree of the text segment and the similar documents; wherein the audit threshold is determined by a document classification model;

the detection module 540 is further configured to determine whether the document to be detected is a non-copyright-violating document based on the probability that each text segment is a copyright-violating document and the segment overlap degree.

It will be appreciated that the acquisition module 510, the prediction module 520, the determination module 530 and the detection module 540 may cooperatively perform the steps in fig. 3 to achieve the corresponding technical effects.

Optionally, the detection module 540 is specifically configured to: combine the predicted values at each position in the number vector to obtain a document number predicted value, and take the product of the probabilities corresponding to the combined predicted values as the probability corresponding to the document number predicted value; determine a preset number of target document number predicted values in descending order of the probabilities corresponding to all the document number predicted values, and take the documents corresponding to the target document number predicted values in a preset copyright document library as the similar documents; and take the ratio of the number of overlapping words between each text segment and each similar document to the total number of words of the text segment as the segment overlapping degree.

Optionally, the document detection apparatus 500 may further include a training module, configured to: use a plurality of copyrighted documents and a plurality of non-copyrighted documents as training documents, configure a first numerical label for the copyrighted documents, and configure a second numerical label for the non-copyrighted documents; number each training document, and perform vector conversion on each document number based on a preset base (radix) to obtain a number vector of each training document, where the first numerical value in the number vector is used for indicating a document number range corresponding to the number vector; split each training document into a plurality of training text segments, encode each numerical value of the number vector of the training document corresponding to the training text segments, and take the encoded number vector as the number label of the training text segments; and input the plurality of training text segments corresponding to each training document, together with the number label and the first numerical label or the second numerical label corresponding to the training text segments, into a multi-output classification model for training until a training ending condition is reached, to obtain the document classification model.
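The vector conversion of a document number can be illustrated with a short sketch. The base of 1000 and the vector length of 2 are assumptions taken from the two-position example given for step S35-1, and the function names `to_number_vector` and `from_number_vector` are hypothetical; the encoding of each value into a label is not shown.

```python
def to_number_vector(doc_number, base=1000, length=2):
    """Convert a document number into a fixed-length vector of base-`base` digits.

    The most significant digit comes first, so the first value selects a block
    of consecutive document numbers (one possible reading of "document number
    range" in the description).
    """
    if not 0 <= doc_number < base ** length:
        raise ValueError("document number does not fit in the number vector")
    digits = []
    for _ in range(length):
        doc_number, digit = divmod(doc_number, base)
        digits.append(digit)
    return digits[::-1]


def from_number_vector(vector, base=1000):
    """Inverse conversion: rebuild the document number from its vector."""
    number = 0
    for digit in vector:
        number = number * base + digit
    return number


print(to_number_vector(256125))        # [256, 125]
print(to_number_vector(222023))        # [222, 23]
print(from_number_vector([256, 125]))  # 256125
```

Under this layout, a first value of 256 corresponds to document numbers 256000 to 256999, so each position can be predicted as a separate 1000-way classification output.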

Optionally, the determination module 530 is specifically configured to take the maximum of the probabilities corresponding to the plurality of text segments as the copyright violation evaluation value of the document to be detected, or to determine a plurality of target probabilities in descending order of probability and take the average value of the plurality of target probabilities as the copyright violation evaluation value.
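Both ways of aggregating the segment probabilities can be sketched in a few lines. The function name `violation_evaluation_value` and the parameter `top_n` (the number of target probabilities to average) are assumptions made for the example.

```python
def violation_evaluation_value(segment_probs, top_n=None):
    """Aggregate per-segment probabilities into one evaluation value.

    top_n=None: use the maximum segment probability.
    top_n=k:    average the k largest segment probabilities.
    """
    if not segment_probs:
        raise ValueError("at least one segment probability is required")
    if top_n is None:
        return max(segment_probs)
    top = sorted(segment_probs, reverse=True)[:top_n]
    return sum(top) / len(top)


probs = [0.1, 0.9, 0.6, 0.3]
print(violation_evaluation_value(probs))           # maximum
print(violation_evaluation_value(probs, top_n=2))  # mean of the two largest
```

Averaging several of the largest probabilities makes the evaluation value less sensitive to a single segment with an unusually high score than taking the maximum alone.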

Optionally, the determining module 530 is further specifically configured to use the obtained copyright violation document and the non-copyright-violation document as original documents to form an original document data set; obtaining a copyright violation score corresponding to each original document by using a document classification model; and sequentially determining whether each copyright violation score is smaller than a preset threshold value according to the sequence from small to large of the copyright violation score corresponding to each original document, and taking the last copyright violation score smaller than the preset threshold value as an auditing threshold value.
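A literal reading of this threshold selection can be sketched as follows. The function name `determine_audit_threshold`, the sample scores, and the behavior of returning `None` when no score is below the preset threshold are assumptions made for the example.

```python
def determine_audit_threshold(violation_scores, preset_threshold):
    """Scan the scores in ascending order and keep the last one below the preset threshold.

    violation_scores: copyright violation scores of the original documents,
    obtained with the document classification model.
    Returns None if no score is smaller than the preset threshold (assumption).
    """
    audit_threshold = None
    for score in sorted(violation_scores):
        if score < preset_threshold:
            audit_threshold = score
        else:
            # Scores are ascending, so no later score can be below the threshold.
            break
    return audit_threshold


scores = [0.05, 0.40, 0.12, 0.93, 0.20]
print(determine_audit_threshold(scores, preset_threshold=0.3))
```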

Optionally, the acquisition module 510 is specifically configured to obtain audio to be audited, and perform voice recognition on the audio to obtain the document to be detected.

It should be noted that, in the above embodiments of the present invention, the division of the modules is merely schematic, and there may be another division manner in actual implementation, and in addition, each functional unit in each embodiment of the present invention may be integrated in one processing unit, or may exist separately and physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a readable storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned readable storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The embodiment of the invention also provides a computer device for executing the document detection method in the embodiment of the invention. Referring to fig. 6, fig. 6 shows a schematic block diagram of the computer device provided in the embodiment of the invention. The computer device 600 may be a server, a personal computer, an edge gateway, etc., and comprises a memory 601, a processor 602, a communication interface 603 and a bus 604, the processor 602 being connected to the memory 601 via the bus 604.

Optionally, the bus 604 may be a peripheral component interconnect (PCI) bus, an extended industry standard architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 6, but this does not mean that there is only one bus or one type of bus.

In the embodiments of the present application, the processor 602 may be a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. The general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in a hardware processor for execution, or in a combination of hardware and software modules in the processor for execution. The software module may be located in the memory 601 and the processor 602 reads the program instructions in the memory 601 and in combination with its hardware performs the steps of the method described above.

In the embodiment of the present application, the memory 601 may be a nonvolatile memory, such as a hard disk drive (HDD) or a solid state drive (SSD), or may be a volatile memory, such as a random access memory (RAM). The memory may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to such. The memory in the embodiments of the present application may also be a circuit or any other device capable of implementing a memory function, for storing instructions and/or data.

The memory 601 may be used to store software programs and modules, such as the instructions/modules of the document detection apparatus 500 provided in the embodiments of the present invention, which may be stored in the memory 601 in the form of software or firmware or be solidified in an operating system (OS) of the computer device 600. The processor 602 executes the software programs and modules stored in the memory 601, thereby performing various functional applications and data processing. The communication interface 603 may be used for communication of signaling or data with other node devices.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, which are not described herein again.

It is to be understood that the configuration shown in fig. 6 is illustrative only, and that the computer device 600 may also include more or fewer components than shown in fig. 6, or have a different configuration than shown in fig. 6. The components shown in fig. 6 may be implemented in hardware, software, or a combination thereof.

Based on the above embodiments, the present invention also provides a readable storage medium having stored therein a computer program which, when executed by a computer, causes the computer to execute the document detection method provided in the above embodiments.

Based on the above embodiments, the present invention also provides a computer program, which when run on a computer, causes the computer to execute the document detection method provided in the above embodiments.

Based on the above embodiments, the present invention further provides a chip, where the chip is configured to read a computer program stored in a memory, and is configured to perform the document detection method provided in the above embodiments.

Embodiments of the present invention also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the document detection method provided in the above embodiments.

Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by instructions. These instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The present invention is not limited to the above embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. A document detection method, the method comprising:

acquiring a document to be detected, and carrying out segmentation processing on the document to be detected to obtain a plurality of text segments;

inputting a plurality of text segments into a pre-trained document classification model to obtain the probability that each text segment is a copyright violation document, the predicted value of a numbering vector and the probability of the predicted value;

determining a copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document;

if the copyright violation evaluation value is smaller than or equal to a predetermined auditing threshold value, determining that the document to be detected is a non-copyright violation document; otherwise, determining similar documents corresponding to each text segment based on the predicted value of the number vector and the probability of the predicted value, and calculating the segment overlapping degree of the text segment and the similar documents; wherein the audit threshold is determined by the document classification model;

and determining whether the document to be detected is a non-copyright-violating document or not based on the probability that each text segment is a copyright-violating document and the segment overlapping degree.

2. The document detection method according to claim 1, wherein determining a similar document corresponding to each text segment based on a predicted value of the number vector and a probability of the predicted value, and calculating a segment overlap degree of the text segment and the similar document, comprises:

combining the predicted values at each position in the number vector to obtain a document number predicted value, and taking the product of probabilities corresponding to the combined predicted values as the probability corresponding to the document number predicted value;

sequentially determining a preset number of target document number predicted values according to the probabilities corresponding to all the document number predicted values from large to small, and taking the documents corresponding to the target document number predicted values as the similar documents in a preset copyright document library;

and taking the ratio of the number of overlapped words of each text segment and each similar document to the total number of words corresponding to the text segment as the segment overlapping degree.

3. The document detection method of claim 1, wherein the document classification model is trained by:

using a plurality of copyright documents and a plurality of non-copyright documents as training documents, configuring a first numerical value tag for the copyright documents, and configuring a second numerical value tag for the non-copyright documents;

numbering each training document, and carrying out vector conversion on each document number based on a preset base (radix) to obtain a numbering vector of each training document; the first numerical value in the numbering vector is used for indicating a document numbering range corresponding to the numbering vector;

splitting each training document into a plurality of training text segments, coding each numerical value of the number vector of the training document corresponding to the training text segments, and taking the coded number vector as a number label of the training text segments;

and inputting a plurality of training text segments corresponding to each training document and the serial number labels, the first numerical labels or the second numerical labels corresponding to the training text segments into a multi-output classification model for training until reaching a training ending condition, and obtaining the document classification model.

4. The document detection method according to claim 1, wherein determining the copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document, comprises:

and taking the maximum of the probabilities corresponding to the text segments as the copyright violation evaluation value of the document to be detected, or sequentially determining a plurality of target probabilities in descending order of probability, and taking the average value of the target probabilities as the copyright violation evaluation value.

5. The document detection method of claim 4, wherein the audit threshold is determined by:

taking the acquired copyright violation document and the non-copyright violation document as original documents to form an original document data set;

obtaining a copyright violation score corresponding to each original document by using the document classification model;

and sequentially determining whether each copyright violation score is smaller than a preset threshold value according to the sequence from small to large of the copyright violation score corresponding to each original document, and taking the last copyright violation score smaller than the preset threshold value as the auditing threshold value.

6. The document detection method according to claim 1, wherein obtaining a document to be detected includes:

and acquiring audio to be audited, and performing voice recognition on the audio to obtain the document to be detected.

7. A document detection apparatus, comprising: an acquisition module, a prediction module, a determination module and a detection module;

the acquisition module is used for acquiring a document to be detected, and carrying out segmentation processing on the document to be detected to obtain a plurality of text segments;

the prediction module is used for inputting a plurality of text segments into a pre-trained document classification model to obtain the probability that each text segment is a copyright violation document, the predicted value of a number vector and the probability of the predicted value;

the determining module is used for determining a copyright violation evaluation value corresponding to the document to be detected according to the probability that each text segment is a copyright violation document;

the detection module is used for determining that the document to be detected is a non-copyright-violating document if the copyright violation evaluation value is smaller than or equal to a predetermined auditing threshold value; otherwise, determining similar documents corresponding to each text segment based on the predicted value of the number vector and the probability of the predicted value, and calculating the segment overlapping degree of the text segment and the similar documents; wherein the audit threshold is determined by the document classification model;

and the detection module is further used for determining whether the document to be detected is a non-copyright-violating document based on the probability that each text segment is a copyright-violating document and the segment overlapping degree.

8. The document detection apparatus according to claim 7, wherein the detection module is specifically configured to:

9. A computer device comprising a processor and a memory, the memory storing a computer program executable by the processor, the processor being configured to execute the computer program to implement the document detection method of any one of claims 1 to 6.

10. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the document detection method according to any one of claims 1 to 6.