CN117172220B - Text similarity information generation method, device, equipment and computer readable medium - Google Patents


Info

Publication number
CN117172220B
CN117172220B
Authority
CN
China
Prior art keywords
text
input
similarity
word
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311444882.6A
Other languages
Chinese (zh)
Other versions
CN117172220A (en)
Inventor
代鲁峰
王显岭
任志鹏
董亮
王丽君
陈曦
张晓枫
隋志巍
王志波
陈恩光
宋峰旭
银天伟
王娟
张小睦
柳雁
毛硕
张硕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Information and Telecommunication Co Ltd
Beijing Guodiantong Network Technology Co Ltd
Original Assignee
State Grid Information and Telecommunication Co Ltd
Beijing Guodiantong Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Information and Telecommunication Co Ltd, Beijing Guodiantong Network Technology Co Ltd filed Critical State Grid Information and Telecommunication Co Ltd
Priority to CN202311444882.6A
Publication of CN117172220A
Application granted
Publication of CN117172220B
Legal status: Active


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

Embodiments of the present disclosure disclose text similarity information generation methods, apparatuses, devices, and computer readable media. One embodiment of the method comprises: acquiring an input file set and a release file; performing file parsing on the input file set and the release file; removing, from each input text, text content whose similarity to the release text satisfies a preset condition, to generate the final input texts; and, for every two input files, performing a generating step: determining the file overview similarity; in response to determining that the file overview similarity is greater than a first value, determining the text similarity; in response to determining that the text similarity is greater than a second value, generating a file similarity; and sending the file similarity and the file information corresponding to the two input files to an auditing end. This embodiment can accurately and efficiently generate the file similarity between submitted input files, safeguarding the validity of the bidding process.

Description

Text similarity information generation method, device, equipment and computer readable medium
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular, to a method, an apparatus, a device, and a computer readable medium for generating text similarity information.
Background
Currently, before a project is undertaken, the party that will carry it out is usually selected through bidding. To assess the similarity of input files (e.g., submitted bid files), the following approach is typically adopted: the similarity between submitted input files is judged manually by relevant professionals, and a bid-invalidation operation is performed on input files with high similarity.
However, the inventors have found that when the above-described manner is adopted, there are often the following technical problems:
firstly, the process is inefficient and cannot accurately determine the similarity between two input files, so bid rigging and collusive bidding often go undetected during the bidding process;
secondly, feature extraction from input images is inaccurate, and extracting the feature information often requires many neural networks, so the computation occupies a large amount of memory and is inefficient.
The above information disclosed in this background section is only for enhancement of understanding of the background of the inventive concept and, therefore, may contain information that does not form the prior art that is already known to those of ordinary skill in the art in this country.
Disclosure of Invention
The disclosure is in part intended to introduce concepts in a simplified form that are further described below in the detailed description. The disclosure is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose text similarity information generation methods, apparatuses, devices, and computer readable media to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a text similarity information generation method, including: acquiring an input file set and a release file; performing file parsing on the input file set and the release file, respectively, to generate an input text set and a release text; removing, from each input text in the input text set, text content whose similarity to the release text satisfies a preset condition, to generate a removed input text as an input text; and, for every two input files in the input file set, performing the following generation steps: determining the file overview similarity between the two input files; in response to determining that the file overview similarity is greater than a first value, determining the text similarity between the two input texts corresponding to the two input files; in response to determining that the text similarity is greater than a second value, generating a file similarity for the two input files; and sending the file similarity and the file information corresponding to the two input files to an auditing end to generate text similarity information for the two input files.
In a second aspect, some embodiments of the present disclosure provide a text similarity information generating apparatus, including: an acquisition unit configured to acquire an input file set and a release file; a parsing unit configured to perform file parsing on the input file set and the release file, respectively, to generate an input text set and a release text; a removing unit configured to remove, from each input text in the input text set, text content whose similarity to the release text satisfies a preset condition, to generate a removed input text as an input text; and an execution unit configured to perform, for every two input files in the input file set, the following generation steps: determining the file overview similarity between the two input files; in response to determining that the file overview similarity is greater than a first value, determining the text similarity between the two input texts corresponding to the two input files; in response to determining that the text similarity is greater than a second value, generating a file similarity for the two input files; and sending the file similarity and the file information corresponding to the two input files to an auditing end to generate text similarity information for the two input files.
In a third aspect, some embodiments of the present disclosure provide an electronic device comprising: one or more processors; a storage device having one or more programs stored thereon, which when executed by one or more processors, cause the one or more processors to implement the method as described in any of the implementations of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements a method as described in any of the implementations of the first aspect.
The above embodiments of the present disclosure have the following advantageous effects: the text similarity information generation method of some embodiments can accurately and efficiently generate the file similarity between submitted input files, safeguarding the validity of the bidding process. Specifically, related methods produce insufficiently accurate file similarities because they are inefficient and cannot accurately determine the similarity between two input files, so bid rigging and collusive bidding often go undetected. Based on this, the text similarity information generation method of some embodiments of the present disclosure first acquires the input file set and the release file, in order to later determine whether bid rigging or collusive bidding has occurred. Next, file parsing is performed on the input file set and the release file, respectively, to generate an input text set and a release text, which facilitates the subsequent similarity analysis of the texts. Then, text content whose similarity to the release text satisfies a preset condition is removed from each input text in the input text set, to generate a removed input text as an input text. Removing such content ensures that the similarity determination between two input files is not distorted: if content satisfying the preset similarity condition (typically boilerplate shared with the release file) were not removed, the degree of similarity between the files could not be accurately determined later. Further, for every two input files in the input file set, the following generation steps are performed.
First, the file overview similarity between the two input files is determined; at small computational cost, this preliminarily establishes how similar the files are from the perspective of their overviews. Second, in response to determining that the file overview similarity is greater than a first value, the text similarity between the two input texts corresponding to the two input files is determined; performing this further comparison only when the overview similarity exceeds the first value allows the similarity between the files to be determined more accurately. Third, in response to determining that the text similarity is greater than a second value, a file similarity is generated for the two input files; again, the further calculation is performed only when the text similarity exceeds the second value. Fourth, the file similarity and the file information corresponding to the two input files are sent to an auditing end to generate text similarity information for the two input files. In summary, the similarity between two input files is determined step by step in multiple layers, so that the file similarity between input files is generated accurately and efficiently, and the validity of the bidding process is safeguarded.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is a flow chart of some embodiments of a text similarity information generation method according to the present disclosure;
FIG. 2 is a schematic diagram of the structure of some embodiments of a text similarity information generating apparatus according to the present disclosure;
fig. 3 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings. Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "one" and "a plurality" in this disclosure are illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Referring to fig. 1, a flow 100 of some embodiments of a text similarity information generation method according to the present disclosure is shown. The text similarity information generation method comprises the following steps:
Step 101: acquire an input file set and a release file.
In some embodiments, the execution body of the text similarity information generation method may acquire the input file set and the release file through a wired or wireless connection. Each input file in the input file set may be a file for a target input item; for example, if the target input item is a building project, each input file may be a building-related file. In practice, an input file may be a bid file, and the corresponding target input item a bid item. The release file may be the information released for the target input item; in practice, the release file may be a tender (bid-soliciting) file.
Step 102: perform file parsing on the input file set and the release file, respectively, to generate an input text set and a release text.
In some embodiments, the execution body may perform file parsing on the input file set and the release file to generate an input text set and a release text. The input files in the input file set correspond one-to-one with the input texts in the input text set. An input text is the text content corresponding to an input file; the release text is the text content corresponding to the release file.
As an example, the execution body may perform file parsing on the input file set and the release file, respectively, using OCR (Optical Character Recognition), iTextPDF, decompression, or the like, to generate the input text set and the release text.
Step 103: remove, from each input text in the input text set, text content whose similarity to the release text satisfies a preset condition, to generate a removed input text as an input text.
In some embodiments, the execution body may remove, from each input text in the input text set, text content whose similarity to the release text satisfies the preset similarity condition, so as to generate a removed input text as the input text. The preset similarity condition may be, for example, that the sentence similarity between the input text and the release text exceeds 70%.
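As an illustrative sketch only (the patent does not specify the sentence-similarity measure), step 103 can be read as dropping every input-text sentence that closely matches a sentence of the release text. Here `difflib.SequenceMatcher` stands in for the unspecified similarity function, and the 70% threshold comes from the example condition above.

```python
import re
from difflib import SequenceMatcher

def remove_shared_content(input_text: str, release_text: str,
                          threshold: float = 0.7) -> str:
    """Drop sentences of the input text whose similarity to any release-text
    sentence exceeds the preset threshold (70% in the example above)."""
    release_sentences = [s.strip() for s in re.split(r"[.。]", release_text) if s.strip()]
    kept = []
    for sentence in (s.strip() for s in re.split(r"[.。]", input_text)):
        if not sentence:
            continue
        # Keep the sentence only if no release sentence is too similar to it.
        if not any(SequenceMatcher(None, sentence, r).ratio() > threshold
                   for r in release_sentences):
            kept.append(sentence)
    return ". ".join(kept)
```

In this reading, boilerplate clauses copied verbatim from the release file are stripped before the pairwise comparison, so they cannot inflate the similarity between two unrelated input files.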
Step 104: for every two input files in the resulting input file set, perform the following generation steps:
step 1041, determining a file summary similarity between the two input files.
In some embodiments, the execution body may determine a profile overview similarity between the two investment profiles. The file summary similarity may be content similarity between file summaries. File overview similarity may be a number between 0 and 1.
As an example, the execution body may first use a Transformer model to determine file overview semantic content feature information for the two file overviews corresponding to the two input files, and then take the cosine similarity between the two pieces of feature information as the file overview similarity.
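The cosine step itself is simple; the patent assumes a Transformer produces the overview feature vectors. A minimal sketch, with sparse term-count vectors standing in for the (assumed) Transformer embeddings and hypothetical overview strings:

```python
import math
from collections import Counter

def cosine_similarity(vec_a: dict, vec_b: dict) -> float:
    """Cosine similarity between two sparse feature vectors."""
    dot = sum(vec_a[t] * vec_b.get(t, 0) for t in vec_a)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical file overviews, represented as term counts for illustration.
summary_a = Counter("construction of substation monitoring platform".split())
summary_b = Counter("monitoring platform for substation construction".split())
overview_similarity = cosine_similarity(summary_a, summary_b)
```

With dense Transformer embeddings the formula is identical; only the vector source changes.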
Step 1042: in response to determining that the file overview similarity is greater than a first value, determine the text similarity between the two input texts corresponding to the two input files.
In some embodiments, in response to determining that the file overview similarity is greater than the first value, the execution body may determine the text similarity between the two input texts corresponding to the two input files. The text similarity is the similarity of the text contents. The first value may be a preset value; for example, 0.5.
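Steps 1041 and 1042 form a cascade: the cheap overview comparison gates the more expensive text-level comparison, which in turn gates the final file-similarity generation. A sketch of that control flow, where the 0.5 first value is from the text and the second value is an assumed placeholder (the patent only says it is preset):

```python
FIRST_VALUE = 0.5    # overview-similarity threshold (example value above)
SECOND_VALUE = 0.6   # text-similarity threshold (assumed placeholder)

def compare_pair(overview_similarity, text_similarity_fn):
    """Return a text similarity only when both gates pass; None means the
    pair is dropped without running the costly text-level comparison."""
    if overview_similarity <= FIRST_VALUE:
        return None                        # rejected at the overview stage
    text_similarity = text_similarity_fn()  # computed only when needed
    if text_similarity <= SECOND_VALUE:
        return None
    return text_similarity
```

Passing the text comparison as a callable makes the lazy evaluation explicit: most pairs are eliminated before any heavy computation runs.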
In some optional implementations of some embodiments, the determining the text similarity between the two input texts corresponding to the two input files may include the following steps:
and a first step of performing text word segmentation processing on the two input texts to obtain a first word set and a second word set respectively.
As an example, the execution body may perform text word segmentation on the two input texts using the jieba tokenizer, obtaining a first word set and a second word set respectively.
Second, for each word in the first word set, the following first processing step is performed:
and a first sub-step of determining the word weight of the word in the first word set as the first word weight. Wherein the first word weight may be a value between 0 and 1.
As an example, the execution body may determine a word weight of the word in the first word set as the first word weight using a TF-IDF method.
And a second sub-step of generating a word hash signature value for the word as a first word hash signature value.
As an example, the execution subject may generate a word hash signature value for the word as the first word hash signature value using a Digital Digest method (Digital Digest) or a Digital fingerprint method (Digital Finger Print).
And a third step of multiplying each first word hash signature value by the corresponding first word weight to obtain a first multiplication result, resulting in a first multiplication result set.
And a fourth step of adding the first multiplication results in the obtained first multiplication result set to obtain a first addition result.
And fifthly, carrying out hash signature processing on the first addition result to generate a second word hash signature value.
Sixth, for each word in the second word set, the following second processing step is performed:
and a first sub-step of determining the word weight of the word in the second word set as the second word weight.
As an example, the execution body may determine a word weight of the word in the second word set as the second word weight using a TF-IDF method.
And a second sub-step of generating a word hash signature value for the word as a second word hash signature value.
As an example, the execution subject may generate a word hash signature value for the word as the second word hash signature value using a digital digest method or a digital fingerprint method.
And a third substep, multiplying the second word hash signature value and the second word weight to obtain a second multiplication result.
And a seventh step of adding the second multiplication results in the second multiplication result set to obtain a second addition result.
And eighth step, hash signature processing is carried out on the second addition result so as to generate a fourth word hash signature value.
And a ninth step of determining a difference between the second word hash signature value and the fourth word hash signature value to obtain a difference result.
And a tenth step of dividing the difference result by the binary bit width corresponding to the hash signature values to obtain a division result. The binary bit width is the number of bits with which the hash signature value is initialized; for example, 128 or 64.
And eleventh, subtracting the division result from the target value to obtain a subtraction result, wherein the subtraction result is used as the text similarity.
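The first through eleventh steps describe a SimHash-style fingerprint comparison. The sketch below is one conventional reading under stated assumptions, not the literal claim arithmetic: the weight-times-hash accumulation is realized per bit (add the weight where the hash bit is 1, subtract it where it is 0), a truncated MD5 stands in for the unspecified digital-digest hash, the TF-IDF word weights are assumed precomputed, and the eleventh step's target value is taken to be 1, so similarity is 1 minus the normalized Hamming difference.

```python
import hashlib

BITS = 64  # bit width of the fingerprint; the text mentions 64 or 128

def word_hash(word: str) -> int:
    """Stable 64-bit word hash signature (digital-digest style; assumption)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")

def fingerprint(weighted_words: dict) -> int:
    """Fold the weighted word hashes into one hash signature value:
    per bit, add the weight if the bit is 1, subtract it if 0, keep the sign."""
    acc = [0.0] * BITS
    for word, weight in weighted_words.items():  # weight: e.g. TF-IDF in [0, 1]
        h = word_hash(word)
        for i in range(BITS):
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if acc[i] > 0)

def text_similarity(words_a: dict, words_b: dict) -> float:
    """1 - (Hamming difference of the two fingerprints / bit width),
    as in the ninth through eleventh steps."""
    distance = bin(fingerprint(words_a) ^ fingerprint(words_b)).count("1")
    return 1.0 - distance / BITS
```

Identical weighted word sets give a similarity of 1.0; unrelated vocabularies tend toward about 0.5, which is why a preset threshold (the second value) is applied afterwards.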
In some alternative implementations of some embodiments, the two input texts include a first input text and a second input text, which may be the texts corresponding to two different input files.
Optionally, the determining the text similarity between the two input texts corresponding to the two input files may include the following steps:
the first step, according to the file title information, title segmentation is carried out on the first input text and the second input text, and a first input text segmentation sequence and a second input text segmentation sequence are respectively obtained. The file title information may be level information of each title level to which the file is laid out. For example, the file header information may include, but is not limited to, at least one of: primary title, secondary title, tertiary title, quaternary title, and penta title.
As an example, the execution body may segment the first input text and the second input text according to the first-level through fifth-level titles, obtaining a first input text segment sequence and a second input text segment sequence, respectively.
And a second step of determining a first input text vector corresponding to each first input text segment in the first input text segment sequence, and a second input text vector corresponding to each second input text segment in the second input text segment sequence. The first input text vector is the vector representation of a first input text segment; the second input text vector is the vector representation of a second input text segment.
For example, for each first input text segment in the sequence of first input text segments, first, the execution body may perform text segmentation on the first input text segment to obtain a text word set. Then, word embedding processing is carried out on each text word in the text word set so as to generate word embedding vectors, and a word embedding vector set is obtained. Then, vector fusion is carried out on each word embedding vector in the word embedding vector set, and a fusion vector is obtained and is used as a first input text vector.
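A sketch of the second step under stated assumptions: the embedding table is faked with a deterministic pseudo-random vector per word (in practice a trained embedding table would be used), and "vector fusion" is read as element-wise averaging.

```python
import random

EMBED_DIM = 8  # illustrative embedding dimension

def embed_word(word: str) -> list:
    """Placeholder word embedding: a deterministic pseudo-random vector.
    Stands in for a trained embedding table (assumption)."""
    rng = random.Random(word)  # seeding by the word makes it reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]

def segment_vector(segment: str) -> list:
    """Word-segment the text segment, embed each word, and fuse the word
    embedding vectors by element-wise averaging."""
    vectors = [embed_word(w) for w in segment.split()]
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Other fusion choices (sum, max-pooling, attention-weighted averaging) slot in at the same place; the patent does not commit to one.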
And a third step of inputting each first input text vector in the obtained first input text vector sequence into a first recurrent neural network model of multiple stacked layers to generate a first text feature information sequence. The first text feature information in the sequence corresponds one-to-one with the first input text vectors and characterizes the semantic features of the corresponding first input text segment.
And a fourth step of inputting each second input text vector in the obtained second input text vector sequence into a second recurrent neural network model of multiple stacked layers to generate a second text feature information sequence. The second text feature information in the sequence corresponds one-to-one with the second input text vectors and characterizes the semantic features of the corresponding second input text segment.
And fifthly, determining the feature similarity between each first text feature information in the first text feature information sequence and each second text feature information in the second text feature information sequence to obtain a feature similarity set. The feature similarity may characterize an information similarity between the first text feature information and the second text feature information. Specifically, the feature similarity may be cosine similarity.
And sixthly, generating at least one feature information group according to the feature similarity set, wherein the feature similarity between the first text feature information and the second text feature information included in the feature information group is larger than a third numerical value. The feature information group includes: first text feature information and second text feature information. The third value may be a preset value. For example, the third value may be 0.74.
As an example, the execution body may select, from the feature similarity set, the pairs whose feature similarity is greater than the third value, obtaining at least one feature information group.
Seventh, for each of the at least one feature information group, performing the information generating step of:
and a substep 1, inputting the first text feature information and the second text feature information included in the feature information set into a semantic information generation model to generate the first text semantic information and the second text semantic information. The semantic information generation model may be a neural network model that generates text semantic information. The first text semantic information may represent text semantics corresponding to the first text feature information. The second text semantic information may characterize text semantics corresponding to the second text feature information.
And 2, determining the semantic similarity corresponding to the first text semantic information and the second text semantic information. Wherein the semantic similarity may be a value between 0 and 1.
As an example, first, the execution subject may extract a first semantic keyword set corresponding to the first text semantic information and a second semantic keyword set corresponding to the second text semantic information. And then, determining the keyword similarity between the first semantic keyword set and the second semantic keyword set, and obtaining the keyword set similarity as the semantic similarity.
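The keyword-set similarity is not defined further; one plausible reading is set overlap (a Jaccard ratio), sketched here with hypothetical keyword sets:

```python
def keyword_set_similarity(keywords_a: set, keywords_b: set) -> float:
    """Shared keywords over all distinct keywords (Jaccard; an assumption,
    since the patent leaves the keyword similarity unspecified)."""
    if not (keywords_a or keywords_b):
        return 0.0
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)

semantic_similarity = keyword_set_similarity(
    {"grid", "substation", "safety"},     # hypothetical first keyword set
    {"grid", "substation", "schedule"},   # hypothetical second keyword set
)
```

The result stays in [0, 1], matching the value range stated for the semantic similarity.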
Sub-step 3: according to the semantic similarity, generate determination information characterizing whether the first text feature information and the second text feature information in the feature information group are similar feature information. The determination information may be one of: information characterizing that the first and second text feature information in the group are similar feature information, or information characterizing that they are not similar feature information.
As an example, in response to determining that the semantic similarity is greater than 0.5, information is generated that characterizes the first text feature information and the second text feature information in the set of feature information as similar feature information. In response to determining that the semantic similarity is less than or equal to 0.5, generating information that characterizes the first text feature information and the second text feature information in the set of feature information as not being similar feature information.
Eighth, determine the number of target feature information groups in the at least one feature information group according to the obtained at least one piece of determination information. The pieces of determination information correspond one-to-one with the feature information groups. A target feature information group is a feature information group whose corresponding determination information characterizes its first and second text feature information as similar feature information. For example, the number of such groups may be 4.
And ninth, generating the text similarity according to the number of the groups.
As an example, the execution body may first determine the number of feature information groups included in the at least one feature information group as the total group number, and then divide the number of target feature information groups by the total group number to obtain the text similarity.
Optionally, the generating the text similarity according to the number of groups may include the steps of:
first, according to the number of groups, a first text similarity is generated.
As an example, the execution body may first determine the number of feature information groups included in the at least one feature information group as the total group number, then divide the number of target feature information groups by the total group number to obtain an initial text similarity, and finally add the initial text similarity and a target value, taking the sum as the first text similarity.
And a second step of determining a first target input text and a second target input text corresponding to the target feature information groups for each of the at least one target feature information group.
Third, a first knowledge graph aiming at the first target input text and a second knowledge graph aiming at the second target input text are generated.
As an example, the execution body may first extract an entity information set, an entity association relationship information set, an entity-attribute association relationship information set, an entity attribute information set, and an entity attribute set associated with the first target input text, and then generate the first knowledge graph from them. A node of the first knowledge graph is either entity information or an entity attribute; an edge is either entity association relationship information or entity-attribute association relationship information; the value attached to an edge is entity attribute information. The second knowledge graph is constructed, and structured, in the same way as the first.
In a fourth step, the number of first edges corresponding to the first knowledge graph and the number of second edges corresponding to the second knowledge graph are determined.
In a fifth step, in response to determining that both the number of first edges and the number of second edges are greater than the target value, the entity similarity and the edge similarity corresponding to the first knowledge graph and the second knowledge graph are determined. The target value may be a preset value. For example, the target value is 10.
As an example, the execution subject may determine the entity coincidence rate between the entity information set corresponding to the first knowledge graph and the entity information set corresponding to the second knowledge graph as the entity similarity. First, the execution subject may determine the entity information shared between the entity information set corresponding to the first knowledge graph and the entity information set corresponding to the second knowledge graph as target entity information, to obtain a target entity information set. Then, the edge coincidence rate between the edges corresponding to the target entity information set in the first knowledge graph and the edges corresponding to the target entity information set in the second knowledge graph is determined as the edge similarity.
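The disclosure does not spell out the coincidence-rate formula; a common reading is intersection over union, which the sketch below assumes. The `(head, relation, tail)` edge layout is likewise an assumption.

```python
def entity_similarity(entities_a, entities_b):
    # entity coincidence rate: shared entities over the union
    return len(entities_a & entities_b) / len(entities_a | entities_b)

def edge_similarity(edges_a, edges_b, shared_entities):
    # restrict both edge sets to edges touching a shared (target) entity,
    # then take their coincidence rate
    sub_a = {e for e in edges_a if e[0] in shared_entities or e[2] in shared_entities}
    sub_b = {e for e in edges_b if e[0] in shared_entities or e[2] in shared_entities}
    union = sub_a | sub_b
    return len(sub_a & sub_b) / len(union) if union else 0.0
```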
In a sixth step, a second text similarity is generated according to the entity similarity and the edge similarity.
As an example, the execution subject may perform weighted summation on the entity similarity and the edge similarity to obtain the second text similarity.
In a seventh step, similarity weighted summation is performed on the first text similarity and the second text similarity, and the resulting weighted sum serves as the text similarity.
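The sixth and seventh steps share one weighted-summation shape. The sketch below illustrates it with equal weights, which are an assumption; the disclosure does not fix the weight values.

```python
def weighted_sum(similarities, weights):
    # similarity weighted summation: sum of pairwise products
    return sum(s * w for s, w in zip(similarities, weights))

# sixth step: combine entity similarity and edge similarity
second_text_sim = weighted_sum([0.8, 0.6], [0.5, 0.5])
# seventh step: combine the first and second text similarities
text_sim = weighted_sum([0.75, second_text_sim], [0.5, 0.5])
```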
Optionally, the determining the text similarity between the two input texts corresponding to the two input files may include the following steps:
In a first step, the text semantic overall similarity between the first input text and the second input text is determined. The text semantic overall similarity can characterize the semantic similarity of the text as a whole.
In a second step, the paragraph semantic similarity between the first input text and the second input text is determined. The paragraph semantic similarity can characterize the semantic similarity between paragraphs.
In a third step, similarity weighted summation is performed on the text semantic overall similarity and the paragraph semantic similarity, and the resulting weighted sum serves as the text similarity.
Optionally, the generating the file similarity for the two input files may include the following steps:
In a first step, a first input image set, a second input image set, a first input table set, and a second input table set corresponding to the two input files are determined.
In a second step, the image similarity between each first input image in the first input image set and each second input image in the second input image set is determined.
In a third step, the table similarity between each first input table in the first input table set and each second input table in the second input table set is determined.
In a fourth step, the file similarity is generated according to the text similarity, the image similarity, and the table similarity.
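A minimal sketch of the fourth step, assuming the pairwise image and table similarity scores are averaged and then combined with the text similarity by a weighted sum; the weights and function name are illustrative, not values from the disclosure.

```python
def file_similarity(text_sim, image_sims, table_sims, weights=(0.5, 0.3, 0.2)):
    # average the pairwise image and table similarity scores,
    # then combine them with the text similarity
    avg_image = sum(image_sims) / len(image_sims) if image_sims else 0.0
    avg_table = sum(table_sims) / len(table_sims) if table_sims else 0.0
    w_text, w_image, w_table = weights
    return w_text * text_sim + w_image * avg_image + w_table * avg_table
```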
In step 1043, in response to determining that the text similarity is greater than the second value, a file similarity is generated for the two input files.
In some embodiments, in response to determining that the text similarity is greater than the second value, the execution body may generate a file similarity for the two input files. The file similarity may be a similarity between file contents. The second value may be a preset value. For example, the second value may be 0.6.
Optionally, determining the image similarity between each first input image in the first input image set and each second input image in the second input image set includes the following steps:
For each first input image in the first input image set, performing the following feature matching step:
and a first sub-step of inputting the first input image into a first residual network model included in the image feature extraction model to generate first image feature information. The first image feature information may represent content feature semantics of image content corresponding to the first input image. The image feature extraction model may be a neural network model that generates image feature information. The first residual network model may be a multi-layer serial connected residual network.
And a second sub-step of performing a first upsampling process on the first image feature information to generate first image upsampled feature information. The feature dimension corresponding to the up-sampling feature information of the first image is larger than the feature dimension corresponding to the feature information of the first image.
As an example, the execution body may perform the first upsampling process on the first image characteristic information through a multi-layer series connected convolutional neural network to generate the first image upsampling characteristic information.
And a third sub-step of performing a first downsampling process of a predetermined dimension on the first image upsampling feature information to generate first image downsampling feature information. The feature dimension corresponding to the downsampling feature information of the first image is larger than the feature dimension corresponding to the feature information of the first image. The feature dimension corresponding to the up-sampling feature information of the first image is larger than the feature dimension corresponding to the down-sampling feature information of the first image.
And a fourth sub-step of performing a second upsampling process on the first image feature information to generate second image upsampled feature information. The feature dimension corresponding to the up-sampling feature information of the second image is equal to the feature dimension corresponding to the down-sampling feature information of the first image.
And a fifth sub-step of determining a first feature similarity between the first image downsampled feature information and the second image upsampled feature information. The first feature similarity may characterize the similarity between the corresponding feature information and may be a value between 0 and 1; a larger value indicates more similar feature information.
And a sixth sub-step of inputting, in response to determining that the first feature similarity is smaller than the target value, the first image feature information into a 10-layer serially connected convolutional neural network to obtain second image feature information. Wherein the target value may be a predetermined value. For example, the target value may be 0.7.
And a seventh sub-step of performing a second downsampling process on the first image feature information to generate second image downsampled feature information in response to determining that the first feature similarity is greater than or equal to the target value. The feature dimension corresponding to the downsampling feature information of the second image is smaller than the feature dimension corresponding to the feature information of the first image.
And an eighth substep of performing a third upsampling process of a predetermined dimension on the second image upsampling feature information to generate third image upsampling feature information. And the feature dimension corresponding to the up-sampling feature information of the third image is larger than the feature dimension corresponding to the down-sampling feature information of the second image.
And a ninth substep of performing a third downsampling process on the first image feature information to generate third image downsampled feature information. The feature dimension corresponding to the third image downsampling feature information is equal to the feature dimension corresponding to the third image upsampling feature information.
And a tenth substep of determining a second feature similarity between the third image downsampling feature information and the third image upsampling feature information. The second feature similarity may characterize the similarity between the corresponding feature information and may be a value between 0 and 1; a larger value indicates more similar feature information.
And an eleventh sub-step of performing feature information fusion on the second image downsampling feature information and the first image upsampling feature information in response to determining that the first feature similarity is greater than or equal to the target value, to obtain fused feature information, as first candidate feature information.
And a twelfth substep of inputting, in response to determining that the first feature similarity is less than the target value, the first image feature information into a 10-layer serially connected convolutional neural network to obtain second image feature information as the first candidate feature information. Wherein the target value may be a predetermined value. For example, the target value may be 0.7.
And a thirteenth substep of generating second candidate feature information corresponding to each second input image in the second input image set, to obtain a second candidate feature information set. Particular implementations may refer to the generation of the first candidate feature information.
And a fourteenth sub-step of determining, as the image similarity, a vector similarity corresponding to the first candidate feature information and each of the second candidate feature information in the second candidate feature information set.
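The sub-steps above amount to a cheap consistency check between two sampled views of the image features, with the deeper network used only as a fallback. The sketch below mimics that control flow with toy repeat/average sampling operators standing in for learned layers; every name is hypothetical, and the fusion is simplified to an element-wise average.

```python
def cosine(a, b):
    # cosine similarity as a stand-in for the feature similarity measure
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def upsample(feat, factor=2):
    # stand-in for a learned up-sampling layer: repeat each value
    return [x for x in feat for _ in range(factor)]

def downsample(feat, factor=2):
    # stand-in for a learned down-sampling layer: average each pair
    return [sum(feat[i:i + factor]) / factor for i in range(0, len(feat), factor)]

def select_candidate(feat, deep_cnn, target_value=0.7):
    up1 = upsample(feat)                  # first up-sampling
    down1 = downsample(up1)               # first down-sampling
    up2 = upsample(feat)                  # second up-sampling
    sim = cosine(downsample(up2), down1)  # first feature similarity
    if sim >= target_value:
        # fuse cheaply sampled views instead of running the deep network
        restored = upsample(downsample(feat))  # second down-sampling, restored
        return [(x + y) / 2 for x, y in zip(restored, down1)]
    # fallback: 10-layer serially connected CNN (stubbed by the caller)
    return deep_cnn(feat)

out = select_candidate([1.0, 2.0, 3.0, 4.0], deep_cnn=lambda f: f)
```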
The above "optional" content, as one aspect of the invention of the present disclosure, solves the second technical problem mentioned in the background art: "there is a deviation in the image feature extraction of input images, and more neural networks are often needed to extract feature information, resulting in larger computation memory occupation and low computation efficiency." By means of up-sampling and down-sampling, the present disclosure avoids repeated image feature extraction of the input image as far as possible, ensures the extraction efficiency and extraction quality of the feature information of the input image on the basis of fewer networks, and occupies fewer memory resources in subsequent practical application.
Step 1044, sending the file similarity and the file information corresponding to the two input files to an auditing end, so as to generate text similarity information for the two input files.
In some embodiments, the execution body may send the file similarity and the file information corresponding to the two input files to an auditing end, so as to generate text similarity information for the two input files. The auditing end may be a terminal that audits the authenticity of the file similarity and fine-tunes the file similarity.
The above embodiments of the present disclosure have the following beneficial effects: with the text similarity information generation method of some embodiments of the present disclosure, the file similarity between input files can be generated accurately and efficiently, ensuring the effectiveness of the input process. Specifically, the reason the similarity of related files is insufficiently accurate is that: not only is the efficiency low, but the similarity between two input files cannot be accurately determined, so that problems of surrounding marks and serial marks often arise during input. Based on this, the text similarity information generation method of some embodiments of the present disclosure first obtains the input file set and the input file, in order to determine whether surrounding marks or serial marks occur. Then, file analysis processing is performed on the input file set and the input file, respectively, to generate an input text set and an input text, facilitating subsequent similarity analysis of the texts. Next, text content that satisfies a preset similarity condition with the content of the input text is removed from each input text in the input text set to generate a removed input text, which serves as an input text. Here, by removing text content that satisfies the preset similarity condition, it is ensured that the similarity determination between two input files is not affected; if such text content were not removed, the degree of similarity between the files could not subsequently be determined accurately. Further, for each two input files in the obtained input file set, the following generation steps are performed: first, the file summary similarity between the two input files is determined.
Here, the similarity between the files is preliminarily determined from the perspective of the file summary at a small computational cost, which allows the similarity between files to be screened effectively. In a second step, in response to determining that the file summary similarity is greater than a first value, the text similarity between the two input texts corresponding to the two input files is determined. Here, on the premise that the file summary similarity is greater than the first value, the similarity calculation between the two files is carried further, so that the similarity between the files can be determined more accurately. In a third step, in response to determining that the text similarity is greater than a second value, the file similarity for the two input files is generated. On the premise that the text similarity is greater than the second value, the file similarity calculation between the two files is carried further, so that the similarity between the files can be determined still more accurately. In a fourth step, the file similarity and the file information corresponding to the two input files are sent to the auditing end to generate text similarity information for the two input files. In summary, the similarity between the two input files is determined step by step in multiple layers, so that the file similarity between input files is generated accurately and efficiently, ensuring the effectiveness of the input process.
With further reference to fig. 2, as an implementation of the method shown in the above figures, the present disclosure provides some embodiments of a text similarity information generating apparatus, which correspond to those method embodiments shown in fig. 1, and which are particularly applicable to various electronic devices.
As shown in fig. 2, a text similarity information generating apparatus 200 includes: an acquisition unit 201, an analysis processing unit 202, a removal unit 203, and an execution unit 204. The acquisition unit 201 is configured to acquire an input file set and an input file; the analysis processing unit 202 is configured to perform file analysis processing on the input file set and the input file, respectively, to generate an input text set and an input text; the removal unit 203 is configured to remove, from each input text in the input text set, text content that satisfies a preset similarity condition with the content of the input text, to generate a removed input text as an input text; the execution unit 204 is configured to perform, for each two input files in the obtained input file set, the following generation steps: determining the file summary similarity between the two input files; determining, in response to determining that the file summary similarity is greater than a first value, the text similarity between the two input texts corresponding to the two input files; generating, in response to determining that the text similarity is greater than a second value, a file similarity for the two input files; and sending the file similarity and the file information corresponding to the two input files to an auditing end to generate text similarity information for the two input files.
It will be appreciated that the elements described in the text similarity information generating apparatus 200 correspond to the respective steps in the method described with reference to fig. 1. Thus, the operations, features and advantages described above for the method are equally applicable to the text similarity information generating apparatus 200 and the units contained therein, and are not described herein.
Referring now to fig. 3, there is shown a schematic structural diagram of an electronic device 300 suitable for implementing some embodiments of the present disclosure. The electronic device shown in fig. 3 is merely an example and should not impose any limitation on the functionality or scope of use of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various suitable actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data required for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
In general, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 308 including, for example, magnetic tape, hard disk, etc.; and communication means 309. The communication means 309 may allow the electronic device 300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 3 shows an electronic device 300 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 3 may represent one device or a plurality of devices as needed.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such embodiments, the computer program may be downloaded and installed from a network via communications device 309, or from storage device 308, or from ROM 302. The above-described functions defined in the methods of some embodiments of the present disclosure are performed when the computer program is executed by the processing means 301.
It should be noted that, in some embodiments of the present disclosure, the computer readable medium may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire an input file set and an input file; perform file analysis processing on the input file set and the input file, respectively, to generate an input text set and an input text; remove, from each input text in the input text set, text content that satisfies a preset similarity condition with the content of the input text, to generate a removed input text as an input text; and perform, for each two input files in the obtained input file set, the following generation steps: determining the file summary similarity between the two input files; determining, in response to determining that the file summary similarity is greater than a first value, the text similarity between the two input texts corresponding to the two input files; generating, in response to determining that the text similarity is greater than a second value, a file similarity for the two input files; and sending the file similarity and the file information corresponding to the two input files to an auditing end to generate text similarity information for the two input files.
Computer program code for carrying out operations of some embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in some embodiments of the present disclosure may be implemented by means of software or by means of hardware. The described units may also be provided in a processor, for example, described as: a processor including an acquisition unit, an analysis processing unit, a removal unit, and an execution unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the acquisition unit may also be described as "a unit that acquires an input file set and an input file".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions in which the above features are replaced with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (6)

1. A text similarity information generation method comprises the following steps:
acquiring an input file set and an input file;
respectively carrying out file analysis processing on the input file set and the input file to generate an input text set and an input text;
removing, from each input text in the input text set, text content that satisfies a preset similarity condition with the content of the input text, to generate a removed input text as an input text;
for each two input files in the obtained input file set, performing the following generating steps:
determining a file summary similarity between the two input files;
determining, in response to determining that the file summary similarity is greater than a first value, a text similarity between two input texts corresponding to the two input files;
generating file similarity for the two input files in response to determining that the text similarity is greater than a second numerical value;
sending the file similarity and the file information corresponding to the two input files to an auditing end to generate text similarity information for the two input files,
the determining the text similarity between the two input texts corresponding to the two input files comprises the following steps:
Performing text word segmentation on the two input texts to respectively obtain a first word set and a second word set;
for each word in the first set of words, performing the following first processing step:
determining the word weight of the word in the first word set as a first word weight;
generating a word hash signature value for the word as a first word hash signature value;
multiplying the first word hash signature value and the first word weight to obtain a first multiplication result;
adding the first multiplication results in the obtained first multiplication result set to obtain a first addition result;
carrying out hash signature processing on the first addition result to generate a second word hash signature value;
for each word in the second set of words, performing the following second processing step:
determining the word weight of the word in the second word set as a second word weight;
generating a word hash signature value for the word as a third word hash signature value;
multiplying the third word hash signature value by the second word weight to obtain a second multiplication result;
adding the second multiplication results in the obtained second multiplication result set to obtain a second addition result;
Carrying out hash signature processing on the second addition result to generate a fourth word hash signature value;
determining the difference between the second word hash signature value and the fourth word hash signature value to obtain a difference result;
dividing the difference result by the binary bit number corresponding to the first addition result to obtain a division result;
subtracting the division result from the target value to obtain a subtraction result, and taking the subtraction result as text similarity,
or determining the text similarity between the two input texts corresponding to the two input files comprises the following steps:
performing title segmentation on the first input text and the second input text according to file title information to obtain a first input text segment sequence and a second input text segment sequence respectively, wherein the two input texts comprise: the first input text and the second input text;
determining a first input text vector corresponding to each first input text segment in the first input text segment sequence, and determining a second input text vector corresponding to each second input text segment in the second input text segment sequence;
inputting each first input text vector in the obtained first input text vector sequence into a first multi-layer serially connected recurrent neural network model to generate a first text feature information sequence;
inputting each second input text vector in the obtained second input text vector sequence into a second multi-layer serially connected recurrent neural network model to generate a second text feature information sequence;
determining feature similarity between each first text feature information in the first text feature information sequence and each second text feature information in the second text feature information sequence to obtain a feature similarity set;
generating at least one feature information group according to the feature similarity set, wherein the feature similarity between the first text feature information and the second text feature information included in the feature information group is larger than a third numerical value;
for each of the at least one feature information set, performing the following information generating step:
inputting the first text feature information and the second text feature information included in the feature information group into a semantic information generation model to generate first text semantic information and second text semantic information;
determining semantic similarity corresponding to the first text semantic information and the second text semantic information;
generating determination information for representing whether the first text feature information and the second text feature information in the feature information group are similar feature information according to the semantic similarity;
determining the group number of target feature information groups in the at least one feature information group according to the obtained at least one piece of determination information;
generating a first text similarity according to the number of groups;
for each target feature information group in at least one target feature information group, determining a first target input text and a second target input text corresponding to the target feature information group;
generating a first knowledge graph for the first target input text and a second knowledge graph for the second target input text;
determining the first edge number corresponding to the first knowledge graph and the second edge number corresponding to the second knowledge graph;
determining entity similarity and edge similarity corresponding to the first knowledge graph and the second knowledge graph in response to determining that the first edge number and the second edge number are both greater than a target value;
generating a second text similarity according to the entity similarity and the edge similarity;
and carrying out similarity weighted summation processing on the first text similarity and the second text similarity to obtain weighted summation similarity serving as the text similarity.
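The first alternative recited above (per-word hash signatures scaled by word weights, folded into a per-text signature, then compared and normalised) has the general shape of a simhash fingerprint. The sketch below is a minimal illustration in the conventional simhash form rather than the claim's literal arithmetic; the 64-bit width, the MD5-based word hash, and term frequency as the word weight are all assumptions not fixed by the claims.

```python
import hashlib
from collections import Counter

BITS = 64  # signature width; an assumption, the claims do not fix one

def word_hash(word: str) -> int:
    # 64-bit word hash signature derived from MD5 (illustrative choice)
    return int(hashlib.md5(word.encode("utf-8")).hexdigest()[:16], 16)

def text_signature(words) -> int:
    # Weight each word's hash by its term frequency and fold the weighted
    # bits into a single per-text signature (standard simhash folding)
    acc = [0] * BITS
    for word, weight in Counter(words).items():
        h = word_hash(word)
        for i in range(BITS):
            acc[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(BITS) if acc[i] > 0)

def text_similarity(words_a, words_b, target_value=1.0) -> float:
    # Normalised Hamming distance between the two signatures, subtracted
    # from the target value so identical texts score exactly target_value
    distance = bin(text_signature(words_a) ^ text_signature(words_b)).count("1")
    return target_value - distance / BITS
```

Identical token lists yield identical signatures and a similarity of exactly 1.0; unrelated texts typically land near 0.5, since random 64-bit signatures differ in roughly half their bits.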
2. The method of claim 1, wherein the determining the text similarity between the two input texts corresponding to the two input files comprises:
determining the text semantic overall similarity between the first input text and the second input text;
determining paragraph semantic similarity between the first input text and the second input text;
and carrying out similarity weighted summation processing on the text semantic overall similarity and the paragraph semantic similarity to obtain weighted summation similarity serving as the text similarity.
3. The method of claim 2, wherein the generating file similarities for the two input files comprises:
determining a first input image set, a second input image set, a first input table set and a second input table set corresponding to the two input files;
determining image similarity between each first input image in the first input image set and each second input image in the second input image set;
determining a table similarity between each first input table in the first input table set and each second input table in the second input table set;
and generating the file similarity according to the text similarity, the image similarity and the table similarity.
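The final step of claim 3 fuses the three per-modality scores into one file similarity. A weighted sum is the natural reading; the sketch below uses illustrative weights, since the claim does not specify them.

```python
def file_similarity(text_sim: float, image_sim: float, table_sim: float,
                    weights=(0.6, 0.2, 0.2)) -> float:
    """Weighted fusion of text, image and table similarities.

    The weights are assumed for illustration; any non-negative weights
    summing to 1 keep the result on the same [0, 1] scale as the inputs.
    """
    w_text, w_image, w_table = weights
    return w_text * text_sim + w_image * image_sim + w_table * table_sim
```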
4. A text similarity information generating apparatus comprising:
an acquisition unit configured to acquire an input file set and an input file;
an analysis processing unit configured to respectively perform file analysis processing on the input file set and the input file to generate an input text set and an input text;
a removing unit configured to remove, from each input text in the input text set, text content satisfying a preset similarity condition with content of the input text, so as to generate a removed input text as the input text;
an execution unit configured to execute the following generation steps for each two investment files in the obtained investment file set: determining file summary similarity between the two investment files; determining, in response to determining that the file summary similarity is greater than a first value, a text similarity between two input texts corresponding to the two input files; generating file similarity for the two input files in response to determining that the text similarity is greater than a second numerical value; transmitting the file similarity and the file information corresponding to the two input files to an auditing end to generate text similarity information for the two input files, wherein the determining the text similarity between the two input texts corresponding to the two input files comprises the following steps: performing text word segmentation on the two input texts to respectively obtain a first word set and a second word set; for each word in the first set of words, performing the following first processing step: determining the word weight of the word in the first word set as a first word weight; generating a word hash signature value for the word as a first word hash signature value; multiplying the first word hash signature value and the first word weight to obtain a first multiplication result; adding the first multiplication results of the obtained first multiplication result sets to obtain a first addition result; carrying out hash signature processing on the first addition result to generate a second word hash signature value; for each word in the second set of words, performing the following second processing step: determining the word weight of the word in the second word set as a second word weight; generating a word hash signature value for the word as a third word hash signature value; multiplying the second word hash signature value with the second word weight to obtain a second multiplication result; adding the 
second multiplication results of the second multiplication results set to obtain a second addition result; carrying out hash signature processing on the second addition result to generate a fourth word hash signature value; determining the difference between the second word hash signature value and the fourth word hash signature value to obtain a difference result; dividing the difference result by the binary bit number corresponding to the first addition result to obtain a division result; subtracting the division result from the target value to obtain a subtraction result, wherein the subtraction result is used as the text similarity or the text similarity between the two input texts corresponding to the two input files is determined, and the method comprises the following steps: performing title segmentation on the first input text and the second input text according to file title information to respectively obtain a first input text segmentation sequence and a second input text segmentation sequence, wherein the two input texts comprise: the first drop-in text and the second drop-in text; determining a first drop text vector corresponding to each first drop text segment in the sequence of first drop text segments, and determining a second drop text vector corresponding to each second drop text segment in the sequence of second drop text segments; inputting each first input text vector in the obtained first input text vector sequence to a first cyclic neural network model with multiple layers connected in series to generate a first text characteristic information sequence; inputting each second input text vector in the obtained second input text vector sequence to a second cyclic neural network model with multiple layers connected in series to generate a second text characteristic information sequence; determining feature similarity between each first text feature information in the first text feature information sequence and each second text feature information 
in the second text feature information sequence to obtain a feature similarity set; generating at least one feature information group according to the feature similarity set, wherein the feature similarity between the first text feature information and the second text feature information included in the feature information group is larger than a third numerical value; for each of the at least one feature information set, performing the following information generating step: inputting the first text feature information and the second text feature information included in the feature information group into a semantic information generation model to generate first text semantic information and second text semantic information; determining semantic similarity corresponding to the first text semantic information and the second text semantic information; generating determination information for representing whether the first text feature information and the second text feature information in the feature information group are similar feature information according to the semantic similarity; determining the group number of target feature information groups in the at least one feature information group according to the obtained at least one piece of determination information; generating a first text similarity according to the number of groups; for each target feature information group in at least one target feature information group, determining a first target input text and a second target input text corresponding to the target feature information group; generating a first knowledge graph for the first target input text and a second knowledge graph for the second target input text; determining the first edge number corresponding to the first knowledge graph and the second edge number corresponding to the second knowledge graph; determining entity similarity and edge similarity corresponding to the first knowledge-graph and the second knowledge-graph in response to 
determining that the number of the first edges and the number of the second edges are both greater than a target value; generating a second text similarity according to the entity similarity and the edge similarity; and carrying out similarity weighted summation processing on the first text similarity and the second text similarity to obtain weighted summation similarity serving as the text similarity.
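The entity/edge comparison recited in claims 1 and 4 is not tied to a specific measure; a Jaccard overlap over entity sets and edge (triple) sets is one plausible realisation. In this sketch the edge-count guard mirrors the recited condition that both edge numbers exceed the target value, and the equal 0.5 weights for the final weighted summation are assumptions.

```python
def jaccard(a: set, b: set) -> float:
    # Overlap ratio of two sets; two empty sets count as fully similar
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def second_text_similarity(entities_a, edges_a, entities_b, edges_b,
                           target_value=0, w_entity=0.5, w_edge=0.5):
    # Compare only when both knowledge graphs have more edges than the
    # target value, as recited; otherwise signal that no score is produced.
    if len(edges_a) <= target_value or len(edges_b) <= target_value:
        return None
    entity_sim = jaccard(set(entities_a), set(entities_b))
    edge_sim = jaccard(set(edges_a), set(edges_b))
    # Weighted summation of entity similarity and edge similarity
    return w_entity * entity_sim + w_edge * edge_sim
```

Edges are modelled here as hashable (subject, relation, object) triples, so set intersection compares whole relations, not just endpoints.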
5. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-3.
6. A computer readable medium having stored thereon a computer program, wherein the program when executed by a processor implements the method of any of claims 1-3.
CN202311444882.6A 2023-11-02 2023-11-02 Text similarity information generation method, device, equipment and computer readable medium Active CN117172220B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311444882.6A CN117172220B (en) 2023-11-02 2023-11-02 Text similarity information generation method, device, equipment and computer readable medium


Publications (2)

Publication Number Publication Date
CN117172220A (en) 2023-12-05
CN117172220B (en) 2024-02-02

Family

ID=88947196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311444882.6A Active CN117172220B (en) 2023-11-02 2023-11-02 Text similarity information generation method, device, equipment and computer readable medium

Country Status (1)

Country Link
CN (1) CN117172220B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109034717A (en) * 2018-06-05 2018-12-18 王振 A method for identifying collusive bid-rigging behavior in a bidding process
CN112765328A (en) * 2021-01-28 2021-05-07 珠海格力电器股份有限公司 Text similarity determination method, system, storage medium and equipment
CN112801217A (en) * 2021-03-19 2021-05-14 北京世纪好未来教育科技有限公司 Text similarity judgment method and device, electronic equipment and readable storage medium
CN113779996A (en) * 2021-08-31 2021-12-10 中国中医科学院中医药信息研究所 Standard entity text determination method and device based on BiLSTM model and storage medium
CN114595661A (en) * 2022-05-07 2022-06-07 深圳平安综合金融服务有限公司 Method, apparatus, and medium for reviewing bid document
CN115169342A (en) * 2022-07-13 2022-10-11 珠海格力电器股份有限公司 Text similarity calculation method and device, electronic equipment and storage medium
CN115471307A (en) * 2022-10-09 2022-12-13 国家电网有限公司 Audit evaluation information generation method and device based on knowledge graph and electronic equipment
CN115757741A (en) * 2022-11-28 2023-03-07 贵州电网有限责任公司 Method for determining similarity of subjective contents of bid document
CN115795000A (en) * 2023-02-07 2023-03-14 南方电网数字电网研究院有限公司 Bid-rigging identification method and device based on joint similarity algorithm comparison
WO2023195579A1 (en) * 2022-04-08 2023-10-12 주식회사 이노핀 Server and method for providing recommended investment item on basis of investment portfolio of investor
CN116932735A (en) * 2023-07-05 2023-10-24 浙江极氪智能科技有限公司 Text comparison method, device, medium and equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Text Similarity Calculation Method Based on Concept Vector Space; Li Lin; Li Hui; Data Analysis and Knowledge Discovery (05); full text *


Similar Documents

Publication Publication Date Title
CN114385780B (en) Program interface information recommendation method and device, electronic equipment and readable medium
CN111368551A (en) Method and device for determining event subject
CN113408507B (en) Named entity identification method and device based on resume file and electronic equipment
CN117131281B (en) Public opinion event processing method, apparatus, electronic device and computer readable medium
CN110852057A (en) Method and device for calculating text similarity
CN114970470B (en) Method and device for processing file information, electronic equipment and computer readable medium
CN115062119B (en) Government affair event handling recommendation method and device
CN114840634B (en) Information storage method and device, electronic equipment and computer readable medium
CN113946648B (en) Structured information generation method and device, electronic equipment and medium
CN117172220B (en) Text similarity information generation method, device, equipment and computer readable medium
CN113807056B (en) Document name sequence error correction method, device and equipment
CN112464654B (en) Keyword generation method and device, electronic equipment and computer readable medium
CN113723095A (en) Text auditing method and device, electronic equipment and computer readable medium
CN114385781B (en) Interface file recommendation method, device, equipment and medium based on statement model
CN115329767B (en) Method and device for extracting text entity, electronic equipment and storage medium
CN115543925B (en) File processing method, device, electronic equipment and computer readable medium
CN115481260A (en) Knowledge graph construction method and device based on audit information and electronic equipment
CN115328811B (en) Program statement testing method and device for industrial control network simulation and electronic equipment
CN114625876B (en) Method for generating author characteristic model, method and device for processing author information
CN115204150B (en) Information verification method and device, electronic equipment and computer readable medium
CN117852535A (en) Standard data binding information sending method, device, equipment and medium
CN117557822A (en) Image classification method, apparatus, electronic device, and computer-readable medium
CN117688236A (en) Method and device for pushing article description information, electronic equipment and computer medium
CN116843991A (en) Model training method, information generating method, device, equipment and medium
CN116362233A (en) False address recognition model training method, information generation method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant