CN114417811A - Similarity calculation method and device based on semantics and storage medium

Info

Publication number: CN114417811A
Application number: CN202111660511.2A
Authority: CN
Inventors: 胡成
Original assignee: Beijing Jiesi Security Technology Co ltd
Current assignee: Beijing Jiesi Security Technology Co ltd
Priority date: 2021-12-30
Filing date: 2021-12-30
Publication date: 2022-04-29

Abstract

The invention discloses a semantic-based similarity calculation method, device and storage medium. The method comprises: processing a provided business document to generate a template, the processing comprising word segmentation of the business document and construction of a space vector for the segmented words; setting keywords and key sentences associated with document semantics for the generated template; processing the document to be matched in the same way as the template was generated, and then performing matching calculation between the document to be matched and the template to obtain a matching similarity, the matching calculation comprising calculation of word frequency similarity, weighted keyword matching degree and weighted key sentence matching degree; and, if the matching similarity reaches a set threshold, determining that the document to be matched is a document needing specific protection. The beneficial effects are as follows: in addition to conventional word frequency similarity calculation, the scheme adds weighted processing of semantically associated keywords and key sentences, so that the matching result is more accurate and misjudgment is reduced.

Description

Similarity calculation method and device based on semantics and storage medium

Technical Field

The invention relates to the technical field of text similarity, in particular to a similarity calculation method and device based on semantics and a storage medium.

Background

In the endpoint security industry, it is often necessary to detect whether a user's specific business document has been quoted in other texts. A common matching approach is to define sensitive words in advance and search the document by string comparison; a document in which a specific sensitive word is matched is considered a sensitive document that needs to be protected.

Although schemes that judge similarity by means of word segmentation vectors exist in the prior art, they do not judge on the basis of document content and semantics, so the matching result is prone to misjudgment.

Disclosure of Invention

Aiming at the technical defects in the prior art, the embodiments of the present invention provide a semantic-based similarity calculation method, apparatus and storage medium, which can make the matching result more accurate and thereby reduce the misjudgment.

In order to achieve the above object, in a first aspect, an embodiment of the present invention provides a similarity calculation method based on semantics, where the method includes:

processing the provided service document to generate a template; the processing comprises word segmentation processing of the business document and construction of a space vector for the word segmentation;

setting keywords and key sentences associated with document semantics for the generated template;

processing the document to be matched in the same way as the template was generated, and then performing matching calculation between the document to be matched and the template to obtain a matching similarity; the matching calculation comprises word frequency similarity calculation, weighted keyword matching degree calculation and weighted key sentence matching degree calculation;

and if the matching similarity reaches a set threshold, the document to be matched is a document needing specific protection.

Preferably, during the matching calculation, it is first determined whether the document to be matched is a subset of the service document, and if so, it is directly determined that the document to be matched is a document that needs to be specifically protected without calculation.

Preferably, the weighted keyword matching degree is obtained by the following steps:

firstly, respectively acquiring a word segmentation list of the service document and a word segmentation list of a document to be matched;

then, taking the length of the longer word segmentation list as the denominator, and the number of segmented words in the longest identical segment shared by the business document and the document to be matched as the numerator, to obtain the keyword matching degree;

and finally, combining the keyword matching degree with a preset keyword weighted value to obtain the weighted keyword matching degree.

Preferably, the weighted key sentence matching degree is obtained by the following steps:

extracting key sentences from the service document and the document to be matched respectively to form respective key sentence lists;

then, taking the length of the longer key sentence list as the denominator, and the number of similar key sentences in the two lists as the numerator, to obtain the key sentence matching degree;

and finally, combining the matching degree of the key sentences with a preset weight value of the key sentences to obtain the matching degree of the weighted key sentences.

In a second aspect, an embodiment of the present invention further provides a similarity calculation apparatus based on semantics, including:

the template generating module is used for processing the provided service document to generate a template; the processing comprises word segmentation processing of the business document and construction of a space vector for the word segmentation;

the setting module is used for setting keywords and key sentences which are associated with document semantics for the generated template;

the document to be matched generating module is used for processing the document to be matched according to the same mode of generating the template;

a similarity calculation module to:

after the document to be matched is processed, perform matching calculation between the document to be matched and the template to obtain a matching similarity; the matching calculation comprises word frequency similarity calculation, weighted keyword matching degree calculation and weighted key sentence matching degree calculation;

if the matching similarity reaches a set threshold, the document to be matched is a document needing specific protection;

and the returning module is used for displaying the matching calculation result obtained by the similarity calculation module.

In a third aspect, the present invention also provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the steps of the method as provided in the first aspect.

In the embodiments of the invention, the business document provided as material is processed to generate a template, and the associated keywords and key sentences are set; the document to be matched is then processed in the same way as the template, and the word frequency similarity, the weighted keyword matching degree and the weighted key sentence matching degree are calculated against the template. In addition to conventional word frequency similarity calculation, the scheme adds weighted processing of semantically associated keywords and key sentences, so that the matching result is more accurate and misjudgment is reduced.

Drawings

In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below.

Fig. 1 is a flowchart of a semantic-based similarity calculation method according to an embodiment of the present invention;

Fig. 2 is a block diagram of a semantic-based similarity calculation apparatus according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to fig. 1, a method for calculating similarity based on semantics according to an embodiment of the present invention includes:

S101, processing the provided business document to generate a template; the processing comprises word segmentation of the business document and construction of a space vector for the segmented words.

Specifically, the business document is a material document provided for a business scenario in the user's actual application, for example a marketing strategy or planning report of an enterprise that involves confidential content.

The template is generated by learning the material document: the document content is extracted, word segmentation is performed, stop words are removed, word frequencies are calculated, and a word frequency vector is constructed.
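The template-generation steps above can be sketched as follows. This is only an illustration under stated assumptions: the whitespace tokenizer and the `STOP_WORDS` set are hypothetical stand-ins (the source does not name a word segmenter or a stop-word list), and `build_template` is a name introduced here.

```python
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def segment(text):
    """Split text into lowercase tokens on whitespace.

    A stand-in for a real word segmenter, which the source does not name.
    """
    return [tok.strip(".,;:!?()\"'").lower() for tok in text.split()]

def build_template(text, stop_words=STOP_WORDS):
    """Return (token list, word-frequency vector) with stop words removed."""
    tokens = [t for t in segment(text) if t and t not in stop_words]
    return tokens, Counter(tokens)

tokens, freq = build_template(
    "The marketing plan of the enterprise is a confidential marketing report."
)
print(tokens)
print(freq["marketing"])
```

The same function would be applied to the document to be matched, since the text requires it to be processed in the same way as the template.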

And S102, setting keywords and key sentences which are associated with document semantics for the generated template.

Specifically, the keywords can be set in two ways: in one, they are obtained according to word frequency; in the other, they are pre-marked according to the type of the business document. A key sentence is composed of a plurality of keywords.
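A minimal sketch of the two ways of setting keywords, assuming that "obtained according to word frequency" means taking the most frequent words and that both sources can be merged into one list. The function name `select_keywords`, the `top_k` parameter and the sample data are hypothetical and not taken from the source.

```python
from collections import Counter

def select_keywords(freq, top_k=3, premarked=()):
    """Merge pre-marked keywords with the top_k most frequent words.

    Merging the two sources is an assumption; the source only says that
    two setting modes exist.
    """
    by_frequency = [word for word, _ in freq.most_common(top_k)]
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(list(premarked) + by_frequency))

freq = Counter({"marketing": 4, "budget": 3, "plan": 2, "team": 1})
keywords = select_keywords(freq, top_k=2, premarked=["confidential"])
print(keywords)
```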

S103, processing the document to be matched in the same way as the template was generated, and then performing matching calculation between the document to be matched and the template to obtain a matching similarity; the matching calculation comprises word frequency similarity calculation, weighted keyword matching degree calculation and weighted key sentence matching degree calculation.

Specifically, matching similarity = word frequency space vector similarity × word frequency weight + content keyword matching degree × keyword weight + content key sentence matching degree × key sentence weight; the weight values may be set at the same time, during the setting process.

The calculation of the word frequency similarity is a mature prior art and is not described herein in detail;
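The source does not name the measure used for word frequency similarity. Cosine similarity over word-frequency vectors is one common choice, and the sketch below assumes it purely for illustration; `cosine_similarity` is a name introduced here.

```python
import math
from collections import Counter

def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

template = Counter({"marketing": 2, "plan": 1})
candidate = Counter({"marketing": 2, "plan": 1, "lunch": 1})
print(cosine_similarity(template, candidate))
```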

the weighted keyword matching degree is obtained through the following steps:

firstly, respectively acquiring the word segmentation list of the business document and the word segmentation list of the document to be matched;

then, taking the length of the longer word segmentation list as the denominator, and the number of segmented words in the longest identical segment shared by the business document and the document to be matched as the numerator, to obtain the keyword matching degree; that is, because a segmented word may occur at any position in a document, matching on the longest identical segment constrains and reflects the order of the segmented words, which reduces cases in which a conventional similarity judgment matches a document whose content differs greatly from the original business document;

finally, combining the keyword matching degree with a preset keyword weight value to obtain the weighted keyword matching degree;

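similarly, the weighted key sentence matching degree is obtained through the following steps:

firstly, extracting key sentences from the business document and the document to be matched respectively to form respective key sentence lists;

then, taking the length of the longer key sentence list as the denominator, and the number of similar key sentences in the two lists as the numerator, to obtain the key sentence matching degree;

finally, combining the key sentence matching degree with a preset key sentence weight value to obtain the weighted key sentence matching degree;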

it should be noted that, in this embodiment, the keyword matching degree and the content keyword matching degree have the same meaning; the matching degree of the key sentences has the same meaning with the matching degree of the content key sentences.

And S104, if the matching similarity reaches a set threshold, the document to be matched is a document needing specific protection.

Specifically, the setting of the threshold may be adjusted according to the service type, and is not limited herein.

To facilitate a better understanding of the invention, a specific example is described.

The word frequency weight value is set to 0.5, the keyword weight value to 0.3 and the key sentence weight value to 0.2.

1. Content keyword matching degree calculation

Word segmentation is performed on the business document A and the comparison article B respectively, with the following results:

Word segmentation list of business document A: A1, A2, ..., A15

Word segmentation list of comparison article B: B1, B2, ..., B21

The length of the longer word segmentation list (here, the list of B, with 21 entries) is taken as the denominator. The longest identical word segmentation segment shared by business document A and comparison article B is then found (assume that A6, A7, A8, A9 and A10 are identical to B5, B6, B7, B8 and B9), so the content keyword matching degree is 5/21.
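The 5/21 figure in this example can be reproduced with a short sketch. The dynamic-programming search for the longest shared contiguous run is one standard way to find the "longest identical segment"; the source does not say which algorithm is used, and the function names and placeholder tokens below are hypothetical.

```python
def longest_common_run(a, b):
    """Length of the longest contiguous run of tokens shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous token of each list.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def keyword_matching_degree(a, b):
    """Longest shared run divided by the length of the longer token list."""
    longer = max(len(a), len(b))
    return longest_common_run(a, b) / longer if longer else 0.0

# Rebuild the example: 15 tokens in A, 21 in B, A6..A10 identical to B5..B9.
a = [f"A{i}" for i in range(1, 16)]
b = [f"B{i}" for i in range(1, 22)]
shared = ["s1", "s2", "s3", "s4", "s5"]  # placeholder tokens
a[5:10] = shared
b[4:9] = shared

degree = keyword_matching_degree(a, b)   # 5 / 21
weighted = degree * 0.3                  # keyword weight from the example
print(degree, weighted)
```

Because only a contiguous run counts, the same five words scattered across B in a different order would score lower, which is the ordering constraint the text describes.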

2. Content key sentence matching degree calculation

Key sentences are extracted from the business document A and the comparison article B respectively, with the following results:

Key sentence list of business document A: W1, W2

Key sentence list of comparison article B: C1, C2

The length of the longer key sentence list is taken as the denominator, and the numerator is increased by 1 for each similar key sentence; assuming that the longer list contains 9 key sentences and that 5 sentences are similar, the content key sentence matching degree is 5/9;
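The source does not define when two key sentences count as "similar". The sketch below assumes a Jaccard overlap of word sets with a hypothetical threshold of 0.8, and counts the template sentences that have a similar counterpart; both choices, the function names and the sample sentences are illustrative assumptions only.

```python
def jaccard(s1, s2):
    """Word-set overlap between two sentences (assumed similarity measure)."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    union = t1 | t2
    return len(t1 & t2) / len(union) if union else 0.0

def key_sentence_matching_degree(template_sents, candidate_sents, threshold=0.8):
    """Similar-sentence count divided by the length of the longer list."""
    longer = max(len(template_sents), len(candidate_sents))
    if longer == 0:
        return 0.0
    # Add 1 for each template sentence with a similar candidate sentence.
    similar = sum(
        1 for t in template_sents
        if any(jaccard(t, c) >= threshold for c in candidate_sents)
    )
    return similar / longer

template_sents = [
    "quarterly revenue target is confidential",
    "launch date moved to march",
    "pricing strategy for enterprise customers",
]
candidate_sents = [
    "the quarterly revenue target is confidential",
    "launch date moved to march",
    "lunch menu for friday",
    "team offsite is next week",
]

degree = key_sentence_matching_degree(template_sents, candidate_sents)  # 2 / 4
weighted = degree * 0.2  # key sentence weight from the example
print(degree, weighted)
```

With 5 similar sentences and a longer list of 9, the same function would yield the 5/9 of the example.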

the matching similarity is therefore: word frequency space vector similarity × 0.5 + (5/21) × 0.3 + (5/9) × 0.2.
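As a sketch of the weighted sum, the snippet below plugs in the weights and the two matching degrees from the example. The word frequency similarity of 0.8 and the threshold of 0.5 are hypothetical values chosen only to make the arithmetic concrete; the source gives neither.

```python
# Weights from the worked example in the text.
WORD_FREQ_WEIGHT = 0.5
KEYWORD_WEIGHT = 0.3
KEY_SENTENCE_WEIGHT = 0.2

def matching_similarity(word_freq_sim, keyword_degree, key_sentence_degree):
    """Weighted sum of the three partial scores."""
    return (word_freq_sim * WORD_FREQ_WEIGHT
            + keyword_degree * KEYWORD_WEIGHT
            + key_sentence_degree * KEY_SENTENCE_WEIGHT)

# Hypothetical inputs: 0.8 and the threshold are not from the source.
score = matching_similarity(0.8, 5 / 21, 5 / 9)
THRESHOLD = 0.5
needs_protection = score >= THRESHOLD
print(round(score, 4), needs_protection)
```

Since the three weights sum to 1 and each partial score lies in [0, 1], the result also lies in [0, 1], which makes a single threshold per business type straightforward to set.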

In another embodiment, to further improve the processing efficiency, the method further comprises:

when matching calculation is to be performed on the document to be matched, it is first judged whether the document to be matched is a subset of the business document; if so, no calculation is needed, and the document to be matched is directly determined to be a document needing specific protection.

In this way, cases in which the business document selected as material and the article to be compared differ greatly can be handled conveniently; the subset judgment avoids the corresponding matching calculation, which further improves efficiency.

For example, if A is completely contained in B, or B is completely contained in A, the content keyword matching degree and the content key sentence matching degree are not calculated.
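The containment shortcut can be sketched as below. It assumes that "completely contained" means one token list appears as a contiguous run inside the other, which the source does not state explicitly; `contains_run`, `needs_protection` and the `score_fn` callback are hypothetical names.

```python
def contains_run(haystack, needle):
    """True if needle occurs as a contiguous token run inside haystack."""
    n = len(needle)
    if n == 0:
        return False  # an empty document is not treated as contained
    return any(haystack[i:i + n] == needle
               for i in range(len(haystack) - n + 1))

def needs_protection(template_tokens, candidate_tokens, score_fn, threshold):
    """Skip scoring when either document is contained in the other."""
    if (contains_run(template_tokens, candidate_tokens)
            or contains_run(candidate_tokens, template_tokens)):
        return True
    # Otherwise fall back to the full matching calculation.
    return score_fn(template_tokens, candidate_tokens) >= threshold

template = ["annual", "marketing", "plan", "is", "confidential"]
excerpt = ["marketing", "plan", "is"]
print(needs_protection(template, excerpt, lambda a, b: 0.0, 0.5))
```

Here the placeholder `score_fn` returns 0.0 on purpose: the excerpt is still flagged, showing that the decision comes from the containment check alone.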

According to this technical scheme, the business document provided as material is processed to generate a template, and the associated keywords and key sentences are set; the document to be matched is then processed in the same way as the template, and the word frequency similarity, the weighted keyword matching degree and the weighted key sentence matching degree are calculated against the template. In addition to conventional word frequency similarity calculation, the scheme adds weighted processing of semantically associated keywords and key sentences, so that the matching result is more accurate and misjudgment is reduced.

Based on the same inventive concept, the embodiment of the present invention provides a similarity calculation apparatus based on semantics, as shown in fig. 2, including a template generation module 1, a setting module 2, a to-be-matched document generation module 3, a similarity calculation module 4, and a return module 5.

The template generating module 1 is used for processing the provided service document to generate a template; the processing comprises word segmentation processing of the business document and construction of a space vector for the word segmentation;

the setting module 2 is used for setting keywords and key sentences which are associated with document semantics for the generated template;

the document to be matched generating module 3 is used for processing the document to be matched according to the same mode of generating the template;

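a similarity calculation module 4, configured to:

after the document to be matched is processed, perform matching calculation between the document to be matched and the template to obtain a matching similarity; the matching calculation comprises word frequency similarity calculation, weighted keyword matching degree calculation and weighted key sentence matching degree calculation;

if the matching similarity reaches a set threshold, determine that the document to be matched is a document needing specific protection;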

and the returning module 5 is used for displaying the matching calculation result obtained by the similarity calculation module.

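In application, the weighted keyword matching degree is obtained through the following steps: firstly, the word segmentation list of the business document and the word segmentation list of the document to be matched are acquired respectively; then, the length of the longer word segmentation list is taken as the denominator, and the number of segmented words in the longest identical segment shared by the two documents as the numerator, to obtain the keyword matching degree; finally, the keyword matching degree is combined with a preset keyword weight value to obtain the weighted keyword matching degree.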

Further, in order to improve processing efficiency, during the matching calculation, it is first determined whether the document to be matched is a subset of the service document, and if so, it is directly determined that the document to be matched is a document that needs specific protection without calculation.

It should be noted that, for a more specific workflow of the similarity calculation apparatus, please refer to the foregoing method embodiment, which is not described herein again.

This scheme overcomes the defect that existing similarity matching algorithms judge similarity mainly by means of word segmentation vectors and do not judge on the basis of document content and semantics. In addition to the conventional word frequency space vector similarity, the scheme adds weighted processing of content keywords and content key sentences, so that the matching result is more accurate and misjudgment is reduced.

In this embodiment, a computer-readable storage medium is further provided, where a computer program is stored, and when executed by a processor, the computer program causes the processor to execute the steps of the embodiment of the semantic-based similarity calculation method.

In particular, the computer-readable storage medium may include a cache and high-speed random access memory (RAM), such as common double data rate synchronous dynamic random access memory (DDR SDRAM), and may also include non-volatile memory (NVRAM), such as one or more read-only memories (ROM), disk storage devices, flash memory devices or other non-volatile solid-state memory devices, as well as compact discs (CD-ROM, DVD-ROM), floppy disks, data tapes, and so forth.

Those of ordinary skill in the art will appreciate that the various illustrative modules and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division of the modules is only one logical division, and other divisions are possible in practice. For example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention.

Claims

1. A semantic-based similarity calculation method, comprising:

2. The semantic-based similarity calculation method according to claim 1, wherein during the matching calculation, it is first determined whether the document to be matched is a subset of the business documents, and if so, it is directly determined that the document to be matched is a document that needs specific protection without calculation.

3. A semantic-based similarity calculation method according to claim 1 or 2, wherein the weighted keyword matching degree is obtained by:

4. The semantic-based similarity calculation method according to claim 3, wherein the weighted key sentence matching degree is obtained by the following steps:

5. A semantic-based similarity calculation apparatus, comprising:

a similarity calculation module to:

6. The semantic-based similarity calculation device according to claim 5, wherein during the matching calculation, it is first determined whether the document to be matched is a subset of the business documents, and if so, it is directly determined that the document to be matched is a document that needs specific protection without calculation.

7. The semantic-based similarity calculation apparatus according to claim 5 or 6, wherein the weighted keyword matching degree is obtained by:

8. The semantic-based similarity computation apparatus according to claim 7, wherein the weighted key sentence matching degree is obtained by:

9. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out the steps of a semantic-based similarity calculation method according to any one of claims 1 to 4.