CN108287824A - Semantic similarity calculation method and device - Google Patents


Info

Publication number
CN108287824A
Authority
CN
China
Prior art keywords
sentence
similarity
preliminary
word
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810188175.8A
Other languages
Chinese (zh)
Inventor
李勤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yunzhisheng Information Technology Co Ltd
Original Assignee
Beijing Yunzhisheng Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yunzhisheng Information Technology Co Ltd
Priority to CN201810188175.8A
Publication of CN108287824A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis

Abstract

The present invention relates to a semantic similarity calculation method and device. The method includes: preprocessing the first sentence and the second sentence of a sentence pair respectively, and extracting a first syntax, a second syntax, and statistical features between the first sentence and the second sentence; converting the words and parts of speech of the first sentence and the second sentence into vectors respectively, to obtain a corresponding first feature matrix and second feature matrix; determining a corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to a preset first deep neural network model; determining the similarity between the first sentence and the second sentence according to a preset second deep neural network model; and determining whether the first sentence and the second sentence are similar according to the similarity between them. This solution fuses word features, word-order features, phrase features, and sentence-level statistical features, so the similarity between sentences can be determined more accurately.

Description

Semantic similarity calculation method and device
Technical field
The present invention relates to the technical field of semantic recognition, and in particular to a semantic similarity calculation method and device.
Background technology
Semantic similarity calculation mainly judges whether two sentences are semantically similar, for example whether "What animals does the Arctic have" and "Which animals live in the Arctic" are similar. Current semantic similarity methods are mainly based on surface syntactic features: through feature selection, each sentence is expressed as a vector, the cosine similarity of the two sentence vectors is calculated, and the sentences are judged similar if the value exceeds a set similarity threshold and dissimilar otherwise; a minimal sketch of this baseline appears below.
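The conventional baseline described above can be illustrated with a short sketch. This is an illustration only: the bag-of-words vocabulary, the tokenization, and the 0.6 threshold are assumptions for the example, not values taken from the patent.

```python
# Conventional baseline: bag-of-words vectors plus a cosine-similarity threshold.
import numpy as np

def bag_of_words(tokens, vocabulary):
    """Count-based sentence vector over a fixed vocabulary."""
    vec = np.zeros(len(vocabulary))
    for t in tokens:
        if t in vocabulary:
            vec[vocabulary[t]] += 1.0
    return vec

vocab = {w: i for i, w in enumerate(
    ["what", "which", "animals", "does", "the", "arctic", "have", "live", "in"])}
a = bag_of_words(["what", "animals", "does", "the", "arctic", "have"], vocab)
b = bag_of_words(["which", "animals", "live", "in", "the", "arctic"], vocab)

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
similar = cosine > 0.6   # judged similar only if the cosine exceeds a set threshold
```

Because such a representation ignores word order and deeper semantics, the baseline motivates the problems listed next.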
Existing similarity calculation methods mainly have the following problems:
1) they lack modeling of the word order and semantics of a sentence;
2) they rely on large, highly accurate synonym or aligned-phrase resources.
Summary of the invention
Embodiments of the present invention provide a semantic similarity calculation method and device, so as to determine the similarity between sentences more accurately.
According to a first aspect of the embodiments of the present invention, a semantic similarity calculation method is provided, including:
preprocessing the first sentence and the second sentence of a sentence pair respectively, and extracting a first syntax corresponding to the first sentence, a second syntax corresponding to the second sentence, and statistical features between the first sentence and the second sentence;
converting the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain a corresponding first feature matrix and second feature matrix;
determining a corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and a preset first deep neural network model;
determining the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, a statistical feature vector corresponding to the statistical features, and a preset second deep neural network model;
determining whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence.
In one embodiment, converting the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain the corresponding first feature matrix and second feature matrix, includes:
converting the words in the first sentence and the second sentence into word vectors respectively using word2vec, to obtain a first word feature matrix corresponding to the first sentence and a second word feature matrix corresponding to the second sentence;
converting the parts of speech in the first sentence and the second sentence into part-of-speech vectors respectively using pos2vec, to obtain a first part-of-speech feature matrix corresponding to the first sentence and a second part-of-speech feature matrix corresponding to the second sentence;
concatenating the first word feature matrix and the first part-of-speech feature matrix to obtain the first feature matrix, and concatenating the second word feature matrix and the second part-of-speech feature matrix to obtain the second feature matrix.
In one embodiment, determining the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and the preset first deep neural network model includes:
taking the first feature matrix and the second feature matrix respectively as the input of the first deep neural network model, to obtain the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence.
In one embodiment, determining the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, the statistical feature vector corresponding to the statistical features, and the preset second deep neural network model includes:
performing point-wise subtraction and point-wise multiplication on the preliminary representation of the first sentence and the preliminary representation of the second sentence, to obtain a corresponding geometric distance feature matrix and angular distance feature matrix;
encoding the statistical features into a vector, to obtain the corresponding statistical feature vector;
concatenating the statistical feature vector, the geometric distance feature matrix, and the angular distance feature matrix, to obtain a concatenation result;
taking the concatenation result as the input of the second deep neural network model, and calculating the similarity between the first sentence and the second sentence.
In one embodiment, determining whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence includes:
when the similarity between the first sentence and the second sentence is greater than a preset similarity threshold, determining that the first sentence and the second sentence are similar;
when the similarity between the first sentence and the second sentence is less than or equal to the preset similarity threshold, determining that the first sentence and the second sentence are dissimilar.
According to a second aspect of the embodiments of the present invention, a semantic similarity calculation device is provided, including:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to:
preprocess the first sentence and the second sentence of a sentence pair respectively, and extract a first syntax corresponding to the first sentence, a second syntax corresponding to the second sentence, and statistical features between the first sentence and the second sentence;
convert the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain a corresponding first feature matrix and second feature matrix;
determine a corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and a preset first deep neural network model;
determine the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, a statistical feature vector corresponding to the statistical features, and a preset second deep neural network model;
determine whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence.
In one embodiment, converting the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain the corresponding first feature matrix and second feature matrix, includes:
converting the words in the first sentence and the second sentence into word vectors respectively using word2vec, to obtain a first word feature matrix corresponding to the first sentence and a second word feature matrix corresponding to the second sentence;
converting the parts of speech in the first sentence and the second sentence into part-of-speech vectors respectively using pos2vec, to obtain a first part-of-speech feature matrix corresponding to the first sentence and a second part-of-speech feature matrix corresponding to the second sentence;
concatenating the first word feature matrix and the first part-of-speech feature matrix to obtain the first feature matrix, and concatenating the second word feature matrix and the second part-of-speech feature matrix to obtain the second feature matrix.
In one embodiment, determining the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and the preset first deep neural network model includes:
taking the first feature matrix and the second feature matrix respectively as the input of the first deep neural network model, to obtain the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence.
In one embodiment, determining the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, the statistical feature vector corresponding to the statistical features, and the preset second deep neural network model includes:
performing point-wise subtraction and point-wise multiplication on the preliminary representation of the first sentence and the preliminary representation of the second sentence, to obtain a corresponding geometric distance feature matrix and angular distance feature matrix;
encoding the statistical features into a vector, to obtain the corresponding statistical feature vector;
concatenating the statistical feature vector, the geometric distance feature matrix, and the angular distance feature matrix, to obtain a concatenation result;
taking the concatenation result as the input of the second deep neural network model, and calculating the similarity between the first sentence and the second sentence.
In one embodiment, determining whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence includes:
when the similarity between the first sentence and the second sentence is greater than a preset similarity threshold, determining that the first sentence and the second sentence are similar;
when the similarity between the first sentence and the second sentence is less than or equal to the preset similarity threshold, determining that the first sentence and the second sentence are dissimilar.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present invention.
Other features and advantages of the present invention will be set forth in the following description, and will in part become apparent from the description or be understood by implementing the present invention. The objectives and other advantages of the present invention can be realized and obtained by the structures particularly pointed out in the written description, the claims, and the accompanying drawings.
The technical solution of the present invention is described in further detail below with reference to the drawings and embodiments.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the description, serve to explain the principles of the present invention.
Fig. 1 is a flowchart of a semantic similarity calculation method according to an exemplary embodiment.
Fig. 2 is a flowchart of step S102 of a semantic similarity calculation method according to an exemplary embodiment.
Fig. 3 is a flowchart of another semantic similarity calculation method according to an exemplary embodiment.
Fig. 4 is a flowchart of step S104 of a semantic similarity calculation method according to an exemplary embodiment.
Fig. 5 is a flowchart of step S105 of a semantic similarity calculation method according to an exemplary embodiment.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, and examples thereof are illustrated in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. On the contrary, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.
Fig. 1 is a flowchart of a semantic similarity calculation method according to an exemplary embodiment. The semantic similarity calculation method can be applied to a terminal device or a server, where the terminal device may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, fitness equipment, a personal digital assistant, or the like. As shown in Fig. 1, the method includes steps S101-S105:
In step S101, the first sentence and the second sentence of a sentence pair are preprocessed respectively, and a first syntax corresponding to the first sentence, a second syntax corresponding to the second sentence, and statistical features between the first sentence and the second sentence are extracted;
Here, the syntax, i.e. the n-gram statistical characteristics of a sentence, includes part-of-speech features and word-order features; the statistical features include similarity features between parts of speech, matching-degree features between the sentences, word matching-degree features, and the like. A minimal illustration of such pair-level statistics is sketched below.
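The patent does not fix a concrete statistical feature set, so the sketch below assumes three plausible pair-level statistics (word matching degree, part-of-speech similarity, and a length ratio) purely for illustration.

```python
# Illustrative sentence-pair statistics; the exact feature set is an assumption.
from collections import Counter

def pair_statistics(words_a, pos_a, words_b, pos_b):
    """Compute simple pair-level statistics from tokenized words and POS tags."""
    def overlap(xs, ys):
        common = Counter(xs) & Counter(ys)            # multiset intersection
        return 2.0 * sum(common.values()) / (len(xs) + len(ys))

    return [
        overlap(words_a, words_b),                    # word matching degree
        overlap(pos_a, pos_b),                        # part-of-speech similarity
        min(len(words_a), len(words_b)) / max(len(words_a), len(words_b)),  # length ratio
    ]

# Example: tokenized forms of the two Arctic questions from the background section.
features = pair_statistics(
    ["what", "animals", "does", "the", "arctic", "have"],
    ["PRON", "NOUN", "VERB", "DET", "NOUN", "VERB"],
    ["which", "animals", "live", "in", "the", "arctic"],
    ["PRON", "NOUN", "VERB", "ADP", "DET", "NOUN"],
)
```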
In step S102, the words and parts of speech in the first sentence and the second sentence are converted into vectors respectively, to obtain a corresponding first feature matrix and second feature matrix;
In step S103, a corresponding preliminary representation of the first sentence and preliminary representation of the second sentence are determined according to the first feature matrix, the second feature matrix, and a preset first deep neural network model;
In step S104, the similarity between the first sentence and the second sentence is determined according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, a statistical feature vector corresponding to the statistical features, and a preset second deep neural network model;
In step S105, whether the first sentence and the second sentence are similar is determined according to the similarity between the first sentence and the second sentence.
In this embodiment, the spatial distance and cosine distance between the sentences are determined according to the words, parts of speech, and statistical features of the sentence pair and the first deep neural network model, and the similarity between the sentences is then determined according to this spatial distance and cosine distance. In this way, word features, word-order features, phrase features, and sentence-level statistical features are fused, so the similarity between the sentences can be determined more accurately.
Fig. 2 is a flowchart of step S102 of a semantic similarity calculation method according to an exemplary embodiment.
As shown in Fig. 2, in one embodiment, the above step S102 includes steps S201-S203:
In step S201, the words in the first sentence and the second sentence are converted into word vectors respectively using word2vec, to obtain a first word feature matrix corresponding to the first sentence and a second word feature matrix corresponding to the second sentence;
In step S202, the parts of speech in the first sentence and the second sentence are converted into part-of-speech vectors respectively using pos2vec, to obtain a first part-of-speech feature matrix corresponding to the first sentence and a second part-of-speech feature matrix corresponding to the second sentence;
In step S203, the first word feature matrix and the first part-of-speech feature matrix are concatenated to obtain the first feature matrix, and the second word feature matrix and the second part-of-speech feature matrix are concatenated to obtain the second feature matrix.
In this embodiment, the words and parts of speech of each sentence are represented as vectors, and the corresponding feature matrix of the sentence is then obtained, so that the similarity between the sentences can subsequently be determined from the feature matrices.
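A minimal sketch of steps S201-S203, assuming the trained word2vec and pos2vec models are available as simple lookup tables; the 300- and 50-dimensional embedding sizes and the toy vocabulary are assumptions, not values from the patent.

```python
import numpy as np

def build_feature_matrix(words, pos_tags, word_vectors, pos_vectors):
    """Concatenate the word-vector row and the POS-vector row for every token."""
    word_mat = np.stack([word_vectors[w] for w in words])     # (seq_len, word_dim)
    pos_mat = np.stack([pos_vectors[p] for p in pos_tags])    # (seq_len, pos_dim)
    return np.concatenate([word_mat, pos_mat], axis=1)        # (seq_len, word_dim + pos_dim)

# Toy lookup tables standing in for the trained word2vec / pos2vec models.
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=300) for w in ["the", "arctic", "animals"]}
pos_vectors = {p: rng.normal(size=50) for p in ["DET", "NOUN"]}

first_feature_matrix = build_feature_matrix(
    ["the", "arctic", "animals"], ["DET", "NOUN", "NOUN"], word_vectors, pos_vectors)
```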
Fig. 3 is a flowchart of another semantic similarity calculation method according to an exemplary embodiment.
As shown in Fig. 3, in one embodiment, the above step S103 includes step S301:
In step S301, the first feature matrix and the second feature matrix are respectively taken as the input of the first deep neural network model, to obtain the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence.
In this embodiment, the first feature matrix is taken as the input of the first deep neural network model to obtain the preliminary representation of the first sentence, and the second feature matrix is taken as the input of the first deep neural network model to obtain the preliminary representation of the second sentence, so that the similarity between the sentences can subsequently be determined from the preliminary sentence representations.
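The patent does not specify the architecture of the first deep neural network model. The sketch below assumes a bidirectional LSTM encoder with max pooling as one plausible instantiation (PyTorch); the input size 350 simply matches the assumed 300-dimensional word vectors plus 50-dimensional part-of-speech vectors above.

```python
# Assumed encoder for the "first deep neural network model"; architecture is illustrative.
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, feature_matrix):             # (batch, seq_len, input_dim)
        outputs, _ = self.lstm(feature_matrix)     # (batch, seq_len, 2 * hidden_dim)
        return outputs.max(dim=1).values           # max-pool over time -> preliminary representation

encoder = SentenceEncoder(input_dim=350, hidden_dim=128)
r_a = encoder(torch.randn(1, 6, 350))              # preliminary representation of the first sentence
r_b = encoder(torch.randn(1, 6, 350))              # preliminary representation of the second sentence
```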
Fig. 4 is a flowchart of step S104 of a semantic similarity calculation method according to an exemplary embodiment.
As shown in Fig. 4, in one embodiment, the above step S104 includes steps S401-S404:
In step S401, point-wise subtraction and point-wise multiplication are performed on the preliminary representation of the first sentence and the preliminary representation of the second sentence, to obtain a corresponding geometric distance feature matrix and angular distance feature matrix;
If the input is a pair of sentences A and B whose preliminary representations are denoted R_A and R_B, the geometric distance between them is expressed as dist(|R_A - R_B|), and the angular distance is expressed as angle(R_A ⊙ R_B).
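A minimal sketch of step S401; r_a and r_b stand in for the preliminary representations produced by the encoder sketch above (shape (1, 256) there), so random tensors are used to keep the snippet self-contained.

```python
import torch

r_a = torch.randn(1, 256)           # stands in for R_A from the encoder sketch
r_b = torch.randn(1, 256)           # stands in for R_B from the encoder sketch
geometric = torch.abs(r_a - r_b)    # point-wise subtraction -> geometric distance features
angular = r_a * r_b                 # point-wise multiplication (R_A ⊙ R_B) -> angular distance features
```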
In step S402, the statistical features are encoded into a vector, to obtain the corresponding statistical feature vector;
In step S403, the statistical feature vector, the geometric distance feature matrix, and the angular distance feature matrix are concatenated to obtain a concatenation result;
In step S404, the concatenation result is taken as the input of the second deep neural network model, and the similarity between the first sentence and the second sentence is calculated.
In this embodiment, the statistical feature vector, the geometric distance feature matrix, and the angular distance feature matrix between the sentences are taken as the input of the second deep neural network model, which transforms the final representation of the sentences and outputs the probability that the two sentences are similar, i.e. the similarity between the sentences. In this way, word features, word-order features, phrase features, and sentence-level statistical features are fused, so the similarity between the sentences can be determined more accurately.
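The second deep neural network model is likewise not specified in the patent; the sketch below assumes a small feed-forward classifier with a sigmoid output, and uses random stand-ins for the statistical feature vector and the two distance feature matrices.

```python
# Assumed form of the "second deep neural network model": concatenate, score, squash to [0, 1].
import torch
import torch.nn as nn

stats_vec = torch.randn(1, 3)        # stands in for the encoded statistical feature vector
geometric = torch.randn(1, 256)      # stands in for |R_A - R_B|
angular = torch.randn(1, 256)        # stands in for R_A ⊙ R_B
combined = torch.cat([stats_vec, geometric, angular], dim=1)   # concatenation result

scorer = nn.Sequential(
    nn.Linear(combined.shape[1], 64),
    nn.ReLU(),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)
similarity = scorer(combined).item()  # similar-probability value between the two sentences
```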
Fig. 5 is a flowchart of step S105 of a semantic similarity calculation method according to an exemplary embodiment.
As shown in Fig. 5, in one embodiment, the above step S105 includes steps S501-S502:
In step S501, when the similarity between the first sentence and the second sentence is greater than a preset similarity threshold, it is determined that the first sentence and the second sentence are similar;
In step S502, when the similarity between the first sentence and the second sentence is less than or equal to the preset similarity threshold, it is determined that the first sentence and the second sentence are dissimilar.
In this embodiment, a preset similarity threshold can be set, for example 80%; the two sentences are then determined to be similar when the similarity between them is greater than 80%, and dissimilar otherwise.
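The decision step with the 80% example threshold can be sketched in a few lines; the similarity value here is a placeholder standing in for the output of the second network.

```python
PRESET_SIMILARITY = 0.8
similarity = 0.91                            # stands in for the model's output similarity
is_similar = similarity > PRESET_SIMILARITY  # True: the sentence pair is judged similar
```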
The following are device embodiments of the present invention, which can be used to execute the method embodiments of the present invention.
According to a second aspect of the embodiments of the present invention, a semantic similarity calculation device is provided, including:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to:
preprocess the first sentence and the second sentence of a sentence pair respectively, and extract a first syntax corresponding to the first sentence, a second syntax corresponding to the second sentence, and statistical features between the first sentence and the second sentence;
convert the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain a corresponding first feature matrix and second feature matrix;
determine a corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and a preset first deep neural network model;
determine the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, a statistical feature vector corresponding to the statistical features, and a preset second deep neural network model;
determine whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence.
In one embodiment, converting the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain the corresponding first feature matrix and second feature matrix, includes:
converting the words in the first sentence and the second sentence into word vectors respectively using word2vec, to obtain a first word feature matrix corresponding to the first sentence and a second word feature matrix corresponding to the second sentence;
converting the parts of speech in the first sentence and the second sentence into part-of-speech vectors respectively using pos2vec, to obtain a first part-of-speech feature matrix corresponding to the first sentence and a second part-of-speech feature matrix corresponding to the second sentence;
concatenating the first word feature matrix and the first part-of-speech feature matrix to obtain the first feature matrix, and concatenating the second word feature matrix and the second part-of-speech feature matrix to obtain the second feature matrix.
In one embodiment, determining the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and the preset first deep neural network model includes:
taking the first feature matrix and the second feature matrix respectively as the input of the first deep neural network model, to obtain the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence.
In one embodiment, determining the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, the statistical feature vector corresponding to the statistical features, and the preset second deep neural network model includes:
performing point-wise subtraction and point-wise multiplication on the preliminary representation of the first sentence and the preliminary representation of the second sentence, to obtain a corresponding geometric distance feature matrix and angular distance feature matrix;
encoding the statistical features into a vector, to obtain the corresponding statistical feature vector;
concatenating the statistical feature vector, the geometric distance feature matrix, and the angular distance feature matrix, to obtain a concatenation result;
taking the concatenation result as the input of the second deep neural network model, and calculating the similarity between the first sentence and the second sentence.
In one embodiment, determining whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence includes:
when the similarity between the first sentence and the second sentence is greater than a preset similarity threshold, determining that the first sentence and the second sentence are similar;
when the similarity between the first sentence and the second sentence is less than or equal to the preset similarity threshold, determining that the first sentence and the second sentence are dissimilar.
It should be understood by those skilled in the art that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is also intended to include these modifications and variations.

Claims (10)

1. A semantic similarity calculation method, characterized by comprising:
preprocessing the first sentence and the second sentence of a sentence pair respectively, and extracting a first syntax corresponding to the first sentence, a second syntax corresponding to the second sentence, and statistical features between the first sentence and the second sentence;
converting the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain a corresponding first feature matrix and second feature matrix;
determining a corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and a preset first deep neural network model;
determining the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, a statistical feature vector corresponding to the statistical features, and a preset second deep neural network model;
determining whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence.
2. The semantic similarity calculation method according to claim 1, characterized in that converting the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain the corresponding first feature matrix and second feature matrix, comprises:
converting the words in the first sentence and the second sentence into word vectors respectively using word2vec, to obtain a first word feature matrix corresponding to the first sentence and a second word feature matrix corresponding to the second sentence;
converting the parts of speech in the first sentence and the second sentence into part-of-speech vectors respectively using pos2vec, to obtain a first part-of-speech feature matrix corresponding to the first sentence and a second part-of-speech feature matrix corresponding to the second sentence;
concatenating the first word feature matrix and the first part-of-speech feature matrix to obtain the first feature matrix, and concatenating the second word feature matrix and the second part-of-speech feature matrix to obtain the second feature matrix.
3. The semantic similarity calculation method according to claim 1, characterized in that determining the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and the preset first deep neural network model comprises:
taking the first feature matrix and the second feature matrix respectively as the input of the first deep neural network model, to obtain the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence.
4. The semantic similarity calculation method according to claim 1, characterized in that determining the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, the statistical feature vector corresponding to the statistical features, and the preset second deep neural network model comprises:
performing point-wise subtraction and point-wise multiplication on the preliminary representation of the first sentence and the preliminary representation of the second sentence, to obtain a corresponding geometric distance feature matrix and angular distance feature matrix;
encoding the statistical features into a vector, to obtain the corresponding statistical feature vector;
concatenating the statistical feature vector, the geometric distance feature matrix, and the angular distance feature matrix, to obtain a concatenation result;
taking the concatenation result as the input of the second deep neural network model, and calculating the similarity between the first sentence and the second sentence.
5. The semantic similarity calculation method according to any one of claims 1 to 4, characterized in that determining whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence comprises:
when the similarity between the first sentence and the second sentence is greater than a preset similarity threshold, determining that the first sentence and the second sentence are similar;
when the similarity between the first sentence and the second sentence is less than or equal to the preset similarity threshold, determining that the first sentence and the second sentence are dissimilar.
6. A semantic similarity calculation device, characterized by comprising:
a processor; and
a memory for storing processor-executable instructions;
wherein the processor is configured to:
preprocess the first sentence and the second sentence of a sentence pair respectively, and extract a first syntax corresponding to the first sentence, a second syntax corresponding to the second sentence, and statistical features between the first sentence and the second sentence;
convert the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain a corresponding first feature matrix and second feature matrix;
determine a corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and a preset first deep neural network model;
determine the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, a statistical feature vector corresponding to the statistical features, and a preset second deep neural network model;
determine whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence.
7. The semantic similarity calculation device according to claim 6, characterized in that converting the words and parts of speech in the first sentence and the second sentence into vectors respectively, to obtain the corresponding first feature matrix and second feature matrix, comprises:
converting the words in the first sentence and the second sentence into word vectors respectively using word2vec, to obtain a first word feature matrix corresponding to the first sentence and a second word feature matrix corresponding to the second sentence;
converting the parts of speech in the first sentence and the second sentence into part-of-speech vectors respectively using pos2vec, to obtain a first part-of-speech feature matrix corresponding to the first sentence and a second part-of-speech feature matrix corresponding to the second sentence;
concatenating the first word feature matrix and the first part-of-speech feature matrix to obtain the first feature matrix, and concatenating the second word feature matrix and the second part-of-speech feature matrix to obtain the second feature matrix.
8. The semantic similarity calculation device according to claim 6, characterized in that determining the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence according to the first feature matrix, the second feature matrix, and the preset first deep neural network model comprises:
taking the first feature matrix and the second feature matrix respectively as the input of the first deep neural network model, to obtain the corresponding preliminary representation of the first sentence and preliminary representation of the second sentence.
9. The semantic similarity calculation device according to claim 6, characterized in that determining the similarity between the first sentence and the second sentence according to the preliminary representation of the first sentence, the preliminary representation of the second sentence, the statistical feature vector corresponding to the statistical features, and the preset second deep neural network model comprises:
performing point-wise subtraction and point-wise multiplication on the preliminary representation of the first sentence and the preliminary representation of the second sentence, to obtain a corresponding geometric distance feature matrix and angular distance feature matrix;
encoding the statistical features into a vector, to obtain the corresponding statistical feature vector;
concatenating the statistical feature vector, the geometric distance feature matrix, and the angular distance feature matrix, to obtain a concatenation result;
taking the concatenation result as the input of the second deep neural network model, and calculating the similarity between the first sentence and the second sentence.
10. The semantic similarity calculation device according to any one of claims 6 to 9, characterized in that determining whether the first sentence and the second sentence are similar according to the similarity between the first sentence and the second sentence comprises:
when the similarity between the first sentence and the second sentence is greater than a preset similarity threshold, determining that the first sentence and the second sentence are similar;
when the similarity between the first sentence and the second sentence is less than or equal to the preset similarity threshold, determining that the first sentence and the second sentence are dissimilar.
CN201810188175.8A 2018-03-07 2018-03-07 Semantic similarity calculation method and device Pending CN108287824A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810188175.8A CN108287824A (en) 2018-03-07 2018-03-07 Semantic similarity calculation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810188175.8A CN108287824A (en) 2018-03-07 2018-03-07 Semantic similarity calculation method and device

Publications (1)

Publication Number Publication Date
CN108287824A true CN108287824A (en) 2018-07-17

Family

ID=62833315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810188175.8A Pending CN108287824A (en) 2018-03-07 2018-03-07 Semantic similarity calculation method and device

Country Status (1)

Country Link
CN (1) CN108287824A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100228540A1 (en) * 1999-11-12 2010-09-09 Phoenix Solutions, Inc. Methods and Systems for Query-Based Searching Using Spoken Input
CN106445920A (en) * 2016-09-29 2017-02-22 北京理工大学 Sentence similarity calculation method based on sentence meaning structure characteristics
CN107291699A (en) * 2017-07-04 2017-10-24 湖南星汉数智科技有限公司 A kind of sentence semantic similarity computational methods

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DANIEL BÄR 等: "UKP:computing semantic textual similarity by combining multiple content similarity measures", 《FIRST JOINT CONFERENCE ON LEXICAL AND COMPUTATIONAL SEMANTICS》 *
KAI SHENG TAI 等: "Improved semantic representations from tree-structured long short-term memory networks", 《HTTPS://ARXIV.ORG/ABS/1503.00075》 *
NHZSN: "A sentence similarity calculation method based on Tree-LSTM" (in Chinese), 《HTTP://WWW.DOC88.COM/P-9025669443117.HTML》 *
QIAN CHEN 等: "Enhanced LSTM for Natural Language Inference", 《PROCEEDINGS OF THE 55TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101494A (en) * 2018-08-10 2018-12-28 哈尔滨工业大学(威海) A method of it is calculated for Chinese sentence semantic similarity, equipment and computer readable storage medium
CN110646763A (en) * 2019-10-10 2020-01-03 出门问问信息科技有限公司 Sound source positioning method and device based on semantics and storage medium
CN111144129A (en) * 2019-12-26 2020-05-12 成都航天科工大数据研究院有限公司 Semantic similarity obtaining method based on autoregression and self-coding
CN111144129B (en) * 2019-12-26 2023-06-06 成都航天科工大数据研究院有限公司 Semantic similarity acquisition method based on autoregressive and autoencoding
CN112380830A (en) * 2020-06-18 2021-02-19 达而观信息科技(上海)有限公司 Method, system and computer readable storage medium for matching related sentences in different documents
CN111737988A (en) * 2020-06-24 2020-10-02 深圳前海微众银行股份有限公司 Method and device for recognizing repeated sentences

Similar Documents

Publication Publication Date Title
CN108287824A (en) Semantic similarity calculation method and device
US11803711B2 (en) Depthwise separable convolutions for neural machine translation
US11093813B2 (en) Answer to question neural networks
CN105719649B (en) Audio recognition method and device
CN109065054A (en) Speech recognition error correction method, device, electronic equipment and readable storage medium storing program for executing
CN113051371B (en) Chinese machine reading understanding method and device, electronic equipment and storage medium
CN107301866B (en) Information input method
US20240028893A1 (en) Generating neural network outputs using insertion commands
CN110704597B (en) Dialogue system reliability verification method, model generation method and device
WO2019084558A1 (en) Selecting answer spans from electronic documents using machine learning
CN111696521A (en) Method for training speech clone model, readable storage medium and speech clone method
CN111508478B (en) Speech recognition method and device
US11816443B2 (en) Method, device, and storage medium for generating response
CN111402864A (en) Voice processing method and electronic equipment
CN113887253A (en) Method, apparatus, and medium for machine translation
CN110929532B (en) Data processing method, device, equipment and storage medium
KR102621436B1 (en) Voice synthesizing method, device, electronic equipment and storage medium
CN106599637A (en) Method and device for inputting verification code into verification interface
CN109800286B (en) Dialog generation method and device
CN116384412A (en) Dialogue content generation method and device, computer readable storage medium and terminal
CN115965791A (en) Image generation method and device and electronic equipment
US20210019477A1 (en) Generating neural network outputs using insertion operations
CN111966803A (en) Dialogue simulation method, dialogue simulation device, storage medium and electronic equipment
CN111128234A (en) Spliced voice recognition detection method, device and equipment
US20220188163A1 (en) Method for processing data, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20180717)