
CN113361248B - Text similarity calculation method, device, equipment and storage medium

Info

Publication number: CN113361248B
Application number: CN202110732136.1A
Authority: CN
Inventors: 莫琪
Original assignee: Ping An Puhui Enterprise Management Co Ltd
Current assignee: Ping An Puhui Enterprise Management Co Ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2022-08-12
Anticipated expiration: 2041-06-30
Also published as: CN113361248A

Abstract

The application discloses a text similarity calculation method, device, equipment and storage medium, and belongs to the technical field of artificial intelligence. The method comprises: calculating the relevance of a first participle to obtain a first relevance, and calculating the relevance of a second participle to obtain a second relevance; performing vector conversion on the first participle and the second participle to obtain a first word vector and a second word vector; performing weighted summation on the first word vector based on the first relevance to obtain a first sentence vector, and performing weighted summation on the second word vector based on the second relevance to obtain a second sentence vector; and calculating the text similarity between a first text and a second text based on the first sentence vector and the second sentence vector. In addition, the application also relates to blockchain technology, and the first text and the second text can be stored in a blockchain. By adding the relevance of the text participles in a preset standard text when the sentence vectors of a text are generated, the application improves the accuracy of text similarity calculation.

Description

Text similarity calculation method, device, equipment and storage medium

Technical Field

The application belongs to the technical field of artificial intelligence, and particularly relates to a text similarity calculation method, device, equipment and storage medium.

Background

At present, most competency assessments on the market rely on objective questions: the applicant is given several options and selects the correct answer according to the question. Assessments of this kind are rigid, and answering them involves test-taking technique and inherent limitations, so they cannot examine the applicant's breadth of knowledge and thinking ability comprehensively and do not truly reflect the applicant's level.

In addition, although a small proportion of competency assessments contain subjective narrative questions, these are generally graded in a combined human-machine manner: either assessors grade the answers according to their own experience and understanding, or the literal similarity between the applicant's answer and the standard answer is calculated directly by a similarity calculation method. However, different assessors grade differently, and manual evaluation is inefficient and influenced by personal experience, so the result may be biased. When grading relies on directly calculating the literal similarity of the answers, scoring precision cannot be guaranteed and the result may likewise be biased.

Disclosure of Invention

An object of the embodiment of the present application is to provide a method, an apparatus, a computer device, and a storage medium for calculating a similarity of a text, so as to solve a technical problem of low precision caused by directly calculating a literal similarity of a text in an existing text similarity calculation scheme.

In order to solve the above technical problem, an embodiment of the present application provides a method for calculating similarity of texts, which adopts the following technical solutions:

a method of similarity calculation for text, comprising:

acquiring a first text and a second text, and performing word segmentation processing on the first text and the second text respectively to obtain a first participle and a second participle;

acquiring a preset standard text, calculating the correlation degree of the first participle and the standard text to obtain a first correlation degree, and calculating the correlation degree of the second participle and the standard text to obtain a second correlation degree;

respectively carrying out vector transformation on the first participle and the second participle to obtain a first word vector and a second word vector;

carrying out weighted summation on the first word vector based on the first correlation degree to obtain a first sentence vector, and carrying out weighted summation on the second word vector based on the second correlation degree to obtain a second sentence vector;

calculating a text similarity between the first text and the second text based on the first sentence vector and the second sentence vector.

Further, the step of obtaining a preset standard text, calculating the correlation between the first participle and the standard text to obtain a first correlation, and calculating the correlation between the second participle and the standard text to obtain a second correlation specifically includes:

calculating the weight of the first participle in the standard text to obtain a first weight, and calculating the weight of the second participle in the standard text to obtain a second weight;

calculating the relevance score of the first participle and the standard text to obtain a first relevance score, and calculating the relevance score of the second participle and the standard text to obtain a second relevance score;

calculating the relevance of the first participle and the standard text based on the first weight, the first relevance score and a preset BM25 algorithm to obtain a first relevance, and calculating the relevance of the second participle and the standard text based on the second weight, the second relevance score and a preset BM25 algorithm to obtain a second relevance.

Further, the step of calculating the weight of the first participle in the standard text to obtain a first weight, and calculating the weight of the second participle in the standard text to obtain a second weight specifically includes:

counting the number of texts containing the first participle in the standard text to obtain a first text number, and counting the number of texts containing the second participle in the standard text to obtain a second text number;

counting the number of all texts in the standard text to obtain the total number of the texts;

and calculating the weight of the first participle in the standard text based on the first text number and the text total number to obtain a first weight, and calculating the weight of the second participle in the standard text based on the second text number and the text total number to obtain a second weight.

Further, the step of calculating the relevance score of the first participle and the standard text to obtain a first relevance score, and calculating the relevance score of the second participle and the standard text to obtain a second relevance score specifically includes:

calculating the frequency of the first participle in the first text to obtain a first text frequency, and calculating the frequency of the second participle in the second text to obtain a second text frequency;

traversing the first text to obtain the text length of the first text to obtain a first text length, and traversing the second text to obtain the text length of the second text to obtain a second text length;

traversing the standard text to obtain the average length of the text;

calculating a relevance score of the first participle and the standard text based on the first text frequency, the first text length and the text average length to obtain a first relevance score, and calculating a relevance score of the second participle and the standard text based on the second text frequency, the second text length and the text average length to obtain a second relevance score.

Further, the step of calculating the relevance between the first participle and the standard text based on the first weight, the first relevance score and a preset BM25 algorithm to obtain a first relevance, and calculating the relevance between the second participle and the standard text based on the second weight, the second relevance score and a preset BM25 algorithm to obtain a second relevance specifically includes:

calculating the relevance of the first participle to the standard text according to the following formula:

$$Score(q_{1i}, d) = W_{1i} \cdot R(q_{1i}, d)$$

calculating the relevance of the second participle to the standard text according to the following formula:

$$Score(q_{2i}, d) = W_{2i} \cdot R(q_{2i}, d)$$

wherein $Score(q_{1i}, d)$ is the first correlation degree, $Score(q_{2i}, d)$ is the second correlation degree, $q_{1i}$ is the first participle, $q_{2i}$ is the second participle, $d$ is the standard text, $W_{1i}$ is the first weight, $W_{2i}$ is the second weight, $R(q_{1i}, d)$ is the first relevance score, and $R(q_{2i}, d)$ is the second relevance score.

Further, the step of performing weighted summation on the first word vector based on the first correlation degree to obtain a first sentence vector, and performing weighted summation on the second word vector based on the second correlation degree to obtain a second sentence vector specifically includes:

weighted summation of the first word vectors according to the following formula:

weighted summation of the second word vectors according to the following formula:

wherein $Vec\_Q_1$ is the first sentence vector, $Vec\_Q_2$ is the second sentence vector, $Q_1\_vec_i$ is a first word vector, $Q_2\_vec_i$ is a second word vector, $m$ is the number of first word vectors, and $n$ is the number of second word vectors.
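One form of the two weighted sums that is consistent with the symbols defined here can be written as follows. It is a sketch that assumes each word vector is scaled by its relevance $Score(q, d)$ and the results are summed without any normalizing factor, which the text does not confirm.

```latex
\begin{aligned}
Vec\_Q_1 &= \sum_{i=1}^{m} Score(q_{1i}, d)\, Q_1\_vec_i \\
Vec\_Q_2 &= \sum_{i=1}^{n} Score(q_{2i}, d)\, Q_2\_vec_i
\end{aligned}
```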

Further, the step of respectively performing vector transformation on the first participle and the second participle to obtain a first word vector and a second word vector specifically includes:

calculating the word frequency of the first participle to obtain a first word frequency, and calculating the word frequency of the second participle to obtain a second word frequency;

and performing vector transformation on the first participle based on the first word frequency to obtain a first word vector, and performing vector transformation on the second participle based on the second word frequency to obtain a second word vector.

In order to solve the above technical problem, an embodiment of the present application further provides a device for calculating similarity of texts, which adopts the following technical solutions:

an apparatus for similarity calculation of text, comprising:

the word segmentation processing module is used for acquiring a first text and a second text, and performing word segmentation processing on the first text and the second text respectively to obtain a first participle and a second participle;

the relevancy calculation module is used for acquiring a preset standard text, calculating the relevancy of the first participle and the standard text to obtain a first relevancy, and calculating the relevancy of the second participle and the standard text to obtain a second relevancy;

the vector conversion module is used for respectively carrying out vector conversion on the first participle and the second participle to obtain a first word vector and a second word vector;

the weighted summation module is used for carrying out weighted summation on the first word vector based on the first correlation degree to obtain a first sentence vector, and carrying out weighted summation on the second word vector based on the second correlation degree to obtain a second sentence vector;

a similarity calculation module for calculating a text similarity between the first text and the second text based on the first sentence vector and the second sentence vector.

In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:

a computer device comprising a memory having computer readable instructions stored therein and a processor implementing the steps of the method of similarity calculation of text as described above when executing the computer readable instructions.

In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:

a computer readable storage medium having computer readable instructions stored thereon which, when executed by a processor, implement the steps of a method of similarity calculation of text as described above.

Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:

the application discloses a method, a device, equipment and a storage medium for calculating the similarity of texts, and belongs to the technical field of artificial intelligence. The method comprises the steps of performing word segmentation processing on two texts needing similarity calculation to obtain word segments of the two texts, calculating the correlation degrees of the two text word segments in a preset standard text respectively, converting the word segments of the two texts into corresponding word vectors respectively, performing weighted summation on the word vectors of the two text word segments respectively based on the correlation degrees of the two text word segments in the standard text to obtain sentence vectors corresponding to the two texts, and calculating the similarity of the two sentence vectors based on a cosine similarity calculation method to obtain the similarity of the two texts. When sentence vectors of a text are generated, the relevancy of text participles in a preset standard text is added, the literal similarity of the text is compared, text semantic factors are considered, and the similarity calculation precision of the text is improved.

Drawings

In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.

FIG. 1 illustrates an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 illustrates a flow diagram for one embodiment of a method of similarity calculation of text according to the present application;

FIG. 3 shows a schematic structural diagram of an embodiment of an apparatus for similarity calculation of text according to the present application;

FIG. 4 shows a schematic block diagram of one embodiment of a computer device according to the present application.

Detailed Description

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the foregoing drawings are used for distinguishing between different objects and not for describing a particular sequential order.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.

In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.

As shown in FIG. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links, or fiber optic cables.

The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.

The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, desktop computers, and the like.

The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.

It should be noted that the method for calculating the similarity of the text provided in the embodiment of the present application is generally executed by a server, and accordingly, the device for calculating the similarity of the text is generally disposed in the server.

It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.

With continuing reference to FIG. 2, a flow diagram of one embodiment of a method of similarity calculation of text in accordance with the present application is shown. The method for calculating the similarity of the texts comprises the following steps of:

S201, acquiring a first text and a second text, and performing word segmentation processing on the first text and the second text respectively to obtain a first participle and a second participle.

In a specific embodiment of the application, in the recruitment service scenario, the first text may be an answer to one of a set of application questions input by an applicant, the second text may be a standard answer corresponding to the one of the set of application questions, the preset standard text may be all standard answers of the set of application questions, and the second text may belong to one of the preset standard texts.

The recruiter can set the application test questions according to the recruitment requirements. One specific set of application test questions is as follows:

{"id": 1, "text": "Please describe in detail a difference of opinion or disagreement you had with a colleague in the past half year."},
{"id": 2, "text": "What is the greatest difficulty or challenge you have encountered in communicating with others? How did you handle it at the time?"},
{"id": 3, "text": "Please describe a case in which your communication led to a result that satisfied all parties. What exactly did you do at the time?"},
{"id": 4, "text": "How good do you think your own communication skills are? Please give an example."}

Professional human experts then give the corresponding standard answers according to the content of the application test questions.

Specifically, when receiving the similarity calculation instruction, the server acquires the first text and the second text and preprocesses them. The preprocessing comprises punctuation removal and stop-word removal. The server then performs word segmentation processing on the preprocessed first text and the preprocessed second text respectively to obtain a plurality of first participles and a plurality of second participles.
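The preprocessing and segmentation step can be sketched as follows. This is a minimal illustration only: the function name `preprocess_and_segment` and the tiny stop-word list are hypothetical, and a whitespace split stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool). Because of that stand-in, stop words are dropped after splitting rather than before.

```python
import re

# Hypothetical stop-word list; a real deployment would load a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}

def preprocess_and_segment(text, stop_words=STOP_WORDS):
    """Remove punctuation and stop words, then return the remaining tokens."""
    # Replace every character that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    # Whitespace splitting is an assumption standing in for a word segmenter.
    return [tok for tok in no_punct.split() if tok not in stop_words]
```

Calling the function on an applicant answer and on the matching standard answer yields the two token lists that the later steps treat as the first and second participles.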

In this embodiment, the electronic device (for example, the server/terminal device shown in FIG. 1) on which the text similarity calculation method operates may receive the similarity calculation instruction through a wired connection or a wireless connection. It should be noted that the wireless connection may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra-wideband) connection, and other wireless connection means now known or developed in the future.

S202, acquiring a preset standard text, calculating the correlation degree of the first participle and the standard text to obtain a first correlation degree, and calculating the correlation degree of the second participle and the standard text to obtain a second correlation degree.

Specifically, after obtaining the first participle and the second participle, the server obtains a preset standard text, calculates a degree of correlation between the first participle and the standard text based on a preset BM25 algorithm to obtain a first degree of correlation, and calculates a degree of correlation between the second participle and the standard text based on a preset BM25 algorithm to obtain a second degree of correlation. The preset standard text comprises a plurality of sub-texts, and the second text is any one of the plurality of sub-texts.

The BM25 algorithm is generally used for scoring search relevance. It is also a search algorithm in ES (Elasticsearch) and is generally used for calculating the relevance between a query and a set of texts. When the similarity between sentences is measured, each sentence is segmented into words, the resulting words are represented by word vectors, the word vectors are added and averaged to obtain a sentence vector representing the sentence, and the similarity between the sentences is obtained by calculating the similarity of their sentence vectors.

S203, respectively carrying out vector transformation on the first participle and the second participle to obtain a first word vector and a second word vector.

Specifically, the server counts the word frequencies of the first participles and the second participles, and performs vector conversion on the first participles and the second participles according to the counted word frequencies to obtain the first word vectors and the second word vectors, where the word frequency (TF) is the frequency with which a given word occurs in a given document. In a specific embodiment of the present application, the first text $Q_1$ yields m first participles after word segmentation, and the m first participles are converted into m word vectors $(Q_1\_vec_1, Q_1\_vec_2, \ldots, Q_1\_vec_m)$; the second text $Q_2$ yields n second participles after word segmentation, and the n second participles are converted into n word vectors $(Q_2\_vec_1, Q_2\_vec_2, \ldots, Q_2\_vec_n)$.
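The text does not spell out how a participle is converted into a vector from its word frequency. One possible reading, sketched below purely as an assumption, maps each distinct token to a vocabulary-sized vector whose only non-zero entry is the token's term frequency in its own text. The function name `tf_word_vectors` and the shared `vocabulary` list (assumed to contain every token) are hypothetical; a pretrained embedding model could be substituted without changing the later steps.

```python
from collections import Counter

def tf_word_vectors(tokens, vocabulary):
    """Return one TF-scaled one-hot vector per distinct token, in first-seen order."""
    counts = Counter(tokens)
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for tok in dict.fromkeys(tokens):  # distinct tokens, original order preserved
        vec = [0.0] * len(vocabulary)
        # Term frequency: occurrences of the token divided by the text length.
        vec[index[tok]] = counts[tok] / len(tokens)
        vectors.append(vec)
    return vectors
```

Building the vocabulary over both texts keeps the first and second word vectors in the same space, which the later cosine comparison requires.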

S204, carrying out weighted summation on the first word vector based on the first correlation degree to obtain a first sentence vector, and carrying out weighted summation on the second word vector based on the second correlation degree to obtain a second sentence vector.

Specifically, the server performs weighted summation on the first word vector based on the first correlation degree to obtain a first sentence vector, and the server performs weighted summation on the second word vector based on the second correlation degree to obtain a second sentence vector. When sentence vectors of a text are generated, the relevancy of text participles in a preset standard text is added, the literal similarity of the text is compared, text semantic factors are considered, and the similarity calculation precision of the text is improved.

S205, calculating a text similarity between the first text and the second text based on the first sentence vector and the second sentence vector.

Specifically, the similarity between the first sentence vector and the second sentence vector is calculated based on a preset cosine similarity algorithm to obtain the text similarity between the first text and the second text. Cosine similarity uses the cosine of the angle between two vectors in the vector space as a measure of the difference between two individuals; that is, the angle between the two vectors is used to judge how similar they are. The closer the cosine value is to 1, the closer the angle is to 0 degrees and the more similar the two vectors are. The larger the angle, the less similar the two vectors are, with the maximum difference reached at an angle of 180 degrees, where the cosine value is -1.
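The cosine similarity described here can be sketched in a few lines. The function name `cosine_similarity` is illustrative, and returning 0.0 for a zero-length vector is an assumption made to avoid division by zero; the text does not address that case.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: an empty sentence vector matches nothing
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction give 1, orthogonal vectors give 0, and opposite vectors give -1, which matches the angle-based description.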

According to the text similarity calculation method, when sentence vectors of a text are generated, the relevance of text participles in a preset standard text is added, when the text sentence vectors are generated, weighted summation is carried out on all the word vectors of the text through the relevance, text semantic factors are considered while the literal similarity of the text is compared, the similarity between two sentence vectors is calculated through a preset cosine similarity calculation method, the similarity between two texts is obtained, and the text similarity calculation precision is improved.

The server calculates the relevance between each participle and the standard text through the preset BM25 algorithm, which is commonly used for search relevance scoring, that is, for calculating the relevance between a query and a set of texts.

Specifically, the server calculates a weight of a first participle in a standard text to obtain a first weight, calculates a weight of a second participle in the standard text to obtain a second weight, calculates a relevance score of the first participle and the standard text to obtain a first relevance score, calculates a relevance score of the second participle and the standard text to obtain a second relevance score, calculates a relevance of the first participle and the standard text based on the first weight, the first relevance score and a preset BM25 algorithm to obtain a first relevance, and calculates a relevance of the second participle and the standard text based on the second weight, the second relevance score and a preset BM25 algorithm to obtain a second relevance.

In the above embodiment, the relevance between the first participle and the standard text and the relevance between the second participle and the standard text are calculated through a preset BM25 algorithm, and when sentence vectors of the first text and the second text are generated, the word vectors are obtained by performing weighted summation on the respective relevance. When the text similarity is calculated, the text semantic factors are considered while the literal similarity of the text is compared, the similarity between two sentence vectors is calculated through a preset cosine similarity calculation method, the similarity of the two texts is obtained, and the precision of the text similarity calculation is improved.

Specifically, the server counts the number of texts in the standard text that contain the first participle to obtain a first text number, counts the number of texts in the standard text that contain the second participle to obtain a second text number, and counts the number of all texts in the standard text to obtain a total text number. Finally, the server calculates the weight of the first participle in the standard text based on the first text number and the total text number to obtain a first weight, and calculates the weight of the second participle in the standard text based on the second text number and the total text number to obtain a second weight. The IDF algorithm is used to determine the relevance of a word to a document, and the first weight is calculated based on the IDF algorithm as follows:

the second weight is calculated based on the IDF algorithm as follows:

wherein $n(q_{1i})$ is the first text number, $n(q_{2i})$ is the second text number, and $N$ is the total text number. It can be seen from the definition of IDF that, for a given set of texts, the more texts contain a certain participle, the lower the weight of that participle. That is, when many texts contain the participle, the participle discriminates poorly between texts, and it is therefore given little importance when relevance is determined.
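A form of the two weights that uses exactly the symbols defined here is the IDF term commonly used with BM25. It is given as an assumption, since the text names the IDF algorithm without fixing the smoothing constants:

```latex
W_{1i} = \log \frac{N - n(q_{1i}) + 0.5}{n(q_{1i}) + 0.5},
\qquad
W_{2i} = \log \frac{N - n(q_{2i}) + 0.5}{n(q_{2i}) + 0.5}
```

With this form the weight falls as $n(q)$ grows, in line with the behaviour described for IDF.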

In the above embodiment, the weight of each participle and text is calculated by the IDF algorithm, so that the relevance calculation of the participle and the standard text is performed subsequently.
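The weight calculation can be sketched as below, assuming the common BM25 IDF form log((N - n + 0.5) / (n + 0.5)), which the text does not state explicitly. The function name `idf_weight` is hypothetical, and the standard text is assumed to be a list of already tokenized sub-texts.

```python
import math

def idf_weight(token, sub_texts):
    """BM25-style IDF of a token over the tokenized sub-texts of the standard text."""
    total = len(sub_texts)                                      # N
    containing = sum(1 for doc in sub_texts if token in doc)    # n(q)
    return math.log((total - containing + 0.5) / (containing + 0.5))
```

Under this form a token found in more than half of the sub-texts receives a negative weight; some implementations clamp or shift the value to avoid that, which is a design choice outside what the text describes.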

Specifically, the server calculates the frequency of the first participle in the first text to obtain a first text frequency, and calculates the frequency of the second participle in the second text to obtain a second text frequency. The server traverses the first text to obtain the first text length, traverses the second text to obtain the second text length, and traverses the standard text to obtain the average text length. The server then calculates the relevance score of the first participle and the standard text based on the first text frequency, the first text length and the average text length to obtain a first relevance score, and calculates the relevance score of the second participle and the standard text based on the second text frequency, the second text length and the average text length to obtain a second relevance score. The first relevance score is calculated according to the following formula:

the second correlation score calculation formula is as follows:

wherein $f_{1i}$ is the frequency of occurrence of $q_{1i}$ in the standard text, $f_{2i}$ is the frequency of occurrence of $q_{2i}$ in the standard text, $qf_{1i}$ is the frequency of occurrence of $q_{1i}$ in the first text, $qf_{2i}$ is the frequency of occurrence of $q_{2i}$ in the second text, $k_1$ and $k_2$ are adjustment factors, usually set empirically, typically with $k_1 = k_2$, and $K_1$ and $K_2$ are text length parameters. $K_1$ is calculated by the following formula:

$K_2$ is calculated by the following formula:

wherein $dl_1$ is the first text length, $dl_2$ is the second text length, $avgdl$ is the average text length, and $b$ is an adjustment factor, usually set empirically, typically $b = 0.75$. It can be seen from the definitions of $K_1$ and $K_2$ that the parameter $b$ adjusts how strongly the text length influences the relevance. The larger $b$ is, the greater the influence of the text length on the relevance score, and vice versa. The longer the text is relative to the average length, the larger $K_1$ and $K_2$ become and the smaller the relevance score is.
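The standard BM25 expressions that use the symbols defined in this passage are given below as an assumed reconstruction. Note that $K_1$ and $K_2$ denote the length parameters of the first and second text, while $k_1$ and $k_2$ are the adjustment factors.

```latex
\begin{aligned}
R(q_{1i}, d) &= \frac{f_{1i}\,(k_1 + 1)}{f_{1i} + K_1} \cdot \frac{qf_{1i}\,(k_2 + 1)}{qf_{1i} + k_2},
&\quad K_1 &= k_1 \left(1 - b + b \cdot \frac{dl_1}{avgdl}\right) \\
R(q_{2i}, d) &= \frac{f_{2i}\,(k_1 + 1)}{f_{2i} + K_2} \cdot \frac{qf_{2i}\,(k_2 + 1)}{qf_{2i} + k_2},
&\quad K_2 &= k_1 \left(1 - b + b \cdot \frac{dl_2}{avgdl}\right)
\end{aligned}
```

In this form a larger $dl / avgdl$ increases $K$ and so lowers $R$, which agrees with the stated effect of text length.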

In the above embodiment, a relevance score is calculated for each participle from its frequency of occurrence in the text and from the counted text lengths, and this relevance score is then used to calculate the relevance of that participle.
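
The relevance score calculation can be sketched as follows. This is a minimal illustration that assumes the standard BM25 form of the score. The function names and the default values of `k1` and `k2` are hypothetical; only the typical `b = 0.75` comes from the text.

```python
def length_parameter(k1: float, b: float, dl: float, avgdl: float) -> float:
    """Text length parameter K = k1 * (1 - b + b * dl / avgdl)."""
    return k1 * (1.0 - b + b * dl / avgdl)


def relevance_score(f: float, qf: float, dl: float, avgdl: float,
                    k1: float = 2.0, k2: float = 2.0, b: float = 0.75) -> float:
    """BM25-style relevance score R(q, d) of one participle.

    f     -- frequency of the participle in the standard text
    qf    -- frequency of the participle in its own text
    dl    -- length of that text
    avgdl -- average text length of the standard text
    k1, k2 -- adjustment factors (illustrative defaults, an assumption)
    b     -- length adjustment factor (0.75 is the typical value named above)
    """
    K = length_parameter(k1, b, dl, avgdl)
    return (f * (k1 + 1.0) / (f + K)) * (qf * (k2 + 1.0) / (qf + k2))


# Same frequencies, different text lengths: the longer text gets the larger K.
short_score = relevance_score(f=3, qf=1, dl=5, avgdl=10)
long_score = relevance_score(f=3, qf=1, dl=20, avgdl=10)
print(short_score, long_score)
```

With equal frequencies, the text that is longer than average gets the larger length parameter and therefore the lower score, matching the behaviour described for `b`.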

The first relevance and the second relevance are then calculated according to the following formulas:

$\mathrm{Score}(q_{1i}, d) = W_{1i} \cdot R(q_{1i}, d)$

$\mathrm{Score}(q_{2i}, d) = W_{2i} \cdot R(q_{2i}, d)$

wherein $\mathrm{Score}(q_{1i}, d)$ is the first relevance, $\mathrm{Score}(q_{2i}, d)$ is the second relevance, $q_{1i}$ is the first participle, $q_{2i}$ is the second participle, $d$ is the standard text, $W_{1i}$ is the first weight, $W_{2i}$ is the second weight, $R(q_{1i}, d)$ is the first relevance score, and $R(q_{2i}, d)$ is the second relevance score.

In this embodiment, the relevance between each participle and the standard text is calculated by the above formulas. Because the relevance of the participles in the preset standard text is taken into account, text semantic factors are considered in addition to the literal similarity of the texts, which improves the accuracy of the text similarity calculation.

In this embodiment, the word vectors are weighted and summed according to the relevance to obtain the sentence vectors. Because the relevance of the participles in the preset standard text is incorporated when the sentence vectors are generated, text semantic factors are considered in addition to the literal similarity of the texts, which improves the accuracy of the text similarity calculation.
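
The weighted summation can be sketched as below. This is an illustrative reading that assumes a plain weighted sum, with the sentence vector equal to the sum of each word vector multiplied by its relevance. The function name `sentence_vector` and the sample numbers are hypothetical.

```python
from typing import List, Sequence


def sentence_vector(word_vectors: Sequence[Sequence[float]],
                    relevances: Sequence[float]) -> List[float]:
    """Relevance-weighted sum of word vectors: sum_i relevance_i * vector_i."""
    if len(word_vectors) != len(relevances):
        raise ValueError("need exactly one relevance per word vector")
    if not word_vectors:
        return []
    dim = len(word_vectors[0])
    total = [0.0] * dim
    for vector, weight in zip(word_vectors, relevances):
        if len(vector) != dim:
            raise ValueError("all word vectors must share one dimension")
        for j, value in enumerate(vector):
            total[j] += weight * value
    return total


vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relevances = [0.5, 2.0, 1.0]
print(sentence_vector(vectors, relevances))  # [1.5, 3.0]
```

A participle with a high relevance in the standard text pulls the sentence vector toward its own word vector, while a participle with a relevance near zero contributes almost nothing.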

Specifically, the server counts the word frequency of the first participle to obtain a first word frequency, counts the word frequency of the second participle to obtain a second word frequency, performs vector conversion on the first participle based on the first word frequency to obtain a first word vector, and performs vector conversion on the second participle based on the second word frequency to obtain a second word vector.
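
The text does not spell out how the word frequency enters the word vector. One possible reading, shown purely as an assumption, places each participle's count at that participle's position in a shared vocabulary. The function name and the sample words are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def frequency_word_vectors(participles: List[str],
                           vocabulary: List[str]) -> Dict[str, List[float]]:
    """Map each distinct participle to a vector that holds its word frequency
    at the participle's own vocabulary index and zeros elsewhere."""
    counts = Counter(participles)
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors: Dict[str, List[float]] = {}
    for word, count in counts.items():
        if word not in index:
            continue  # assumption: participles outside the vocabulary are skipped
        vector = [0.0] * len(vocabulary)
        vector[index[word]] = float(count)
        vectors[word] = vector
    return vectors


first_participles = ["loan", "rate", "loan"]
vocabulary = ["loan", "rate", "term"]
print(frequency_word_vectors(first_participles, vocabulary))
```

Other conversions, such as pretrained embeddings scaled by frequency, would fit the wording equally well.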

In the above embodiment, the text similarity between the first text and the second text is finally obtained with a cosine similarity algorithm. Cosine similarity presupposes that each text is represented as a vector: word frequencies are counted, the participles are converted into word vectors according to those frequencies, and the word vectors are then weighted and summed according to the relevance to obtain the sentence vector of each text.
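
A minimal sketch of the cosine similarity step is given below. Returning 0.0 for an all-zero sentence vector is an assumption of this sketch, not something the text specifies.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two sentence vectors of equal dimension."""
    if len(a) != len(b):
        raise ValueError("sentence vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: an empty or all-zero vector matches nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.5, 3.0], [3.0, 6.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Because the measure depends only on direction, scaling every relevance by the same positive constant leaves the resulting text similarity unchanged.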

The application discloses a text similarity calculation method and belongs to the technical field of artificial intelligence. The method performs word segmentation on the two texts whose similarity is to be calculated to obtain the participles of each text, calculates the relevance of the participles of each text in a preset standard text, and converts the participles of each text into corresponding word vectors. It then performs weighted summation on the word vectors of each text based on the relevance of its participles in the standard text to obtain the sentence vector of each text, and calculates the similarity of the two sentence vectors with a cosine similarity algorithm to obtain the similarity of the two texts. Because the relevance of the participles in the preset standard text is incorporated when the sentence vectors are generated, text semantic factors are considered in addition to the literal similarity of the texts, which improves the accuracy of the text similarity calculation.

It is emphasized that, in order to further ensure the privacy and security of the first text and the second text, the first text and the second text may also be stored in a node of a blockchain.

The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked to one another by cryptographic methods, where each data block contains information on a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
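
The idea of data blocks linked by a cryptographic method can be illustrated with a toy hash chain. This is only a sketch with hypothetical names (`make_block`, `chain_is_valid`). It has no consensus mechanism or network layer, and it is not the storage scheme of the application.

```python
import hashlib
import json
from typing import Dict, List


def make_block(texts: List[str], previous_hash: str) -> Dict[str, object]:
    """Create a block whose hash covers its texts and the previous block's hash."""
    payload = json.dumps({"texts": texts, "previous_hash": previous_hash},
                         sort_keys=True, ensure_ascii=False)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"texts": texts, "previous_hash": previous_hash, "hash": digest}


def chain_is_valid(chain: List[Dict[str, object]]) -> bool:
    """Check every block's own hash and its link to the preceding block."""
    for i, block in enumerate(chain):
        expected = make_block(block["texts"], block["previous_hash"])["hash"]
        if block["hash"] != expected:
            return False
        if i > 0 and block["previous_hash"] != chain[i - 1]["hash"]:
            return False
    return True


genesis = make_block(["first text"], previous_hash="0" * 64)
second = make_block(["second text"], previous_hash=genesis["hash"])
chain = [genesis, second]
print(chain_is_valid(chain))
```

Altering a stored text changes the recomputed hash of its block, so the check fails, which is the validity property the paragraph above refers to.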

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by computer readable instructions that direct the relevant hardware. The instructions can be stored in a computer readable storage medium and, when executed, can include the processes of the method embodiments described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk or a Read-Only Memory (ROM), or it may be a Random Access Memory (RAM).

It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of an apparatus 300 for calculating similarity of texts. This apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.

As shown in fig. 3, the apparatus 300 for calculating similarity of texts according to the present embodiment includes:

the word segmentation processing module 301 is configured to obtain a first text and a second text, and perform word segmentation processing on the first text and the second text respectively to obtain a first word segmentation and a second word segmentation;

the relevancy calculation module 302 is configured to obtain a preset standard text, calculate a relevancy between the first participle and the standard text to obtain a first relevancy, and calculate a relevancy between the second participle and the standard text to obtain a second relevancy;

the vector conversion module 303 is configured to perform vector conversion on the first participle and the second participle respectively to obtain a first word vector and a second word vector;

a weighted summation module 304, configured to perform weighted summation on the first word vector based on the first correlation degree to obtain a first sentence vector, and perform weighted summation on the second word vector based on the second correlation degree to obtain a second sentence vector;

a similarity calculation module 305, configured to calculate a text similarity between the first text and the second text based on the first sentence vector and the second sentence vector.
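
The way the five modules fit together can be sketched end to end as follows. Everything here is illustrative. Whitespace splitting stands in for word segmentation, the `relevance` and `vectors` dictionaries are hypothetical precomputed outputs of modules 302 and 303, and one shared relevance table is used for both texts as a simplification.

```python
import math
from typing import Dict, List


def segment(text: str) -> List[str]:
    """Module 301 stand-in: whitespace segmentation (an assumption)."""
    return text.lower().split()


def sentence_vector(words: List[str], relevance: Dict[str, float],
                    vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Modules 303 and 304: look up word vectors and sum them weighted by relevance."""
    total = [0.0] * dim
    for word in words:
        weight = relevance.get(word, 0.0)        # unknown words contribute nothing
        vector = vectors.get(word, [0.0] * dim)
        for j in range(dim):
            total[j] += weight * vector[j]
    return total


def cosine(a: List[float], b: List[float]) -> float:
    """Module 305: cosine similarity, 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def text_similarity(first_text: str, second_text: str,
                    relevance: Dict[str, float],
                    vectors: Dict[str, List[float]], dim: int) -> float:
    """Run segmentation, weighted summation and similarity for two texts."""
    first = sentence_vector(segment(first_text), relevance, vectors, dim)
    second = sentence_vector(segment(second_text), relevance, vectors, dim)
    return cosine(first, second)


# Hypothetical inputs standing in for the outputs of modules 302 and 303.
relevance = {"loan": 2.0, "rate": 1.0, "term": 0.5}
vectors = {"loan": [1.0, 0.0], "rate": [0.0, 1.0], "term": [1.0, 1.0]}
print(text_similarity("loan rate", "loan term", relevance, vectors, dim=2))
```

Keeping each module as a separate function mirrors the apparatus structure, so the relevance calculation or the vector conversion can be replaced without touching the similarity step.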

Further, the relevancy calculation module 302 specifically includes:

the weight calculation unit is used for calculating the weight of the first participle in the standard text to obtain a first weight, and calculating the weight of the second participle in the standard text to obtain a second weight;

the correlation score calculating unit is used for calculating the correlation score of the first participle and the standard text to obtain a first correlation score, and calculating the correlation score of the second participle and the standard text to obtain a second correlation score;

and the relevancy calculation unit is used for calculating the relevancy of the first participle and the standard text based on the first weight, the first relevancy score and a preset BM25 algorithm to obtain a first relevancy, and calculating the relevancy of the second participle and the standard text based on the second weight, the second relevancy score and a preset BM25 algorithm to obtain a second relevancy.

Further, the weight calculation unit specifically includes:

a word number counting subunit, configured to count the number of texts in the standard text that include the first participle to obtain a first text number, and count the number of texts in the standard text that include the second participle to obtain a second text number;

the text total counting subunit is used for counting the number of all the texts in the standard text to obtain the text total;

and the weight calculating subunit is used for calculating the weight of the first participle in the standard text based on the first text number and the text total number to obtain a first weight, and calculating the weight of the second participle in the standard text based on the second text number and the text total number to obtain a second weight.
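
The three subunits can be sketched as below. The weight formula is not given in this part of the text, so the sketch assumes the inverse document frequency commonly paired with BM25. The function name and the sample standard text are hypothetical.

```python
import math
from typing import List


def idf_weight(participle: str, standard_texts: List[List[str]]) -> float:
    """Weight of a participle from the text number and the text total.

    Assumed formula: log((N - n + 0.5) / (n + 0.5)), where N is the total
    number of texts and n is the number of texts containing the participle.
    """
    text_total = len(standard_texts)
    text_number = sum(1 for text in standard_texts if participle in text)
    return math.log((text_total - text_number + 0.5) / (text_number + 0.5))


# Each inner list is one already segmented text of the standard text.
standard_texts = [
    ["loan", "rate"],
    ["loan", "term"],
    ["credit", "card"],
    ["card", "fee"],
]
print(idf_weight("rate", standard_texts))
print(idf_weight("loan", standard_texts))
```

Under this assumed formula a participle that appears in few texts receives a high weight, while one that appears in more than half of the texts receives a negative weight. Some implementations clamp or shift the value to avoid that.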

Further, the correlation score calculating unit specifically includes:

the text frequency calculating subunit is configured to calculate a frequency of the first participle in the first text to obtain a first text frequency, and calculate a frequency of the second participle in the second text to obtain a second text frequency;

the text length obtaining subunit is configured to traverse the first text to obtain a first text length, and traverse the second text to obtain a second text length;

the text average length calculation subunit is used for traversing the standard text to obtain the text average length;

and the relevance score calculating subunit is used for calculating the relevance score of the first participle and the standard text based on the first text frequency, the first text length and the text average length to obtain a first relevance score, and calculating the relevance score of the second participle and the standard text based on the second text frequency, the second text length and the text average length to obtain a second relevance score.

Further, the relevancy calculation unit specifically includes:

a first relevance calculating subunit, configured to calculate a relevance of the first participle to the standard text according to the following formula:

$\mathrm{Score}(q_{1i}, d) = W_{1i} \cdot R(q_{1i}, d)$

a second relevance calculating subunit, configured to calculate a relevance of the second participle with the standard text according to the following formula:

$\mathrm{Score}(q_{2i}, d) = W_{2i} \cdot R(q_{2i}, d)$

wherein $\mathrm{Score}(q_{1i}, d)$ is the first relevance, $\mathrm{Score}(q_{2i}, d)$ is the second relevance, $q_{1i}$ is the first participle, $q_{2i}$ is the second participle, $d$ is the standard text, $W_{1i}$ is the first weight, $W_{2i}$ is the second weight, $R(q_{1i}, d)$ is the first relevance score, and $R(q_{2i}, d)$ is the second relevance score.

Further, the weighted summation module 304 specifically includes:

a first weighted sum unit for weighted sum of the first word vectors according to the following formula:

a second weighted sum unit for weighted sum of the second word vectors according to the following formula:

Further, the vector conversion module 303 specifically includes:

the word frequency calculating unit is used for calculating the word frequency of the first participle to obtain a first word frequency and calculating the word frequency of the second participle to obtain a second word frequency;

and the vector conversion unit is used for carrying out vector conversion on the first participle based on the first word frequency to obtain a first word vector and carrying out vector conversion on the second participle based on the second word frequency to obtain a second word vector.

The application discloses an apparatus 300 for calculating similarity of texts and belongs to the technical field of artificial intelligence. The apparatus performs word segmentation on the two texts whose similarity is to be calculated to obtain the participles of each text, calculates the relevance of the participles of each text in a preset standard text, and converts the participles of each text into corresponding word vectors. It then performs weighted summation on the word vectors of each text based on the relevance of its participles in the standard text to obtain the sentence vector of each text, and calculates the similarity of the two sentence vectors with a cosine similarity algorithm to obtain the similarity of the two texts. Because the relevance of the participles in the preset standard text is incorporated when the sentence vectors are generated, text semantic factors are considered in addition to the literal similarity of the texts, which improves the accuracy of the text similarity calculation.

In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.

The computer device 4 comprises a memory 41, a processor 42 and a network interface 43 that are communicatively connected to each other via a system bus. It should be noted that only a computer device 4 with components 41-43 is shown; not all of the illustrated components are required, and more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.

The computer device can be a desktop computer, a notebook computer, a palmtop computer, a cloud server or another computing device. The computer device can interact with a user by means of a keyboard, a mouse, a remote control, a touch panel or a voice control device.

The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed in the computer device 4 and various application software, such as computer readable instructions of a text similarity calculation method. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 42 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute the computer readable instructions stored in the memory 41 or to process data, for example to execute the computer readable instructions of the text similarity calculation method.

The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.

The application discloses a computer device and belongs to the technical field of artificial intelligence. The device performs word segmentation on the two texts whose similarity is to be calculated to obtain the participles of each text, calculates the relevance of the participles of each text in a preset standard text, and converts the participles of each text into corresponding word vectors. It then performs weighted summation on the word vectors of each text based on the relevance of its participles in the standard text to obtain the sentence vector of each text, and calculates the similarity of the two sentence vectors with a cosine similarity algorithm to obtain the similarity of the two texts. Because the relevance of the participles in the preset standard text is incorporated when the sentence vectors are generated, text semantic factors are considered in addition to the literal similarity of the texts, which improves the accuracy of the text similarity calculation.

The present application further provides another embodiment, which is to provide a computer-readable storage medium storing computer-readable instructions executable by at least one processor to cause the at least one processor to perform the steps of the method of similarity calculation for text as described above.

The application discloses a storage medium and belongs to the technical field of artificial intelligence. The stored instructions perform word segmentation on the two texts whose similarity is to be calculated to obtain the participles of each text, calculate the relevance of the participles of each text in a preset standard text, and convert the participles of each text into corresponding word vectors. They then perform weighted summation on the word vectors of each text based on the relevance of its participles in the standard text to obtain the sentence vector of each text, and finally calculate the similarity of the two sentence vectors with a cosine similarity algorithm to obtain the similarity of the two texts. Because the relevance of the participles in the preset standard text is incorporated when the sentence vectors are generated, text semantic factors are considered in addition to the literal similarity of the texts, which improves the accuracy of the text similarity calculation.

Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.

It is to be understood that the above-described embodiments are only some, not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application can be embodied in many different forms; the embodiments are provided so that the disclosure of the application will be understood thoroughly. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to one skilled in the art that the technical solutions described in those embodiments may still be modified, or some of their features replaced with equivalents. All equivalent structures made by using the contents of the specification and the drawings of the present application, whether applied directly or indirectly in other related technical fields, likewise fall within the protection scope of the present application.

Claims

1. A method for calculating similarity of texts is characterized by comprising the following steps:

acquiring a first text and a second text, and performing word segmentation processing on the first text and the second text respectively to obtain a first word segmentation and a second word segmentation, wherein the first text is a text input by a user;

acquiring a preset standard text, calculating the correlation degree of the first participle and the standard text to obtain a first correlation degree, and calculating the correlation degree of the second participle and the standard text to obtain a second correlation degree, wherein the second text is any one text in the standard text;

calculating a text similarity between the first text and the second text based on the first sentence vector and the second sentence vector;

the step of obtaining a preset standard text, calculating the correlation between the first participle and the standard text to obtain a first correlation, and calculating the correlation between the second participle and the standard text to obtain a second correlation specifically includes:

2. The method of calculating similarity of texts according to claim 1, wherein the steps of calculating the weight of the first participle in the standard text to obtain a first weight, and calculating the weight of the second participle in the standard text to obtain a second weight specifically include:

3. The method for calculating similarity of texts according to claim 1, wherein the steps of calculating the relevance score of the first participle to the standard text to obtain a first relevance score, and calculating the relevance score of the second participle to the standard text to obtain a second relevance score specifically include:

traversing the standard text to obtain the average length of the text;

4. The method for calculating the similarity of texts according to claim 1, wherein the steps of calculating the relevance of the first participle to the standard text based on the first weight, the first relevance score and a preset BM25 algorithm to obtain a first relevance, and calculating the relevance of the second participle to the standard text based on the second weight, the second relevance score and a preset BM25 algorithm to obtain a second relevance include:

calculating the degree of correlation of the second participle and the standard text according to the following formula:

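$\mathrm{Score}(q_{2i}, d) = W_{2i} \cdot R(q_{2i}, d)$

wherein $\mathrm{Score}(q_{1i}, d)$ is the first degree of correlation, $\mathrm{Score}(q_{2i}, d)$ is the second degree of correlation, $q_{1i}$ is the first participle, $q_{2i}$ is the second participle, $d$ is the standard text, $W_{1i}$ is the first weight, $W_{2i}$ is the second weight, $R(q_{1i}, d)$ is the first correlation score, and $R(q_{2i}, d)$ is the second correlation score.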

5. The method of calculating similarity of texts according to claim 4, wherein the step of performing weighted summation on the first word vector based on the first correlation to obtain a first sentence vector, and performing weighted summation on the second word vector based on the second correlation to obtain a second sentence vector specifically comprises:

wherein the symbols denote, in order, the first sentence vector, the second sentence vector, the first word vector and the second word vector; m is the number of the first word vectors, and n is the number of the second word vectors.

6. The method for calculating similarity of texts according to claim 5, wherein the step of performing vector transformation on the first participle and the second participle to obtain a first word vector and a second word vector comprises:

7. An apparatus for calculating similarity of texts, comprising:

the word segmentation processing module is used for acquiring a first text and a second text, and performing word segmentation processing on the first text and the second text respectively to obtain a first word segmentation and a second word segmentation, wherein the first text is a text input by a user;

the relevancy calculation module is used for acquiring a preset standard text, calculating the relevancy of the first participle and the standard text to obtain a first relevancy, and calculating the relevancy of the second participle and the standard text to obtain a second relevancy, wherein the second text is any one text in the standard text;

a similarity calculation module for calculating a text similarity between the first text and the second text based on the first sentence vector and the second sentence vector;

the correlation calculation module specifically includes:

8. A computer device comprising a memory and a processor, the memory having computer readable instructions stored therein, wherein the processor, when executing the computer readable instructions, performs the steps of the method for calculating similarity of texts as claimed in any one of claims 1 to 6.

9. A computer-readable storage medium, having computer-readable instructions stored thereon, which, when executed by a processor, implement the steps of a method of similarity calculation of text as claimed in any one of claims 1 to 6.