CN117076946A - Short text similarity determination method, device and terminal - Google Patents
- Publication number
- CN117076946A (Application No. CN202311055331.0A)
- Authority
- CN
- China
- Prior art keywords
- word
- array
- phrase
- words
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The embodiment of the invention discloses a method, a device and a terminal for determining the similarity of short texts. Feature templates of two short texts are extracted respectively by way of word-pair extraction, the extracted feature templates are counted using cluster merging to construct a feature template array and a template frequency array, and finally the similarity of the two short texts is calculated by a relationship similarity calculation model based on the feature template array and the template frequency array, so that the similarity of the two short texts is determined. Semantic enhancement is achieved by constructing the feature template array and the template frequency array, and the degree to which the extracted feature templates match the actual semantics and context of the short texts is improved, thereby improving the accuracy and efficiency of short text semantic similarity calculation.
Description
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, and a terminal for determining similarity of short text.
Background
With the development of artificial intelligence technology, the study of information similarity in particular has become one of the main techniques for classifying information and text. At present, text similarity recognition is mainly realized through natural language processing (NLP); for example, semantic retrieval is performed through natural language technology and whether two texts are similar is judged based on their semantics. However, current semantic retrieval schemes can only analyze the semantics of a word within the current paragraph and do not perform context association, so the semantics are inaccurate, which leads to errors in the similarity calculation.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a method, a device, a chip and a terminal for determining the similarity of short texts, which can improve the accuracy and efficiency of short text semantic similarity calculation.
In a first aspect, a method for determining similarity of short text is provided, including:
acquiring two short texts to be processed, and extracting entry files of each short text, wherein the entry files comprise a plurality of word pairs;
carrying out phrase extraction on the short text corresponding to the entry file to obtain a phrase array, wherein each phrase in the phrase array comprises at least one word of the word pair;
taking word pairs in the entry file as segmentation units, segmenting each phrase in the phrase array, and eliminating semantic-free words in the segmented phrases to obtain corresponding feature templates;
clustering and merging the feature templates corresponding to the two short texts, and counting the frequency of each feature template in the phrase array corresponding to the feature templates to generate a corresponding feature template array and a template frequency array;
and calculating the similarity of the two short texts by utilizing a pre-trained relationship similarity calculation model based on the feature template array and the template frequency array corresponding to the two short texts, wherein the relationship similarity calculation model is a similarity calculation model obtained by applying a semantic dictionary-based semantic relatedness algorithm to a statistics-based word-pair relationship similarity algorithm and then training.
Optionally, the obtaining two short texts to be processed, extracting an entry file of each short text, includes:
setting up a local Wikipedia mirror platform by using the MediaWiki engine software;
performing word sense recognition on the two short texts obtained from a known short text database by using a preset semantic recognition model, and screening out, based on the recognition results, the words whose semantics correspond to stem words and option words;
inputting the screened words into a local mirror platform for mirror image processing to obtain mirror image words of each word, wherein the mirror image words and the corresponding words are words with the same or similar semantic meaning;
and constructing word pairs by the words and the corresponding mirror image words to obtain an entry file.
Optionally, the phrase extracting, based on the term file, the corresponding short text to obtain a phrase array includes:
each word pair in the entry file is used as an index, and sentences which take the words of the word pair as a starting word and/or an ending word in the corresponding short text are searched;
judging whether the length of the sentence meets a length range threshold value or not, and constructing a phrase array for the sentence within the length range threshold value based on a judging result.
Optionally, the determining whether the length of the sentence meets the length range threshold, and constructing a phrase array for the sentence within the length range threshold based on the determination result includes:
judging whether the length of a sentence taking the first word in the word pair as an ending word does not exceed a first threshold range;
judging whether the length of a sentence taking the second word in the word pair as an ending word does not exceed a second threshold range;
extracting the sentences which do not exceed the first threshold range and/or the sentences which do not exceed the second threshold range as phrases, and constructing the phrase array of the corresponding short text.
Optionally, the step of using word pairs in the term file as a segmentation unit to segment each phrase in the phrase array and eliminating semantic-free words in the segmented phrases to obtain corresponding feature templates includes:
determining a target word pair in the entry file, and extracting target phrases corresponding to a first word and a second word of the target word pair in the phrase array;
marking the positions of a first word and a second word in the target phrase;
determining parts of speech of the first word and the second word, and determining stop words and connective words between the first word and the second word based on the parts of speech;
And eliminating the stop words in the target phrase to obtain the corresponding feature templates.
Optionally, the clustering and merging are performed on the feature templates corresponding to the two short texts, and the frequencies of the feature templates in the phrase arrays corresponding to the feature templates are counted, so as to generate corresponding feature template arrays and template frequency arrays, including:
counting the number of feature templates with the same word pair in each short text, and calculating the occurrence frequency of the corresponding feature templates based on the number;
identifying whether the same connecting word exists between the feature templates with the same word pair;
if the feature templates exist, combining the two corresponding feature templates, adding the corresponding frequencies to obtain the frequencies of the combined feature templates in the corresponding phrase arrays, and generating a corresponding feature template array and a template frequency array.
Optionally, the calculating the similarity of the two short texts based on the feature template arrays and the template frequency arrays corresponding to the two short texts by using a relationship similarity calculation model obtained by pre-training includes:
determining a connective word array corresponding to each word pair based on the feature template array corresponding to each short text;
based on the stem words and the option words in the word pairs, determining corresponding connecting words from the connecting word array, inputting the determined connecting words into two input ends of a relational similarity calculation model which is obtained through pre-training, and carrying out vector filling on the word pairs through the relational similarity calculation model to obtain corresponding high-order dense vectors;
Determining sharing weights based on the twin network in the relation similarity calculation model, and calculating feature weights of the main semantic information of the two short texts based on the sharing weights;
and calculating the distance between the high-order dense vectors corresponding to the two word pairs according to the characteristic weight to obtain the similarity of the two short texts.
In a second aspect, there is provided a similarity determining apparatus for short texts, including:
the first extraction module is used for obtaining two short texts to be processed and extracting entry files of the short texts, wherein the entry files comprise a plurality of word pairs;
the second extraction module is used for carrying out phrase extraction on the short text corresponding to the entry file to obtain a phrase array, wherein each phrase in the phrase array comprises at least one word of the word pair;
the rejecting module is used for dividing each phrase in the phrase array by taking the word pairs in the entry file as dividing units, and rejecting semantic-free words in the divided phrases to obtain corresponding feature templates;
the clustering module is used for carrying out cluster combination on the feature templates corresponding to the two short texts, counting the frequency of each feature template in the corresponding phrase array, and generating a corresponding feature template array and a template frequency array;
The calculating module is used for calculating the similarity of the two short texts by utilizing a pre-trained relationship similarity calculation model based on the feature template array and the template frequency array corresponding to the two short texts, wherein the relationship similarity calculation model is a similarity calculation model obtained by applying a semantic dictionary-based semantic relatedness algorithm to a statistics-based word-pair relationship similarity algorithm and then training.
Optionally, the first extraction module is specifically configured to:
setting up a local Wikipedia mirror platform by using the MediaWiki engine software;
performing word sense recognition on the two short texts obtained from a known short text database by using a preset semantic recognition model, and screening out, based on the recognition results, the words whose semantics correspond to stem words and option words;
inputting the screened words into a local mirror platform for mirror image processing to obtain mirror image words of each word, wherein the mirror image words and the corresponding words are words with the same or similar semantic meaning;
and constructing word pairs by the words and the corresponding mirror image words to obtain an entry file.
Optionally, the second extraction module is specifically configured to:
Each word pair in the entry file is used as an index, and sentences which take the words of the word pair as a starting word and/or an ending word in the corresponding short text are searched;
judging whether the length of the sentence meets a length range threshold value or not, and constructing a phrase array for the sentence within the length range threshold value based on a judging result.
Optionally, the second extraction module is specifically configured to:
judging whether the length of a sentence taking the first word in the word pair as an ending word does not exceed a first threshold range;
judging whether the length of a sentence taking the second word in the word pair as an ending word does not exceed a second threshold range;
extracting the sentences which do not exceed the first threshold range and/or the sentences which do not exceed the second threshold range as phrases, and constructing the phrase array of the corresponding short text.
Optionally, the rejecting module is specifically configured to:
determining a target word pair in the entry file, and extracting target phrases corresponding to a first word and a second word of the target word pair in the phrase array;
marking the positions of a first word and a second word in the target phrase;
determining parts of speech of the first word and the second word, and determining stop words and connective words between the first word and the second word based on the parts of speech;
And eliminating the stop words in the target phrase to obtain the corresponding feature templates.
Optionally, the clustering module is specifically configured to:
counting the number of feature templates with the same word pair in each short text, and calculating the occurrence frequency of the corresponding feature templates based on the number;
identifying whether the same connecting word exists between the feature templates with the same word pair;
if the feature templates exist, combining the two corresponding feature templates, adding the corresponding frequencies to obtain the frequencies of the combined feature templates in the corresponding phrase arrays, and generating a corresponding feature template array and a template frequency array.
Optionally, the computing module is specifically configured to:
determining a connective word array corresponding to each word pair based on the feature template array corresponding to each short text;
based on the stem words and the option words in the word pairs, determining corresponding connecting words from the connecting word array, inputting the determined connecting words into two input ends of a relational similarity calculation model which is obtained through pre-training, and carrying out vector filling on the word pairs through the relational similarity calculation model to obtain corresponding high-order dense vectors;
determining sharing weights based on the twin network in the relation similarity calculation model, and calculating feature weights of the main semantic information of the two short texts based on the sharing weights;
And calculating the distance between the high-order dense vectors corresponding to the two word pairs according to the characteristic weight to obtain the similarity of the two short texts.
In a third aspect, a chip is provided, comprising a first processor for calling and running a computer program from a first memory, such that a device on which the chip is mounted performs the steps of the short text similarity determination method as described above.
In a fourth aspect, a terminal is provided, comprising a second memory, a second processor and a computer program stored in said memory and executable on said processor, the second processor implementing the steps of the short text similarity determination method as described above when said computer program is executed.
According to the short text similarity determining method, device, chip and terminal, the feature templates of two short texts are extracted respectively by way of word-pair extraction, the extracted feature templates are counted using cluster merging to construct a feature template array and a template frequency array, and finally the similarity of the two short texts is calculated by the relationship similarity calculation model based on the feature template array and the template frequency array, so that the similarity of the two short texts is determined. Semantic enhancement is achieved by constructing the feature template array and the template frequency array, and the degree to which the extracted feature templates match the actual semantics and context of the short texts is improved, thereby improving the accuracy and efficiency of short text semantic similarity calculation.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a first flowchart of a short text similarity determining method according to an embodiment of the present invention;
FIG. 2 is a block flow diagram of a method for determining similarity of short text according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a method for determining similarity of short text according to an embodiment of the present invention;
FIG. 4 is a training flowchart of a relationship similarity calculation model according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a relational similarity calculation model according to an embodiment of the present invention;
fig. 6 is a basic structural block diagram of a short text similarity determining device according to an embodiment of the present invention;
fig. 7 is a basic structural block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present invention with reference to the accompanying drawings.
In some of the flows described in the specification and claims of the present invention and in the above figures, a plurality of operations appearing in a particular order are included, but it should be clearly understood that the operations may be performed in other than the order in which they appear herein or in parallel, the sequence numbers of the operations such as S101, S102, etc. are merely used to distinguish between the various operations, and the sequence numbers themselves do not represent any order of execution. In addition, the flows may include more or fewer operations, and the operations may be performed sequentially or in parallel. It should be noted that, the descriptions of "first" and "second" herein are used to distinguish different messages, devices, modules, etc., and do not represent a sequence, and are not limited to the "first" and the "second" being different types.
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by a person skilled in the art without any inventive effort, are intended to be within the scope of the present invention based on the embodiments of the present invention.
The embodiments of the present application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best result.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Referring specifically to fig. 1, fig. 1 is a basic flow diagram of a short text similarity determining method according to the present embodiment.
As shown in fig. 1, a short text similarity determining method includes:
s110, acquiring two short texts to be processed, and extracting entry files of the short texts, wherein the entry files comprise a plurality of word pairs.
In this embodiment, the short text may be extracted from a picture or a document. If the short text is obtained by taking a picture, the text in the picture is recognized by OCR, and the recognized text is input into a neural network for sentence extraction and segmentation, so as to obtain the corresponding short texts, where each short text is marked with a corresponding sentence mark.
Specifically, the document extracted by OCR is input into the neural network, and punctuation marks in the document are identified through the neural network, preferably punctuation marks consistent with a preset sentence-ending identifier; the content between two such punctuation marks is then marked, so that a complete sentence is obtained.
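As a simplified illustration of this sentence-marking step (and not the neural network used in the embodiment), the following Python sketch splits OCR output on a preset set of sentence-ending punctuation marks; the punctuation set and the sample text are assumptions made for the example.

```python
# Simplified stand-in for the sentence-extraction step: split OCR output on a
# preset set of sentence-ending punctuation marks. The punctuation set and the
# sample text are illustrative assumptions, not the patented neural network.
import re

SENTENCE_END = r"[.!?。！？]"   # preset sentence-ending identifiers (assumed)

def extract_sentences(ocr_text: str) -> list[str]:
    parts = re.split(f"({SENTENCE_END})", ocr_text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()  # content up to the mark
        if sentence:
            sentences.append(sentence)
    return sentences

print(extract_sentences("The ostrich is a large bird. It cannot fly!"))
# ['The ostrich is a large bird.', 'It cannot fly!']
```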
Further, a preset part-of-speech recognition model is used to carry out semantic recognition on the words or characters in each sentence, and the part of speech of each word or character is determined through its semantics, so as to obtain the entry file corresponding to each sentence.
S120, phrase extraction is carried out on the corresponding short text based on the entry file to obtain a phrase array, wherein each phrase in the phrase array comprises at least one word of the word pair.
In this embodiment, the word pairs in the entry file are extracted and used to extract phrases from the corresponding short text. Specifically, when extracting a phrase, the word pair is taken as a base point, and contiguous segments of a certain length before and after the word pair are intercepted; semantic reasoning is then performed on the word pair and the intercepted segments by using a semantic analysis network, and whether the inferred semantics are complete is judged. If the semantics are complete, the intercepted segments are combined with the word pair to obtain the phrase of that word pair. Finally, all the phrases extracted from the short text are sorted into the phrase array according to the order of the word pairs in the entry file.
S130, using word pairs in the entry file as segmentation units, segmenting each phrase in the phrase array, and eliminating semantic-free words in the segmented phrases to obtain corresponding feature templates.
In this step, each phrase in the phrase array is segmented by a word segmentation algorithm using the word pairs in the entry file. Specifically, the word pairs in the entry file are taken as indexes, the words in the phrase that are the same as the words of the word pair are searched for and marked for segmentation, and the remaining words or characters in each phrase other than the words of the word pair are extracted. The remaining words or characters are analyzed by the semantic recognition model to determine their semantics, and it is judged whether they are words with parts of speech such as stop words or quantifiers, that is, semantic-free words; if so, the corresponding words are highlighted in the phrase and removed. Finally, the pruned phrases are vectorized to obtain the corresponding feature templates.
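To make this segmentation-and-pruning step concrete, the following is a minimal Python sketch: the two words of the pair are replaced by the placeholders X and Y, and a small illustrative stop-word list stands in for the semantic-free-word judgment of the semantic recognition model; both are assumptions made for the example.

```python
# A minimal sketch of step S130: segment each phrase on the word pair and drop
# semantic-free words to form a feature template. The stop-word list and the
# placeholder tokens "X"/"Y" are illustrative assumptions, not the patented model.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "are", "was", "were", "very"}

def feature_template(phrase: str, word_a: str, word_b: str) -> list[str]:
    tokens = phrase.lower().split()
    template = []
    for tok in tokens:
        if tok == word_a.lower():
            template.append("X")          # mark the position of the first word
        elif tok == word_b.lower():
            template.append("Y")          # mark the position of the second word
        elif tok not in STOP_WORDS:
            template.append(tok)          # keep content/connective words only
    return template

print(feature_template("the ostrich is a large bird", "ostrich", "bird"))
# ['X', 'large', 'Y']
```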
S140, clustering and merging the feature templates corresponding to the two short texts, and counting the frequency of each feature template in the corresponding phrase array to generate a corresponding feature template array and a template frequency array.
In this embodiment, clustering the feature templates corresponding to each short text specifically includes: counting the number of identical words or word pairs between feature templates; when the ratio of that number reaches a preset ratio value, determining that the two feature templates are the same and merging them; when merging, taking out the same words or word pairs and then semantically combining the remaining words by using the semantic recognition model to obtain a new feature template; and finally calculating the frequency of the new feature template based on the counts before merging, so as to generate the new feature template array and the template frequency array.
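A minimal sketch of the counting part of this step is shown below: identical feature templates (represented here as token tuples) are merged, and their occurrence counts form the template frequency array. The sample templates are made up for illustration.

```python
# Illustrative sketch of step S140: merge identical feature templates and count
# how often each one occurs. Templates are represented as tuples of tokens; the
# data below is made up for demonstration.
from collections import Counter

templates = [
    ("X", "large", "Y"),
    ("X", "large", "Y"),
    ("X", "flightless", "Y"),
]

counts = Counter(templates)                            # cluster-merge identical templates
pattern_array = list(counts.keys())                    # feature template array
frequency_array = [counts[p] for p in pattern_array]   # template frequency array

print(pattern_array)    # [('X', 'large', 'Y'), ('X', 'flightless', 'Y')]
print(frequency_array)  # [2, 1]
```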
S150, calculating the similarity of the two short texts based on the feature template array and the template frequency array corresponding to the two short texts by using a relationship similarity calculation model obtained through pre-training.
In this embodiment, the relationship similarity calculation model is a similarity calculation model obtained by applying a semantic dictionary-based semantic relatedness algorithm to a statistics-based word-pair relationship similarity algorithm and then training.
In other words, the relationship similarity calculation model is a model obtained by applying the semantic dictionary-based semantic relatedness algorithm to the statistics-based word-pair relationship similarity algorithm, combining the two algorithms in a certain form into a semantic-enhancement-based word-pair relationship similarity algorithm, and then training that algorithm on a collected training data set.
By implementing the above method, the similarity of the two short texts is calculated based on the feature template array and the template frequency array, so that the similarity of the two short texts is determined. Semantic enhancement is achieved by constructing the feature template array and the template frequency array, and the degree to which the extracted feature templates match the actual semantics and context of the short texts is improved, thereby improving the accuracy and efficiency of short text semantic similarity calculation.
As shown in fig. 2 and 3, another example of the short text similarity determining method provided by the application uses an algorithm framework comprising a preprocessing module, a phrase searching module, a feature template extraction and connective word module, a relationship similarity calculation module and the like. The method proposed based on the algorithm framework of fig. 2 specifically includes the following steps:
s310, acquiring two short texts to be processed, and extracting an entry file of each short text, wherein the entry file comprises a plurality of word pairs.
In this embodiment, for the above-mentioned process of extracting the term file, it may be specifically implemented in the following manner:
setting up a local Wikipedia mirror platform by using the MediaWiki engine software;
performing word sense recognition on the two short texts obtained from a known short text database by using a preset semantic recognition model, and screening out, based on the recognition results, the words whose semantics correspond to stem words and option words;
Inputting the screened words into a local mirror platform for mirror image processing to obtain mirror image words of each word, wherein the mirror image words and the corresponding words are words with the same or similar semantic meaning;
and constructing word pairs by the words and the corresponding mirror image words to obtain an entry file.
Specifically, after the words in the short text are extracted, semantic recognition and mirror retrieval are carried out on the words by using the local mirror platform and the semantic recognition model so as to retrieve more synonyms or near-synonyms; word pairs close to the semantics of the original word are then composed from these synonyms or near-synonyms, and finally the entry file is formed. The specific steps are as follows:
firstly, downloading the data package;
secondly, setting up a local Wikipedia mirror by using the MediaWiki engine software, importing the downloaded data, and finishing the initialization settings;
thirdly, collecting the related entries: searching in the corpus search box with word pairs as units (comprising one stem word pair and five option word pairs) as keywords, and saving the returned entries in XML document format to a specified location, so as to obtain the entry file.
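The following sketch illustrates this third step under the assumption that the local MediaWiki mirror exposes the standard MediaWiki search API at a hypothetical local endpoint; the endpoint URL and the XML layout of the saved entry file are assumptions made for illustration.

```python
# A minimal sketch of the entry-file step, assuming a local MediaWiki mirror is
# reachable at http://localhost/w/api.php. The endpoint URL and the XML layout
# are assumptions for illustration; the embodiment stores returned entries as XML.
import requests
import xml.etree.ElementTree as ET

API = "http://localhost/w/api.php"   # hypothetical local mirror endpoint

def save_entries(word_pair: tuple[str, str], path: str) -> None:
    query = " ".join(word_pair)
    resp = requests.get(API, params={
        "action": "query", "list": "search",
        "srsearch": query, "format": "json",
    }, timeout=10)
    resp.raise_for_status()
    root = ET.Element("entries", attrib={"pair": query})
    for hit in resp.json()["query"]["search"]:
        item = ET.SubElement(root, "entry", attrib={"title": hit["title"]})
        item.text = hit.get("snippet", "")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)

save_entries(("ostrich", "bird"), "ostrich_bird.xml")
```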
S320, searching, with the word pairs in the entry file as indexes, the sentences in the corresponding short text which take a word of the word pair as a starting word and/or an ending word.
S330, judging whether the length of the sentence meets the length range threshold, and constructing a phrase array for the sentence within the length range threshold based on the judging result.
The method is realized by the following steps:
judging whether the length of a sentence taking the first word in the word pair as an ending word does not exceed a first threshold range;
judging whether the length of a sentence taking the second word in the word pair as an ending word does not exceed a second threshold range;
extracting the sentences which do not exceed the first threshold range and/or the sentences which do not exceed the second threshold range as phrases, and constructing the phrase array of the corresponding short text.
In practical application, the phrases are obtained by searching sentences based on the word pairs in the entry file. The specific flow is as follows:
firstly, taking the stored entry file as input and finding the phrases that comprise the corresponding word pairs, namely inputting the word pairs and the corresponding entry file generated by the first module;
second, finding the phrases that contain the input word pair (A:B) and take word A as the ending word; if the phrase length does not exceed max_phrase and is not less than 2, saving the phrase into the array list;
third, finding the phrases that contain the input word pair and take word B as the ending word; if the phrase length does not exceed max_phrase and is not less than 2, saving the phrase into the array list.
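A rough Python sketch of this phrase-search flow is given below; MAX_PHRASE corresponds to max_phrase above, and the whitespace tokenisation and sample text are assumptions made for illustration.

```python
# Rough sketch of the phrase-search module (S320/S330): collect token spans that
# contain both words of a pair and end with one of them, subject to a length
# window. MAX_PHRASE, MIN_PHRASE and whitespace tokenisation are assumptions.
MAX_PHRASE = 8
MIN_PHRASE = 2

def find_phrases(text: str, word_a: str, word_b: str) -> list[str]:
    tokens = text.lower().split()
    phrases = []
    for start_word, end_word in ((word_a, word_b), (word_b, word_a)):
        starts = [i for i, t in enumerate(tokens) if t == start_word]
        ends = [j for j, t in enumerate(tokens) if t == end_word]
        for i in starts:
            for j in ends:
                length = j - i + 1
                if MIN_PHRASE <= length <= MAX_PHRASE:
                    phrases.append(" ".join(tokens[i:j + 1]))  # candidate phrase
    return phrases

text = "the ostrich is a large flightless bird native to africa"
print(find_phrases(text, "ostrich", "bird"))
# ['ostrich is a large flightless bird']
```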
S340, using word pairs in the entry file as segmentation units, segmenting each phrase in the phrase array, and eliminating semantic-free words in the segmented phrases to obtain corresponding feature templates.
The method is realized by the following steps:
determining a target word pair in the entry file, and extracting target phrases corresponding to a first word and a second word of the target word pair in the phrase array;
marking the positions of a first word and a second word in the target phrase;
determining parts of speech of the first word and the second word, and determining stop words and connective words between the first word and the second word based on the parts of speech;
and eliminating the stop words in the target phrase to obtain the corresponding feature templates.
In practical application, the extracted original phrases are processed to form feature templates capable of expressing the semantic relation, which is specifically realized as follows:
first, inputting the word pair (A:B) and its corresponding phrase array list;
secondly, replacing the input words in each element with a word X and a word Y respectively, and deleting the stop words among the remaining words to form a feature template;
thirdly, merging the same feature templates in the list to form an array pattern, and recording the occurrence frequency of each feature template in an array result;
fourth, calculating synonyms of the connective words in the middle of the templates by means of WordNet, clustering the templates whose connective words are synonymous, and adding up the corresponding frequencies of these templates; the clustered templates are stored in an array patternNew, and the template frequencies are stored in an array resultNew;
fifth, outputting patternNew and resultNew.
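The fourth step can be sketched as follows, assuming the NLTK WordNet corpus is installed (nltk.download("wordnet") has been run); treating two connective words as synonymous when they share at least one synset is a simplification of the synonym computation described above.

```python
# Sketch of the connective-word clustering step, assuming NLTK's WordNet corpus
# is available. Templates whose middle connective words share a WordNet synset
# are merged and their frequencies added, producing patternNew and resultNew.
from nltk.corpus import wordnet as wn

def are_synonyms(w1: str, w2: str) -> bool:
    # two words are treated as synonymous if they share at least one synset
    return w1 == w2 or bool(set(wn.synsets(w1)) & set(wn.synsets(w2)))

def templates_equivalent(t1, t2) -> bool:
    if len(t1) != len(t2):
        return False
    return all(a == b or (a not in ("X", "Y") and b not in ("X", "Y")
                          and are_synonyms(a, b)) for a, b in zip(t1, t2))

def cluster_templates(pattern, result):
    pattern_new, result_new = [], []
    for template, freq in zip(pattern, result):
        for i, kept in enumerate(pattern_new):
            if templates_equivalent(template, kept):
                result_new[i] += freq      # merge into an existing cluster
                break
        else:
            pattern_new.append(template)   # start a new cluster
            result_new.append(freq)
    return pattern_new, result_new

pattern = [("X", "large", "Y"), ("X", "big", "Y"), ("X", "flightless", "Y")]
result = [2, 1, 1]
print(cluster_templates(pattern, result))
# expected: ([('X', 'large', 'Y'), ('X', 'flightless', 'Y')], [3, 1])
```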
S350, clustering and merging the feature templates corresponding to the two short texts, and counting the frequency of each feature template in the corresponding phrase array to generate a corresponding feature template array and a template frequency array.
Generating a feature template array and a template frequency array, specifically, counting the number of feature templates with the same word pair in each short text, and calculating the occurrence frequency of the corresponding feature templates based on the number;
identifying whether the same connecting word exists between the feature templates with the same word pair;
if the feature templates exist, combining the two corresponding feature templates, adding the corresponding frequencies to obtain the frequencies of the combined feature templates in the corresponding phrase arrays, and generating a corresponding feature template array and a template frequency array.
Specifically, the semantic relatedness between every two connective words corresponding to the two word pairs is calculated. After all the actually significant connective words of each word pair are obtained, the semantic relatedness between all the connective words in the middle of the feature templates corresponding to one word pair and all the connective words in the middle of the feature templates corresponding to the other word pair is calculated. The specific flow is as follows:
firstly, inputting the connective word array wordlist and the frequency array numberlist corresponding to each word pair;
secondly, pairing the words in the connective word array corresponding to the stem word pair with the words in the connective word array corresponding to the option word pair to form word pairs;
thirdly, calculating the semantic relatedness between all such word pairs based on WordNet by using the glossvector algorithm;
fourth, outputting the pairwise semantic relatedness between all the connective words.
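As a simplified stand-in for the glossvector computation named in the third step, the sketch below represents each connective word by the bag of words in its WordNet glosses and compares the bags with cosine similarity; it approximates, but is not identical to, the gloss vector algorithm, and it assumes the NLTK WordNet corpus is installed.

```python
# Simplified stand-in for gloss-vector relatedness: represent each word by the
# bag of words in its WordNet glosses and compare the bags with cosine similarity.
import math
from collections import Counter
from nltk.corpus import wordnet as wn

def gloss_vector(word: str) -> Counter:
    bag = Counter()
    for synset in wn.synsets(word):
        bag.update(synset.definition().lower().split())
    return bag

def gloss_relatedness(w1: str, w2: str) -> float:
    v1, v2 = gloss_vector(w1), gloss_vector(w2)
    dot = sum(v1[t] * v2[t] for t in v1)
    norm = math.sqrt(sum(c * c for c in v1.values())) * \
           math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

print(round(gloss_relatedness("large", "big"), 3))
print(round(gloss_relatedness("large", "flightless"), 3))
```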
S360, calculating the similarity of the two short texts based on the feature template array and the template frequency array corresponding to the two short texts by using a relationship similarity calculation model obtained through pre-training.
For similarity calculation, determining a connective word array corresponding to each word pair based on a feature template array corresponding to each short text;
based on the stem words and the option words in the word pairs, determining corresponding connecting words from the connecting word array, inputting the determined connecting words into two input ends of a relational similarity calculation model which is obtained through pre-training, and carrying out vector filling on the word pairs through the relational similarity calculation model to obtain corresponding high-order dense vectors;
determining sharing weights based on the twin network in the relation similarity calculation model, and calculating feature weights of the main semantic information of the two short texts based on the sharing weights;
And calculating the distance between the high-order dense vectors corresponding to the two word pairs according to the characteristic weight to obtain the similarity of the two short texts.
In this embodiment, the relationship similarity calculation model is a similarity calculation model obtained by applying a semantic dictionary-based semantic relatedness algorithm to a statistics-based word-pair relationship similarity algorithm; that is, by applying the semantic dictionary-based semantic relatedness algorithm to the statistics-based word-pair relationship similarity algorithm, the two algorithms are combined in a certain form into a semantic-enhancement-based word-pair relationship similarity algorithm, abbreviated SSR (Statistic Semantic Relation), in which the semantic relatedness between the connective words is calculated by a method based on semantic resources.
Specifically, the SSR algorithm is obtained through training the following flow:
step 1, collecting and preprocessing a short text data set;
step 2, training and calculating based on a text similarity calculation model of the COV-BIGRU;
step 3, comparing the obtained distance with a similarity measurement value in the data set;
and 4, back-propagating to update the weight parameters of the model, so as to achieve the final effect.
In step 1, the data set is processed. Firstly, the stop words in the data set are removed; the stop words comprise a total of 179 words such as personal pronouns and prepositions.
Secondly, since the model requires fixed-length text input, a suitable length is selected through statistics as the interception length of the texts, so that each text can be truncated or padded: if a text is longer than this length, the excess part is truncated; if it is shorter, zero-vector padding is performed, thereby realizing the normalization processing.
Furthermore, the number-en word vector library is used to perform preliminary vectorization on the words in the text, that is, each word in each text is replaced by the corresponding 300-dimensional vector in the vector library, so that the text is converted into high-order dense vectors; the deep learning model then performs further vectorization on the text, thereby realizing the vectorization processing.
Finally, data fusion is performed directly by element-wise addition of the inputs. The programming language used in this process is Python and the toolkit used is Pandas. The new fused data set combines three data sets, namely the Quora data set, the SICK data set and the MSRP data set, and contains only four features: id, word1, word2 and sim. For the Quora data set, qid1 and qid2 are discarded, id corresponds to id, question1 corresponds to word1, question2 corresponds to word2, and is_duplicate corresponds to sim.
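The mapping of the Quora columns onto the fused id/word1/word2/sim schema can be sketched with Pandas as follows; the file name is a placeholder, and the SICK and MSRP loaders (omitted) would be built analogously.

```python
# Sketch of the data-set fusion described above, assuming the Quora duplicate
# questions CSV with columns id, qid1, qid2, question1, question2, is_duplicate.
# File names and the SICK/MSRP loaders are placeholders.
import pandas as pd

quora = pd.read_csv("quora_train.csv")
quora_fused = pd.DataFrame({
    "id": quora["id"],
    "word1": quora["question1"],
    "word2": quora["question2"],
    "sim": quora["is_duplicate"],
})

# sick_fused and msrp_fused would be built the same way from their own columns
fused = pd.concat([quora_fused], ignore_index=True)
print(fused.columns.tolist())   # ['id', 'word1', 'word2', 'sim']
```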
Further, the processed data set is input into the COV-BIGRU text similarity calculation model for training. The COV-BIGRU model comprises an embedding layer, a convolution layer, a bidirectional GRU layer, a pooling layer, a fully connected layer and a similarity calculation layer, and the training flow based on the COV-BIGRU structure is shown in fig. 4 and 5:
text input: two text pairs in the dataset containing similar information;
vector filling: vector filling is carried out on the text by using an Embedding library, and a high-order dense vector is formed after filling;
semantic information extraction: extracting important semantic information from the text by using a convolutional neural network layer, and adding higher weight to main information in the text after extracting;
extracting sequence information: extracting sequence information in the text through a bidirectional gating circulation unit, and integrating important semantic information from the upper part to the lower part in the text and from the lower part to the upper part in the text;
and (3) carrying out maximum pooling treatment: carrying out maximum pooling treatment to reasonably compress the feature quantity, thereby achieving the effect of reducing the dimension of the data;
connecting a fully connected layer: local features are integrated by connecting a fully connected layer as the last layer of the text encoder, and the one-dimensional semantic vector of each text is generated through the assembly operation of the weight matrices;
calculating the distance: the distance between the generated text vectors is measured by Manhattan distance calculation, and the obtained distance is compared with the similarity measurement value in the data set so as to back-propagate and update the weight parameters of the model. That is, when the two texts in the data set are labeled as similar but the value obtained in the similarity calculation is not 1, the error is propagated back and the parameters of the whole model are moved one step in the direction of 1; when the two texts are labeled as dissimilar but the value is not 0, the error is propagated back and the weight parameters of the whole model are updated so that the model moves one step in the direction of 0. Finally, the distance between the two pieces of data is output;
the Manhattan distance is calculated by subtracting the two vectors dimension by dimension, taking the absolute value of each difference, and summing these absolute values; the absolute value of the difference in a given dimension represents the distance between the vectors in that dimension. The Manhattan distance formula is as follows:
d(V, V′) = Σ_i |V_i − V′_i|
where V and V′ denote the two text vectors, d(V, V′) denotes the distance between them, and V_i and V′_i denote the vector values of the two vectors in dimension i.
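To illustrate the overall structure, the following is a minimal Keras sketch of a siamese encoder with embedding, convolution, bidirectional GRU, max pooling and a fully connected layer, whose two outputs are compared through an exponentiated negative Manhattan distance so that the result falls in (0, 1]; all layer sizes are illustrative assumptions and this is not the exact COV-BIGRU architecture of the embodiment.

```python
# Minimal siamese Conv + BiGRU similarity model mirroring the flow above
# (embedding, convolution, bidirectional GRU, max pooling, fully connected
# layer, Manhattan-distance similarity). Layer sizes are illustrative only.
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, VOCAB, EMB_DIM = 20, 30000, 300

def build_encoder() -> Model:
    inp = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(VOCAB, EMB_DIM)(inp)             # vector filling
    x = layers.Conv1D(128, 3, activation="relu")(x)       # key semantic features
    x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)  # sequence info
    x = layers.GlobalMaxPooling1D()(x)                    # max pooling
    x = layers.Dense(64, activation="relu")(x)            # fully connected layer
    return Model(inp, x)

encoder = build_encoder()                                  # shared (twin) weights
text_a = layers.Input(shape=(MAX_LEN,), dtype="int32")
text_b = layers.Input(shape=(MAX_LEN,), dtype="int32")
vec_a, vec_b = encoder(text_a), encoder(text_b)

# exp(-Manhattan distance) maps the distance into (0, 1] as a similarity score
manhattan = layers.Lambda(
    lambda t: tf.exp(-tf.reduce_sum(tf.abs(t[0] - t[1]), axis=1, keepdims=True))
)([vec_a, vec_b])

model = Model([text_a, text_b], manhattan)
model.compile(optimizer="adam", loss="mse")
model.summary()
```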
In summary, the method provided by this embodiment extracts word pairs from the two short texts, performs feature template extraction respectively, counts the extracted feature templates using cluster merging to construct a feature template array and a template frequency array, and finally calculates the similarity of the two short texts based on the feature template array and the template frequency array by using the relationship similarity calculation model, so that the problem of sparse short text data can be well solved while the context semantics are well taken into account.
In order to solve the technical problems, the embodiment of the invention also provides a device for determining the similarity of the short text. Referring specifically to fig. 6, fig. 6 is a basic block diagram of a similarity determining apparatus for short text according to the present embodiment, including:
a first extraction module 610, configured to obtain two short texts to be processed, and extract an entry file of each short text, where the entry file includes a plurality of word pairs;
a second extracting module 620, configured to extract phrases from the corresponding short text based on the entry file to obtain a phrase array, where each phrase in the phrase array includes at least one word of the word pair;
The rejecting module 630 is configured to segment each phrase in the phrase array by using the word pairs in the term file as a segmentation unit, and reject the semantic-free words in the segmented phrases to obtain corresponding feature templates;
the clustering module 640 is configured to cluster and combine the feature templates corresponding to the two short texts, and count frequencies of the feature templates in the phrase arrays corresponding to the feature templates, so as to generate a corresponding feature template array and a template frequency array;
the calculating module 650 is configured to calculate the similarity of the two short texts by using a pre-trained relationship similarity calculation model based on the feature template array and the template frequency array corresponding to the two short texts, where the relationship similarity calculation model is a similarity calculation model obtained by applying a semantic dictionary-based semantic relatedness algorithm to a statistics-based word-pair relationship similarity algorithm and then training.
In some embodiments, the first extraction module 610 is specifically configured to:
setting up a local Wikipedia mirror platform by using the MediaWiki engine software;
performing word sense recognition on the two short texts obtained from a known short text database by using a preset semantic recognition model, and screening out, based on the recognition results, the words whose semantics correspond to stem words and option words;
Inputting the screened words into a local mirror platform for mirror image processing to obtain mirror image words of each word, wherein the mirror image words and the corresponding words are words with the same or similar semantic meaning;
and constructing word pairs by the words and the corresponding mirror image words to obtain an entry file.
In some embodiments, the second extraction module 620 is specifically configured to:
each word pair in the entry file is used as an index, and sentences which take the words of the word pair as a starting word and/or an ending word in the corresponding short text are searched;
judging whether the length of the sentence meets a length range threshold value or not, and constructing a phrase array for the sentence within the length range threshold value based on a judging result.
In some embodiments, the second extraction module 620 is specifically configured to:
judging whether the length of a sentence taking the first word in the word pair as an ending word does not exceed a first threshold range;
judging whether the length of a sentence taking the second word in the word pair as an ending word does not exceed a second threshold range;
extracting the sentences which do not exceed the first threshold range and/or the sentences which do not exceed the second threshold range as phrases, and constructing the phrase array of the corresponding short text.
In some embodiments, the rejecting module 630 is specifically configured to:
determining a target word pair in the entry file, and extracting target phrases corresponding to a first word and a second word of the target word pair in the phrase array;
marking the positions of a first word and a second word in the target phrase;
determining parts of speech of the first word and the second word, and determining stop words and connective words between the first word and the second word based on the parts of speech;
and eliminating the stop words in the target phrase to obtain the corresponding feature templates.
In some embodiments, the clustering module 640 is specifically configured to:
counting the number of feature templates with the same word pair in each short text, and calculating the occurrence frequency of the corresponding feature templates based on the number;
identifying whether the same connecting word exists between the feature templates with the same word pair;
if the feature templates exist, combining the two corresponding feature templates, adding the corresponding frequencies to obtain the frequencies of the combined feature templates in the corresponding phrase arrays, and generating a corresponding feature template array and a template frequency array.
In some embodiments, the computing module 650 is specifically configured to:
Determining a connective word array corresponding to each word pair based on the feature template array corresponding to each short text;
based on the stem words and the option words in the word pairs, determining corresponding connecting words from the connecting word array, inputting the determined connecting words into two input ends of a relational similarity calculation model which is obtained through pre-training, and carrying out vector filling on the word pairs through the relational similarity calculation model to obtain corresponding high-order dense vectors;
determining sharing weights based on the twin network in the relation similarity calculation model, and calculating feature weights of the main semantic information of the two short texts based on the sharing weights;
and calculating the distance between the high-order dense vectors corresponding to the two word pairs according to the characteristic weight to obtain the similarity of the two short texts.
In summary, in this embodiment, feature templates are extracted from the two short texts by way of word-pair extraction, the extracted feature templates are counted using cluster merging to construct a feature template array and a template frequency array, and finally the similarity of the two short texts is calculated by the relationship similarity calculation model based on the feature template array and the template frequency array, so that the similarity of the two short texts is determined. Semantic enhancement is achieved by constructing the feature template array and the template frequency array, and the degree to which the extracted feature templates match the actual semantics and context of the short texts is improved, thereby improving the accuracy and efficiency of short text semantic similarity calculation.
In order to solve the above technical problems, the embodiment of the present application further provides a chip, where the chip may be a general-purpose processor or a special-purpose processor. The chip comprises a processor for supporting the terminal to perform the above related steps, e.g. to call and run a computer program from a memory, so that a device on which the chip is mounted performs the above method for short text similarity determination in the above embodiments.
Optionally, in some examples, the chip further includes a transceiver, where the transceiver is controlled by the processor, and is configured to support the terminal to perform the related steps to implement the short text similarity determining method in the foregoing embodiments.
Optionally, the chip may further comprise a storage medium.
It should be noted that the chip may be implemented using the following circuits or devices: one or more field programmable gate arrays (FPGA), programmable logic devices (PLD), controllers, state machines, gate logic, discrete hardware components, or any other suitable circuit or combination of circuits capable of performing the various functions described throughout this application.
The application also provides a terminal comprising a memory, a processor and a computer program stored in the memory and capable of running on the processor, wherein the processor realizes the steps of the short text similarity determination method provided by the embodiment when executing the computer program.
Referring specifically to fig. 7, fig. 7 is a basic block diagram illustrating a terminal including a processor, a nonvolatile storage medium, a memory, and a network interface connected by a system bus. The nonvolatile storage medium of the terminal stores an operating system, a database and a computer readable instruction, the database can store a control information sequence, and the computer readable instruction can enable the processor to realize a short text similarity determination method when the computer readable instruction is executed by the processor. The processor of the terminal is operative to provide computing and control capabilities supporting the operation of the entire terminal. The memory of the terminal may have stored therein computer readable instructions that, when executed by the processor, cause the processor to perform a short text similarity determination method. The network interface of the terminal is used for connecting and communicating with the terminal. It will be appreciated by persons skilled in the art that the structures shown in the drawings are block diagrams of only some of the structures associated with the aspects of the application and are not limiting of the terminals to which the aspects of the application may be applied, and that a particular terminal may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
As used herein, a "terminal" or "terminal device" includes both a device of a wireless signal receiver having no transmitting capability and a device of receiving and transmitting hardware having electronic devices capable of performing two-way communication over a two-way communication link, as will be appreciated by those skilled in the art. Such an electronic device may include: a cellular or other communication device having a single-line display or a multi-line display or a cellular or other communication device without a multi-line display; a PCS (Personal Communications Service, personal communication system) that may combine voice, data processing, facsimile and/or data communication capabilities; a PDA (Personal Digital Assistant ) that can include a radio frequency receiver, pager, internet/intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System ) receiver; a conventional laptop and/or palmtop computer or other appliance that has and/or includes a radio frequency receiver. As used herein, "terminal," "terminal device" may be portable, transportable, installed in a vehicle (aeronautical, maritime, and/or land-based), or adapted and/or configured to operate locally and/or in a distributed fashion, to operate at any other location(s) on earth and/or in space. The "terminal" and "terminal device" used herein may also be a communication terminal, a network access terminal, and a music/video playing terminal, for example, may be a PDA, a MID (Mobile Internet Device ), and/or a mobile phone with a music/video playing function, and may also be a smart tv, a set top box, and other devices.
The present invention also provides a storage medium storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the short text similarity determination method of any of the embodiments described above.
The present embodiment also provides a computer program which can be distributed on a computer readable medium and executed by a computing device to implement at least one step of the short text similarity determination method described above; in some cases, at least one of the steps shown or described may be performed in an order different from that described in the above embodiments.
The present embodiment also provides a computer program product comprising computer readable means on which a computer program as described above is stored. The computer readable means in this embodiment may comprise a computer readable storage medium as described above.
Those skilled in the art will appreciate that all or part of the methods of the above embodiments may be implemented by a computer program stored in a computer-readable storage medium which, when executed, may comprise the steps of the method embodiments described above. The storage medium may be a nonvolatile storage medium such as a magnetic disk, an optical disk or a read-only memory (ROM), or a random access memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the order of these steps is not strictly limited and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same moment but may be performed at different moments; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The technical features of the above embodiments may be combined arbitrarily. For brevity of description, not all possible combinations of these technical features are described; however, as long as a combination of technical features contains no contradiction, it should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the invention and are described in detail, but they are not to be construed as limiting the scope of the invention. It should be noted that several variations and modifications may be made by those skilled in the art without departing from the spirit of the invention, and these all fall within the scope of the invention. Accordingly, the scope of protection of the invention is to be determined by the appended claims.
Claims (9)
1. A short text similarity determination method, comprising:
acquiring two short texts to be processed, and extracting an entry file of each short text, wherein the entry file comprises a plurality of word pairs;
carrying out phrase extraction on the short text corresponding to the entry file to obtain a phrase array, wherein each phrase in the phrase array comprises at least one word of a word pair;
taking word pairs in the entry file as segmentation units, segmenting each phrase in the phrase array, and eliminating semantic-free words in the segmented phrases to obtain corresponding feature templates;
clustering and merging the feature templates corresponding to the two short texts, and counting the frequency of each feature template in its corresponding phrase array to generate a corresponding feature template array and template frequency array;
and calculating the similarity of the two short texts by using a pre-trained relationship similarity calculation model based on the feature template arrays and the template frequency arrays corresponding to the two short texts, wherein the relationship similarity calculation model is a similarity calculation model obtained by applying a semantic relevance algorithm based on a semantic dictionary to a statistics-based word pair relationship similarity algorithm and then training.
2. The method for determining similarity of short texts according to claim 1, wherein the acquiring two short texts to be processed and extracting an entry file of each short text comprises:
setting up a local mirror platform of Wikipedia by using the MediaWiki engine software;
performing word sense recognition on the two short texts obtained from a known short text database by using a preset semantic recognition model, and screening out, based on the recognition results, words whose semantic roles are stem words and option words;
inputting the screened words into the local mirror platform for mirror image processing to obtain a mirror image word for each word, wherein a mirror image word and its corresponding word have the same or similar semantic meaning;
and constructing word pairs from the words and their corresponding mirror image words to obtain the entry file.
3. The method for determining similarity of short text according to claim 1, wherein the carrying out phrase extraction on the short text corresponding to the entry file to obtain a phrase array comprises:
using each word pair in the entry file as an index, searching the corresponding short text for sentences whose starting word and/or ending word is a word of the word pair;
and judging whether the length of each sentence meets a length range threshold, and constructing the phrase array from the sentences within the length range threshold based on the judgment result.
4. The short text similarity determination method according to claim 3, wherein the judging whether the length of the sentence satisfies a length range threshold, and constructing a phrase array for sentences lying within the length range threshold based on the judgment result, comprises:
judging whether the length of a sentence taking the first word of the word pair as its ending word does not exceed a first threshold range;
judging whether the length of a sentence taking the second word of the word pair as its ending word does not exceed a second threshold range;
and extracting, as phrases, the sentences that do not exceed the first threshold range and/or the sentences that do not exceed the second threshold range, and constructing the phrase array of the corresponding short text.
5. The method for determining similarity of short text according to claim 4, wherein the taking word pairs in the entry file as segmentation units, segmenting each phrase in the phrase array, and eliminating semantic-free words in the segmented phrases to obtain corresponding feature templates comprises:
determining a target word pair in the entry file, and extracting target phrases corresponding to a first word and a second word of the target word pair in the phrase array;
marking the positions of the first word and the second word in the target phrase;
determining the parts of speech of the first word and the second word, and determining stop words and connecting words between the first word and the second word based on the parts of speech;
and eliminating the stop words in the target phrase to obtain the corresponding feature templates.
6. The method for determining similarity of short text according to claim 5, wherein the clustering and merging the feature templates corresponding to the two short texts, and counting the frequency of each feature template in its corresponding phrase array to generate the corresponding feature template array and template frequency array, comprises:
counting the number of feature templates with the same word pair in each short text, and calculating the occurrence frequency of the corresponding feature templates based on the number;
identifying whether the same connecting word exists between the feature templates with the same word pair;
and if so, combining the two corresponding feature templates and adding their frequencies to obtain the frequency of the combined feature template in the corresponding phrase array, so as to generate the corresponding feature template array and template frequency array.
7. The method for determining similarity of short texts according to claim 6, wherein the calculating the similarity of the two short texts based on the feature template arrays and the template frequency arrays corresponding to the two short texts by using a pre-trained relationship similarity calculation model comprises:
determining a connecting word array corresponding to each word pair based on the feature template array corresponding to each short text;
determining, based on the stem word and the option word in each word pair, the corresponding connecting words from the connecting word array, inputting the determined connecting words into the two inputs of the pre-trained relationship similarity calculation model, and carrying out vector filling on the word pairs through the relationship similarity calculation model to obtain corresponding high-order dense vectors;
determining shared weights based on the twin network in the relationship similarity calculation model, and calculating feature weights of the main semantic information of the two short texts based on the shared weights;
and calculating the distance between the high-order dense vectors corresponding to the two word pairs according to the feature weights to obtain the similarity of the two short texts.
8. A similarity determination device for short texts, comprising:
the first extraction module is used for obtaining two short texts to be processed and extracting entry files of the short texts, wherein the entry files comprise a plurality of word pairs;
the second extraction module is used for carrying out phrase extraction on the short text corresponding to the entry file to obtain a phrase array, wherein each phrase in the phrase array comprises at least one word of a word pair;
the rejecting module is used for segmenting each phrase in the phrase array by taking the word pairs in the entry file as segmentation units, and eliminating semantic-free words in the segmented phrases to obtain corresponding feature templates;
the clustering module is used for carrying out cluster combination on the feature templates corresponding to the two short texts, counting the frequency of each feature template in the corresponding phrase array, and generating a corresponding feature template array and a template frequency array;
the calculating module is used for calculating the similarity of the two short texts by using a pre-trained relationship similarity calculation model based on the feature template arrays and the template frequency arrays corresponding to the two short texts, wherein the relationship similarity calculation model is a similarity calculation model obtained by applying a semantic relevance algorithm based on a semantic dictionary to a statistics-based word pair relationship similarity algorithm and then training.
9. A terminal comprising a second memory, a second processor and a computer program stored in the second memory and executable on the second processor, characterized in that the second processor implements the steps of the short text similarity determination method according to any one of claims 1 to 7 when executing the computer program.
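The following sketches are illustrative only and do not form part of the claims. This first one approximates the phrase-extraction step of claims 3 and 4: each word pair indexes the short text, and candidate sentences ending with one of the pair's words are kept as phrases only if they fall within a length threshold. The sentence splitting, tokenization and default threshold values are assumptions made for the example.

```python
import re

def extract_phrase_array(short_text, word_pairs, first_threshold=12, second_threshold=12):
    # Naive sentence split on punctuation; a real system would use a proper segmenter.
    sentences = [s.strip() for s in re.split(r"[.!?;]", short_text) if s.strip()]
    phrase_array = []
    for first_word, second_word in word_pairs:
        for sentence in sentences:
            tokens = sentence.split()
            if not tokens:
                continue
            # Keep sentences ending with the first word that stay within the first
            # threshold, and/or sentences ending with the second word within the second.
            if tokens[-1] == first_word and len(tokens) <= first_threshold:
                phrase_array.append(sentence)
            elif tokens[-1] == second_word and len(tokens) <= second_threshold:
                phrase_array.append(sentence)
    return phrase_array
```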
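A minimal sketch of the template-construction step of claim 5: the positions of the target word pair are marked in a phrase, and stop words between them are removed, leaving the connecting words. The hard-coded stop-word set is a placeholder; the claim derives stop words from part-of-speech information instead.

```python
STOP_WORDS = {"the", "a", "an", "of", "to", "in"}  # placeholder, not the claimed POS-based test

def build_feature_template(phrase_tokens, first_word, second_word):
    # Mark the positions of the two words of the target word pair in the phrase.
    try:
        i = phrase_tokens.index(first_word)
        j = phrase_tokens.index(second_word)
    except ValueError:
        return None  # the phrase does not contain both words of the pair
    lo, hi = sorted((i, j))
    # Eliminate stop words between the marked positions, keeping connecting words.
    middle = [t for t in phrase_tokens[lo + 1:hi] if t.lower() not in STOP_WORDS]
    return (first_word, *middle, second_word)
```

Applied to every target phrase of both texts, this yields the per-text template lists consumed by the merging step sketched next.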
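A sketch of the cluster-merging step of claim 6, under the assumption that each feature template is a tuple of the form (first_word, *connecting_words, second_word): templates sharing the same word pair and the same connecting words are merged and their frequencies added, producing the feature template array and the template frequency array.

```python
from collections import defaultdict

def merge_templates(templates_a, templates_b):
    counts = defaultdict(int)
    for template in list(templates_a) + list(templates_b):
        # Grouping key: the word pair plus the connecting words between its two words.
        key = (template[0], template[-1], tuple(template[1:-1]))
        counts[key] += 1
    # Rebuild one representative template per merged group, with aligned frequencies.
    feature_template_array = [(first, *mid, second) for first, second, mid in counts]
    template_frequency_array = list(counts.values())
    return feature_template_array, template_frequency_array
```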
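Finally, a sketch of the twin-network scoring of claim 7, assuming a shared embedding matrix stands in for the shared weights and cosine similarity stands in for the learned distance between the high-order dense vectors; the disclosed model is trained, and its vector filling and feature weighting are not reproduced here.

```python
import numpy as np

def siamese_similarity(pair_a_ids, pair_b_ids, embedding):
    # Both word pairs are encoded with the same `embedding` matrix (the shared
    # weights of the twin branches), then compared by cosine similarity.
    vec_a = embedding[pair_a_ids].mean(axis=0)  # dense vector for word pair A
    vec_b = embedding[pair_b_ids].mean(axis=0)  # dense vector for word pair B
    denom = float(np.linalg.norm(vec_a) * np.linalg.norm(vec_b))
    return float(vec_a @ vec_b) / denom if denom else 0.0

# Usage with a random, untrained embedding, purely for illustration:
rng = np.random.default_rng(0)
embedding = rng.normal(size=(1000, 64))
score = siamese_similarity([3, 17], [3, 42], embedding)
```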
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311055331.0A CN117076946A (en) | 2023-08-22 | 2023-08-22 | Short text similarity determination method, device and terminal |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117076946A true CN117076946A (en) | 2023-11-17 |
Family
ID=88716629
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311055331.0A Pending CN117076946A (en) | 2023-08-22 | 2023-08-22 | Short text similarity determination method, device and terminal |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117076946A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117592474A (en) * | 2024-01-18 | 2024-02-23 | 武汉杏仁桉科技有限公司 | Splitting processing method and device for multiple Chinese phrases |
CN117592474B (en) * | 2024-01-18 | 2024-04-30 | 武汉杏仁桉科技有限公司 | Splitting processing method and device for multiple Chinese phrases |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||