CN114490932B - Semantic speculation method based on text similarity and keywords - Google Patents

Semantic speculation method based on text similarity and keywords

Info

Publication number
CN114490932B
CN114490932B (application number CN202210069600.8A)
Authority
CN
China
Prior art keywords
phrase
vocabulary
matching
matching degree
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210069600.8A
Other languages
Chinese (zh)
Other versions
CN114490932A (en)
Inventor
岳希
鲁晓浩
何磊
唐聃
曾琼
罗涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology filed Critical Chengdu University of Information Technology
Priority to CN202210069600.8A priority Critical patent/CN114490932B/en
Publication of CN114490932A publication Critical patent/CN114490932A/en
Application granted granted Critical
Publication of CN114490932B publication Critical patent/CN114490932B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval of unstructured textual data
    • G06F16/36 - Creation of semantic tools, e.g. ontology or thesauri

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a semantic inference method based on text similarity and keywords. The method performs word segmentation on the text stored in a database to obtain a plurality of irreducible words, judges the number of characters of each irreducible word, and builds a phrase dictionary and a vocabulary dictionary; it then performs word segmentation on the user's input text, judges the number of characters of each irreducible word in the input, and obtains a phrase list and a vocabulary list; each second phrase is matched against the phrase dictionary and a first matching degree is calculated: if the first matching degree is 1, the matched text is output as the final inference result; if the first matching degree is less than 1, the one or more texts with the highest first matching degree are output as the inference result. The method aims to solve problems of the prior art such as excessive computation and inference results that do not meet expectations, thereby reducing unnecessary computation and improving the accuracy of the inference result.

Description

Semantic speculation method based on text similarity and keywords
Technical Field
The invention relates to the field of semantic speculation, in particular to a semantic speculation method based on text similarity and keywords.
Background
With the rapid development of information and network technology, network information grows rapidly, bringing a massive increase in data volume. Network data now permeates every aspect of daily life, and text similarity matching algorithms are used wherever knowledge, data, or information is processed. Semantic inference and analysis can likewise be based on text similarity.
Existing methods for calculating text similarity include edit-distance methods such as fuzzy matching, which count the operations needed to transform one text into the other; vector-based methods such as cosine distance, which judge text similarity by the distance between two vectors; and the Jaccard similarity coefficient, computed as the ratio of the intersection to the union of two texts. However, fuzzy matching requires an excessive number of insertions and deletions when the user's input differs greatly from the target text. Vector-based methods such as cosine distance place high demands on word segmentation, and when both the database entries and the user input are single irreducible words, no meaningful vectors can be formed to calculate similarity. When the Jaccard coefficient is applied directly to inferring the meaning of irreducible words, two words that are not exactly identical have an empty intersection, so the coefficient is 0 and the inference result does not meet expectations.
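The Jaccard limitation described above can be seen directly in a short sketch (hypothetical example, not part of the patent): when each irreducible word is treated as an atomic set element, two non-identical words share nothing, so the coefficient degenerates to 0.

```python
def jaccard(set_a, set_b):
    """Jaccard similarity coefficient: |A ∩ B| / |A ∪ B|."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Two irreducible words as atomic elements: unless they are identical,
# the intersection is empty and the coefficient is 0, regardless of how
# similar the words actually are.
similar_but_distinct = jaccard({"apple"}, {"apply"})   # 0.0
identical = jaccard({"apple"}, {"apple"})              # 1.0
```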
Disclosure of Invention
The invention provides a semantic inference method based on text similarity and keywords, which aims to solve problems of the prior art such as excessive computation and inference results that do not meet expectations, and to reduce unnecessary computation and improve the accuracy of the inference result.
The invention is realized by the following technical scheme:
the semantic inference method based on text similarity and keywords comprises the following steps:
s1, performing word segmentation processing on the text stored in the database to obtain a plurality of irreducible words, and judging the number of characters N of each irreducible word in the database:
if N is greater than or equal to 3, storing the irreducible word in a phrase dictionary in the form of a key-value pair; an irreducible word in the phrase dictionary is defined as a first phrase;
if N is less than or equal to 2, storing the irreducible word in a vocabulary dictionary in the form of a key-value pair; an irreducible word in the vocabulary dictionary is defined as a first vocabulary;
s2, performing word segmentation processing on the input text of the user to obtain a plurality of irreducible words, and judging the number of characters M of each irreducible word in the input text:
if M is greater than or equal to 3, storing the irreducible word in a phrase list in the form of a key-value pair; an irreducible word in the phrase list is defined as a second phrase;
if M is less than or equal to 2, storing the irreducible word in a vocabulary list in the form of a key-value pair; an irreducible word in the vocabulary list is defined as a second vocabulary;
s3, matching each second phrase in the phrase dictionary, and calculating a first matching degree:
if the first matching degree is 1, outputting the matched text as the final inference result;
if the first matching degree is less than 1, outputting the one or more texts with the highest first matching degree as the inference result.
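The dictionary-building of steps S1 and S2 can be sketched in Python as follows. This is an illustration only: `segment` is a stub (a real system would use a Chinese word segmenter such as jieba), and all names are hypothetical, not taken from the patent.

```python
from collections import defaultdict

def segment(text):
    # Placeholder tokenizer: splits on whitespace. A real implementation
    # would produce the irreducible words of the text.
    return text.split()

def build_indexes(texts):
    """Route each irreducible word by character count (S1): length >= 3
    goes to the phrase dictionary, length <= 2 to the vocabulary
    dictionary, stored as key-value pairs of word -> document ids."""
    phrase_dict = defaultdict(list)
    vocab_dict = defaultdict(list)
    for doc_id, text in enumerate(texts):
        for word in segment(text):
            if len(word) >= 3:
                phrase_dict[word].append(doc_id)
            else:
                vocab_dict[word].append(doc_id)
    return phrase_dict, vocab_dict
```

Step S2 applies the same routing to the user's input, producing the phrase list and vocabulary list instead of dictionaries.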
The invention provides a semantic inference method based on text similarity and keywords, addressing the problems in the prior art that fuzzy matching requires excessive operations and that other algorithms, such as cosine distance, produce inference results that do not meet expectations. The method first processes the text data in the database, dividing all texts into a number of irreducible words through word segmentation; an irreducible word in this application is a unit that cannot be segmented further, and may be a phrase, a word, or a single character. For each irreducible word obtained from the database, the number of its characters is judged: if the number is greater than or equal to 3, it is treated as a phrase and stored in a phrase dictionary in the form of a key-value pair; if the number is less than or equal to 2, it is treated as a vocabulary and stored in a vocabulary dictionary in the form of a key-value pair. All phrases in the phrase dictionary are defined as first phrases, and all entries in the vocabulary dictionary are defined as first vocabularies.
Then, the user input is detected and the input text is segmented into irreducible words, and the number of characters of each irreducible word in the input text is judged: if the number is greater than or equal to 3, it is treated as a phrase and stored in a phrase list in the form of a key-value pair; if the number is less than or equal to 2, it is treated as a vocabulary and stored in a vocabulary list in the form of a key-value pair. All phrases in the phrase list are defined as second phrases, and all entries in the vocabulary list are defined as second vocabularies. Each second phrase is then matched against the first phrases in the phrase dictionary, and a first matching degree is calculated. If the first matching degree is 100%, every second phrase is fully matched in the phrase dictionary, and the matched text is output as the final inference result. If the first matching degree is less than 1, the second phrases cannot be fully matched in the phrase dictionary, and the one or more texts with the highest first matching degree are output as the inference result.
By matching the database text and the user input after word segmentation, the method significantly reduces the amount of matching computation compared with edit-distance algorithms such as fuzzy matching in the prior art, and avoids the waste of computing power incurred when transforming a short text into a long text.
Further, step S1 includes performing the following key-value processing on the first phrase:
defining a front-half character string and a back-half character string:
if N is odd, the irreducible word is divided at the central character into a front half and a back half, both of which include the central character;
if N is even, the irreducible word is divided equally into a front half and a back half;
the front half and the back half are each used as keys, the value of the irreducible word's corresponding key in the phrase dictionary is used as the value, and the pairs are stored in the vocabulary dictionary in the form of key-value pairs.
After word segmentation of the text stored in the database, the first phrases obtained are further processed to make the vocabulary dictionary more complete, specifically as follows. The processing depends on whether the number of characters in the first phrase is odd or even: if odd, the first phrase cannot be divided into equal halves, so the central character is taken as the boundary and both the front and back parts include it; if even, the phrase is divided directly into equal front and back parts. The front-half and back-half strings obtained in this way are stored in the vocabulary dictionary as key-value pairs to supplement and complete it, which significantly improves matching accuracy and reduces error when vocabulary similarity must be calculated later.
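The front-half/back-half split described above can be sketched as follows (the function name is illustrative, not from the patent): an odd-length word shares its central character between both halves, while an even-length word splits cleanly in two.

```python
def half_split(word):
    """Split an irreducible word into front-half and back-half strings.
    Odd length: both halves include the central character (S1 key-value
    processing); even length: the word is divided equally."""
    n = len(word)
    mid = n // 2
    if n % 2 == 1:
        return word[:mid + 1], word[mid:]
    return word[:mid], word[mid:]
```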
Further, in step S3, the method for calculating the first matching degree includes:
s301, sorting the second phrases in text order and assigning each a weight A_1, A_2, …, A_i in turn, with A_1 the largest; where i is the index of the second phrase, and for i greater than or equal to 3, A_i ≤ A_(i-1);
s302, extracting the second phrases one by one and matching them against the phrase dictionary to obtain the matching degree n_1, n_2, …, n_i of each second phrase;
s303, calculating the first matching degree: first matching degree = A_1 × n_1 + A_2 × n_2 + … + A_i × n_i.
This scheme provides the specific calculation of the first matching degree. First, the second phrases are sorted in the order in which they appear in the user's input text; each is then assigned a weight in turn, with the weight A_1 of the first-ranked second phrase being the largest and all other weights not exceeding A_1. From the third-ranked second phrase onward, each weight is less than or equal to the weight of the preceding second phrase. The specific weight values can be set adaptively by those skilled in the art according to the operating conditions, provided the requirements above are met; the core idea is that the earlier a second phrase is ranked, the larger its weight. Finally, the products of each second phrase's matching degree and its weight are summed to obtain the first matching degree.
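The weighted sum of step S303 in sketch form; the example weights are hypothetical values chosen only to satisfy the constraint that A_1 is largest and the sequence is non-increasing.

```python
def first_matching_degree(weights, degrees):
    """First matching degree = A_1*n_1 + A_2*n_2 + ... + A_i*n_i (S303).
    `weights` are the per-phrase weights A_i, `degrees` the per-phrase
    matching degrees n_i."""
    return sum(a * n for a, n in zip(weights, degrees))

# Three second phrases, non-increasing weights summing to 1: a full match
# in the phrase dictionary gives every n_i = 1 and a first degree of 1.
full = first_matching_degree([0.5, 0.3, 0.2], [1, 1, 1])
partial = first_matching_degree([0.5, 0.5], [1, 0.5])
```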
Further, in step S302:
if the second phrase has a direct matching item in the phrase dictionary, the matching degree of the corresponding second phrase is 1;
if the second phrase does not have a direct match in the phrase dictionary, its matching degree is determined by character similarity calculation, i.e., by computing the similarity of the individual characters in the second phrase.
Further, the character similarity calculation includes the following steps:
s3021, splitting the second phrase without the direct matching item into single characters, and matching each character with the single character in each first phrase one by one; if a certain character is successfully matched in a certain first phrase, finishing the matching of the first phrase and entering the matching with the next first phrase;
s3022, dividing the number of successfully matched single characters by the total number of characters in the successfully matched second phrase to obtain the similarity of the second phrase to the corresponding first phrase, and extracting the maximum value as the highest similarity;
s3023, updating the weight of the second phrase to (A_i × highest similarity), where A_i is the original weight corresponding to the second phrase.
This scheme further specifies the character similarity calculation. First, a second phrase with no direct match in the phrase dictionary is split into individual characters. Each character is then matched in order against the individual characters of each first phrase; for a given character, as soon as one match succeeds within a first phrase, the remaining characters of that first phrase need not be checked, the character is considered matched against that first phrase, and matching proceeds to the next first phrase. After all characters of the second phrase have been processed, the number of characters with a successful match is divided by the number of characters in the second phrase to obtain the phrase similarity; the highest similarity is then extracted as the result closest to the user's intent. In addition, the result of the character similarity calculation must enter the weight calculation: the original weight A_i of the second phrase is multiplied by its extracted highest similarity, the product is taken as the updated weight of the second phrase, and the updated weight is substituted into the formula for the first matching degree.
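Steps S3021 and S3022 amount to scoring each character of the unmatched second phrase against every first phrase and keeping the best ratio. A minimal sketch, assuming set membership is an adequate stand-in for the one-by-one scan (the scan stops at the first hit, so only occurrence matters):

```python
def highest_char_similarity(second_phrase, first_phrases):
    """Per-character matching (S3021): a character counts as matched
    against a first phrase if it occurs anywhere in it. The similarity
    for each first phrase is matched-count / total characters (S3022);
    the maximum over all first phrases is returned."""
    best = 0.0
    for fp in first_phrases:
        hits = sum(1 for ch in second_phrase if ch in fp)
        best = max(best, hits / len(second_phrase))
    return best
```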
Further, the calculating the first matching degree further includes:
s3024, if all the second phrases do not have direct matching items in the phrase dictionary, directly outputting updated weights of all the second phrases, and calculating a first matching degree;
s3025, if at least one second phrase has a direct matching item in the phrase dictionary, adding the weight difference reduced by all second phrases without the direct matching item to the first second phrase with the direct matching item, outputting the updated weight of all second phrases, and calculating the first matching degree.
This scheme further refines the character similarity calculation. After the character similarity of all second phrases has been computed, it is judged whether none of the second phrases has a direct match in the phrase dictionary:
if so, the updated weights of all second phrases obtained through steps S3021 to S3023 are output to calculate the first matching degree; in this case the sum of all second-phrase weights may no longer be 1, and those skilled in the art will understand that the processed text with the highest similarity, together with its position in the database, should be taken as the closest inference result.
If not, i.e., at least one second phrase has a direct match in the phrase dictionary and at least one does not, the weights of the second phrases without a direct match are updated by the method of steps S3021 to S3023. Since an updated weight necessarily decreases, the lost difference is added to the first second phrase (in input-text order) that has a direct match; the updated weights of all second phrases are then output and the first matching degree is calculated.
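Steps S3024 and S3025 can be sketched as a single weight-redistribution pass; `has_direct` marks which second phrases found a direct match (all names are illustrative, not from the patent).

```python
def redistribute_weights(weights, similarities, has_direct):
    """S3024/S3025: scale down the weight of each second phrase without a
    direct match by its highest character similarity; if at least one
    phrase does have a direct match, add the total lost weight to the
    first such phrase, otherwise leave the reduced weights as output."""
    updated = [w if d else w * s
               for w, s, d in zip(weights, similarities, has_direct)]
    lost = sum(weights) - sum(updated)
    if any(has_direct) and lost > 0:
        updated[has_direct.index(True)] += lost
    return updated
```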
Further, when there are a plurality of texts with the highest first matching degree, the method further includes:
s4, matching the second vocabulary in the vocabulary dictionary to obtain a second matching degree;
s5, calculating the final matching degree = (first matching degree × weight of the first matching degree) + (second matching degree × weight of the second matching degree);
s6, taking the final matching degree as the inference value, and outputting the one or more texts with the highest inference value as the inference result.
Since multiple texts in the database may share the same highest first matching degree, the inference result at this stage may still carry a large error. This scheme optimizes the result by calculating the matching degree of the vocabulary: specifically, the second vocabularies are matched in the vocabulary dictionary to obtain a second matching degree, the first and second matching degrees are then weighted to obtain a final matching degree, and the text with the highest final matching degree is output as the inference result. If multiple texts still share the highest final matching degree, all of them are closely related to the user input, and even matching the vocabularies of two or fewer characters cannot single out an optimal solution; in this case the texts can only be output together for the user to choose from. This scheme significantly improves the inference result: when the first matching degree alone cannot extract a single text, an effective secondary inference is performed, reducing inference error and avoiding the unnecessary output of irrelevant results.
Further, when the phrase list is an empty set, the weight of the first matching degree is 0, and the weight of the second matching degree is 1;
when the phrase list is not an empty set and the vocabulary list is an empty set, the weight of the first matching degree is 1, and the weight of the second matching degree is 0;
when the phrase list is not an empty set and the vocabulary list is not an empty set, the weight of the first matching degree is 0.6, and the weight of the second matching degree is 0.4.
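The weight-selection rules above, combined with step S5, in sketch form (the fixed 0.6/0.4 split for the both-non-empty case is taken directly from the description; function names are illustrative):

```python
def degree_weights(phrase_list, vocab_list):
    """Weight selection: empty phrase list -> rely entirely on the second
    matching degree; empty vocabulary list -> entirely on the first;
    otherwise the 0.6 / 0.4 split given in the description."""
    if not phrase_list:
        return 0.0, 1.0
    if not vocab_list:
        return 1.0, 0.0
    return 0.6, 0.4

def final_matching_degree(first_deg, second_deg, phrase_list, vocab_list):
    """S5: weighted combination used to break ties among texts that share
    the highest first matching degree."""
    w1, w2 = degree_weights(phrase_list, vocab_list)
    return first_deg * w1 + second_deg * w2
```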
Further, when there are at least two second words in the word list, the second matching degree is calculated by the following method:
if a second vocabulary has a direct matching item in the vocabulary dictionary, the matching degree of the second vocabulary is 1;
if a second vocabulary does not have a direct matching item in the vocabulary dictionary, determining the matching degree of the second vocabulary by adopting character similarity calculation;
second matching degree = Σ_k (m_k × B_k)
where k is the index of the second vocabulary in the vocabulary list, B_k is the matching degree of the k-th second vocabulary, and m_k is the weight of the k-th second vocabulary.
The second matching degree is calculated by weighting the matching degrees of all the second words as described above. Of course, the weight of each second vocabulary therein can also be adaptively set by those skilled in the art according to the needs of specific working conditions.
Further, the method for determining the matching degree of the second vocabulary through character similarity calculation comprises the following steps:
splitting a second vocabulary without direct matching items into single characters, and matching each character with a single character in each first vocabulary one by one; if a certain character is successfully matched in a certain first vocabulary, the matching of the first vocabulary is finished, and the matching with the next first vocabulary is entered;
dividing the number of successfully matched single characters by the total number of characters in the successfully matched second vocabulary to obtain the similarity of the second vocabulary relative to the corresponding first vocabulary;
extracting the highest similarity and updating the weight of the second vocabulary to (m_k × highest similarity), where m_k is the original weight corresponding to the second vocabulary;
if all the second vocabularies do not have direct matching items in the vocabulary dictionary, directly outputting updated weights of all the second vocabularies, and calculating the final matching degree;
if at least one second vocabulary has a direct matching item in the vocabulary dictionary, the weight difference values reduced by all the second vocabularies without the direct matching items are added to the first second vocabulary with the direct matching items, and then updated weights of all the second vocabularies are output to calculate the final matching degree.
The present scheme provides a scheme for calculating the character similarity of the matching degree of the second vocabulary, and the principle thereof is the same as the above-mentioned character similarity calculation process of the matching degree of the second phrase, which is not repeated herein.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The semantic inference method based on text similarity and keywords performs word segmentation on the database and on the user input; compared with existing text edit-distance algorithms such as fuzzy matching, it effectively reduces the amount of computation and avoids wasting computing power.
2. The method provides a character-by-character similarity algorithm for irreducible words, determining whether each character matches within each word, and thereby offers a new way to calculate the similarity of two words during semantic inference beyond the existing edit-distance algorithms.
3. Compared with existing algorithms such as the Jaccard similarity coefficient, the method solves the problem that the inference result fails to meet expectations when the intersection is empty, mitigates the problem of an overly large divisor when operating on characters, and significantly improves the accuracy of the inference result.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic flow chart of an embodiment of the present invention;
FIG. 2 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention;
FIG. 3 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention;
FIG. 4 is a diagram illustrating the speculative result of one embodiment of the present invention;
FIG. 5 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention;
FIG. 6 is a diagram illustrating a speculative result in accordance with one embodiment of the present invention;
FIG. 7 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention;
FIG. 8 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention;
FIG. 9 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention;
FIG. 10 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention;
FIG. 11 is a diagram illustrating a speculative result in accordance with an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention. In the description of the present application, it is to be understood that the terms "front", "back", "left", "right", "upper", "lower", "vertical", "horizontal", "high", "low", "inner", "outer", etc. indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience in describing the present invention and simplifying the description, but do not indicate or imply that the device or element being referred to must have a particular orientation, be constructed in a particular orientation, and be operated, and thus should not be construed as limiting the scope of the present application.
Example 1:
the semantic inference method based on text similarity and keywords as shown in fig. 1 includes the following steps:
s1, performing word segmentation processing on the text stored in the database to obtain a plurality of irreducible words, and judging the number of characters N of each irreducible word in the database:
if N is greater than or equal to 3, storing the irreducible word in a phrase dictionary in the form of a key-value pair; an irreducible word in the phrase dictionary is defined as a first phrase;
if N is less than or equal to 2, storing the irreducible word in a vocabulary dictionary in the form of a key-value pair; an irreducible word in the vocabulary dictionary is defined as a first vocabulary;
s2, performing word segmentation processing on the input text of the user to obtain a plurality of irreducible words, and judging the number of characters M of each irreducible word in the input text:
if M is greater than or equal to 3, storing the irreducible word in a phrase list in the form of a key-value pair; an irreducible word in the phrase list is defined as a second phrase;
if M is less than or equal to 2, storing the irreducible word in a vocabulary list in the form of a key-value pair; an irreducible word in the vocabulary list is defined as a second vocabulary;
s3, matching each second phrase in the phrase dictionary, and calculating a first matching degree:
if the first matching degree is 1, outputting the matched text as the final inference result;
if the first matching degree is less than 1, outputting the one or more texts with the highest first matching degree as the inference result.
Example 2:
The semantic inference method based on text similarity and keywords further includes, on the basis of embodiment 1, performing the following key-value processing on the first phrase:
defining the first half character and the second half character:
if N is odd, the irreducible word is divided at the central character into a front half and a back half, both of which include the central character;
if N is even, the irreducible word is divided equally into a front half and a back half;
the front half and the back half are each used as keys, the value of the irreducible word's corresponding key in the phrase dictionary is used as the value, and the pairs are stored in the vocabulary dictionary in the form of key-value pairs.
In step S3, the specific method of calculating the first matching degree includes:
s301, sorting the second phrases in text order and assigning each a weight A_1, A_2, …, A_i in turn, with A_1 the largest; where i is the index of the second phrase, and for i greater than or equal to 3, A_i ≤ A_(i-1);
s302, extracting the second phrases one by one and matching them against the phrase dictionary to obtain the matching degree n_1, n_2, …, n_i of each second phrase;
s303, calculating the first matching degree: first matching degree = A_1 × n_1 + A_2 × n_2 + … + A_i × n_i.
Wherein in step S302:
if the second phrase has a direct matching item in the phrase dictionary, the matching degree of the corresponding second phrase is 1;
if the second phrase does not have a direct matching item in the phrase dictionary, character similarity calculation is adopted to determine the matching degree of the corresponding second phrase. The character similarity calculation comprises the following steps:
s3021, splitting the second phrase without the direct matching item into single characters, and matching each character with the single character in each first phrase one by one; if a certain character is successfully matched in a certain first phrase, finishing the matching of the first phrase, marking as 1, and entering the matching with the next first phrase; if the matching fails, marking as 0;
s3022, dividing the number of successfully matched single characters by the total number of characters in the successfully matched second phrase to obtain the similarity of the second phrase to the corresponding first phrase, and extracting the maximum value as the highest similarity;
S3023, updating the weight of the second phrase to (Ai × highest similarity); wherein Ai is the original weight corresponding to the second phrase.
S3024, if all the second phrases do not have direct matching items in the phrase dictionary, directly outputting updated weights of all the second phrases, and calculating a first matching degree;
s3025, if at least one second phrase has a direct matching item in the phrase dictionary, adding the weight difference reduced by all second phrases without the direct matching item to the first second phrase with the direct matching item, outputting the updated weight of all second phrases, and calculating the first matching degree.
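Steps S301–S303 together with the fallback of S3021–S3025 can be sketched as follows (an illustrative Python rendering; the function names and the phrase-dictionary representation are assumptions, and the weight sequence is taken as given). Note that, as the scheme itself states, when no phrase matches directly the reduced weight is left unprocessed, so the result may sum to less than 1:

```python
def char_similarity(phrase, dictionary_phrases):
    """S3021-S3022: a character counts as matched if it occurs anywhere
    in a first phrase; similarity = matched characters / phrase length.
    Returns the highest similarity over all first phrases."""
    best = 0.0
    for first in dictionary_phrases:
        hits = sum(1 for ch in phrase if ch in first)
        best = max(best, hits / len(phrase))
    return best

def first_matching_degree(second_phrases, weights, phrase_dict):
    """S302-S3025: a direct dictionary hit contributes its full weight;
    an unmatched phrase contributes weight * highest similarity (S3023);
    the weight lost that way is given to the first directly matched
    phrase, if any (S3025)."""
    contrib = list(weights)
    first_hit = None
    lost = 0.0
    for i, p in enumerate(second_phrases):
        if p in phrase_dict:
            if first_hit is None:
                first_hit = i
        else:
            sim = char_similarity(p, phrase_dict)
            lost += contrib[i] * (1 - sim)
            contrib[i] *= sim                  # S3023: Ai * highest similarity
    if first_hit is not None:                  # S3025: redistribute lost weight
        contrib[first_hit] += lost
    return sum(contrib)                        # S303: weighted sum
```

With phrases `["xy", "ab"]`, weights `[0.6, 0.4]` and a dictionary containing only `"ab"`, the 0.6 lost by `"xy"` moves to `"ab"`, reproducing the redistribution described in S3025.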
Example 3:
on the basis of any of the above embodiments, when there are a plurality of texts with the highest first matching degree, the semantic inference method based on text similarity and keywords further includes:
matching the second vocabulary in the vocabulary dictionary to obtain a second matching degree;
calculating a final matching degree = (first matching degree × weight of the first matching degree) + (second matching degree × weight of the second matching degree);
and taking the final matching degree as a guess value, and outputting one or more texts with the highest guess value as a guess result.
When the phrase list is an empty set, the weight of the first matching degree is 0, and the weight of the second matching degree is 1;
when the phrase list is not an empty set and the vocabulary list is an empty set, the weight of the first matching degree is 1, and the weight of the second matching degree is 0;
when the phrase list is not an empty set and the vocabulary list is not an empty set, the weight of the first matching degree is 0.6, and the weight of the second matching degree is 0.4.
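These three cases can be captured in a small helper (a sketch; the 0.6/0.4 split is the value fixed in this embodiment, and the function name is an assumption):

```python
def degree_weights(phrase_list, vocab_list):
    """Return (weight of first matching degree, weight of second
    matching degree) according to which of the two lists is empty."""
    if not phrase_list:        # no phrase: keywords decide alone
        return 0.0, 1.0
    if not vocab_list:         # no vocabulary: phrases decide alone
        return 1.0, 0.0
    return 0.6, 0.4            # both present: fixed 0.6 / 0.4 split
```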
Preferably, when there are at least two second vocabularies in the vocabulary list, the second matching degree is calculated by:
if a second vocabulary has a direct matching item in the vocabulary dictionary, the matching degree of the second vocabulary is 1;
if a second vocabulary does not have a direct matching item in the vocabulary dictionary, determining the matching degree of the second vocabulary by adopting character similarity calculation;
the second matching degree = m1×B1 + m2×B2 + … + mk×Bk;
wherein k is the serial number of a second vocabulary in the vocabulary list, Bk is the matching degree of the kth second vocabulary, and mk is the weight of the kth second vocabulary.
Preferably, the method for determining the matching degree of the second vocabulary through character similarity calculation includes:
splitting a second vocabulary without direct matching items into single characters, and matching each character with a single character in each first vocabulary one by one; if a certain character is successfully matched in a certain first vocabulary, the matching of the first vocabulary is finished, and the matching with the next first vocabulary is entered;
dividing the number of successfully matched single characters by the total number of characters in the successfully matched second vocabulary to obtain the similarity of the second vocabulary relative to the corresponding first vocabulary;
extracting the highest similarity, and updating the weight of the second vocabulary to (mk × highest similarity); wherein mk is the original weight corresponding to the second vocabulary;
if all the second vocabularies do not have direct matching items in the vocabulary dictionary, directly outputting updated weights of all the second vocabularies, and calculating the final matching degree;
if at least one second vocabulary has a direct matching item in the vocabulary dictionary, the weight difference values reduced by all the second vocabularies without the direct matching items are added to the first second vocabulary with the direct matching items, and then updated weights of all the second vocabularies are output to calculate the final matching degree.
Preferably, the weights of the second vocabularies are ordered according to their order in the input text, so that the first second vocabulary takes the largest weight and the weights of subsequent second vocabularies decrease gradually, with the sum of the weights of all second vocabularies equal to 1.
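One way to realize this ordering is the weight pattern used for phrases in embodiment 4 (1; 0.6/0.4; 0.6/0.2/0.2; 0.6/0.2/0.1/0.1). The sketch below reproduces that pattern; the patent itself only requires the first weight to be largest, later weights to decrease, and the sum to be 1:

```python
def keyword_weights(n):
    """Weights for n second vocabularies: the first takes 0.6, each
    later one halves the previous, and the last repeats the one before
    it so that the total is exactly 1."""
    if n == 1:
        return [1.0]
    if n == 2:
        return [0.6, 0.4]
    w = [0.6] + [0.4 / 2 ** i for i in range(1, n - 1)]
    w.append(w[-1])            # duplicate the last halving step: sum == 1
    return w
```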
Example 4:
the semantic inference method based on text similarity and keywords comprises the following specific processes:
1. Processing the text in the database:
Before any text is input, the data in the database are processed: the texts stored in the database are segmented. A text may contain phrases of 3 characters or more, defined in this scheme as key descriptions that can indicate the direction of the text; each such phrase is stored as a key-value pair, where the key is the phrase and the value is the position, in the database, of the texts containing the phrase, and the key-value pairs are stored in the database_professional_vocabulary_dic dictionary. Word segmentation results of two characters or fewer are defined in this scheme as keywords for inferring semantics: since several texts may share the same phrase, inference proceeds on the basis of the preceding phrase and is refined by these short words. They are likewise stored, in the form of key-value pairs, in the database_keyword_dic dictionary.
For the keys of database_professional_vocabulary_dic, each of which has 3 characters or more, the processing is as follows. If the number of characters is odd, i.e. not divisible by 2, the central character is taken as the boundary: the characters from the beginning up to and including the central character form the first half, and the characters from the central character (inclusive) to the end form the second half. If the number of characters is divisible by 2, the front half of the characters forms the first half and the rear half forms the second half. Both halves are used as keys, with the value being the value of the corresponding key in database_professional_vocabulary_dic, and are added to database_keyword_dic.
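Assuming a word segmenter is available (here a stand-in `segment` callable; a real implementation would use a segmentation library, and the identifier names are simplified stand-ins for database_professional_vocabulary_dic and database_keyword_dic), the preprocessing above reads roughly:

```python
def build_dictionaries(texts, segment):
    """Build the phrase dictionary (segments of >= 3 characters) and the
    keyword dictionary (segments of <= 2 characters), each mapping a
    segment to the positions of the texts that contain it."""
    phrase_dic, keyword_dic = {}, {}
    for pos, text in enumerate(texts):
        for seg in segment(text):
            target = phrase_dic if len(seg) >= 3 else keyword_dic
            target.setdefault(seg, []).append(pos)
    # half-character keys of every phrase also go into the keyword
    # dictionary, reusing the phrase's position list
    for phrase, positions in list(phrase_dic.items()):
        mid = len(phrase) // 2
        if len(phrase) % 2:    # odd: central character belongs to both halves
            halves = (phrase[:mid + 1], phrase[mid:])
        else:                  # even: equal front and rear parts
            halves = (phrase[:mid], phrase[mid:])
        for h in halves:
            keyword_dic.setdefault(h, []).extend(positions)
    return phrase_dic, keyword_dic
```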
2. Detecting a user input:
When a user inputs text, the input is segmented through a word segmentation library and the segmentation results are classified: phrases of 3 characters or more, taken to indicate the general direction the user wants to query, are classified into professional_vocabulary_list, while segmentation results of 2 characters or fewer are taken to be words and keywords describing the phrases in professional_vocabulary_list and are classified into keyword_list.
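This classification step can be sketched as (function name and the stand-in `segment` callable are assumptions):

```python
def classify_input(user_text, segment):
    """Split the segmented user input into a phrase list (segments of
    >= 3 characters) and a keyword list (segments of <= 2 characters)."""
    professional_vocabulary_list, keyword_list = [], []
    for seg in segment(user_text):
        (professional_vocabulary_list if len(seg) >= 3 else keyword_list).append(seg)
    return professional_vocabulary_list, keyword_list
```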
3. Calculating the matching degree:
After the user input is processed, two lists are obtained: professional_vocabulary_list and keyword_list. Of the two, professional_vocabulary_list carries more importance than keyword_list: since strongly related text is being analysed, the phrases of 3 characters or more in professional_vocabulary_list are taken to indicate the main direction of the user input, so the data in that list are processed first. For the data in professional_vocabulary_list, the first phrase is regarded as the main phrase. For example, "C language programming" is segmented into "C language" and "programming": "C language" serves as the main phrase for semantic inference and "programming" as an auxiliary one, and the proportions can be adjusted as needed. In this scheme the first phrase accounts for 0.6 and the second phrase for 0.4, with each subsequent phrase at 50 percent of the preceding one; that is, the weights are 1 for a single phrase, 0.6/0.4 for two, 0.6/0.2/0.2 for three, 0.6/0.2/0.1/0.1 for four, and so on.
The values in professional_vocabulary_list are then extracted and matched in the phrase storage area built from the database; a direct match is recorded as 1, and if several phrases in the list are to be matched the result is computed according to the above proportions. The input phrase may not be directly equal to any phrase in the database, so fuzzy matching by edit distance could be used here; because the text has been segmented and its length thereby reduced, the amount of computation under the same deviation is smaller. This scheme instead proposes an optimization, based on the concept of the Jaccard similarity coefficient, for calculating text similarity.
For a phrase in professional_vocabulary_list, its characters are operated on individually: each character is matched against the characters of each phrase in the database. A successful match is recorded as 1; as soon as the same character is found the procedure returns and moves on to the next phrase. After all characters of the phrase in professional_vocabulary_list have been processed, the sum of the character match counts is divided by the length of the phrase input by the user, giving the character-level similarity; the highest similarity is taken as the result closest to what the user wants. This similarity must still be folded into the earlier proportion. Taking the case of two phrases in professional_vocabulary_list with proportions 0.6/0.4: if the first phrase cannot be directly matched and its highest similarity is 2/3, its proportion becomes 0.6 × 2/3 = 0.4. The same applies when the latter phrase cannot be directly matched. If, however, the latter phrase can be directly matched, the 0.2 by which the first phrase's proportion shrank is added to it, so the proportions become 0.4/0.6; multiple phrases are handled analogously. If several phrases in professional_vocabulary_list can be directly matched, the reduced proportion is added to the first directly matchable phrase. If no phrase can be directly matched, the reduced proportion is left unprocessed, i.e. the proportions may no longer sum to one. Processing the phrases in this way yields the text with the highest similarity and its position in the database. Both a unique text and multiple texts are possible; a unique text is taken as the final result without further processing.
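The character-by-character matching just described can be sketched as follows (an illustrative Python helper; the function name and the mapping of database phrases to position lists are assumptions):

```python
def best_phrase_candidates(user_phrase, phrase_dic):
    """Per-character matching: a character counts as matched if it
    occurs anywhere in a database phrase; similarity = matched
    characters / length of the user's phrase.  Returns the highest
    similarity and the text positions of all phrases reaching it."""
    best, positions = 0.0, []
    for phrase, pos in phrase_dic.items():
        sim = sum(ch in phrase for ch in user_phrase) / len(user_phrase)
        if sim > best:
            best, positions = sim, list(pos)
        elif sim == best and sim > 0:
            positions.extend(pos)
    return best, positions
```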
If multiple texts remain, keyword_list is matched. Since professional_vocabulary_list matching has produced target texts with a high matching degree, the words of 2 characters or fewer in those target texts are also available. The vocabulary input by the user is then matched through keyword_list, where the earlier phrase matching degree accounts for 0.6 of the final matching degree and keyword_list for 0.4 (if professional_vocabulary_list is empty, keyword_list accounts for 1). If keyword_list contains only one datum and it can be directly matched, the keyword_list matching degree is 0.4, which is added to the 0.6 from professional_vocabulary_list to give the final guess value. If keyword_list contains more data, the proportions are computed as for professional_vocabulary_list, each later word taking 50 percent of the preceding one. Words that cannot be directly matched are matched character by character, in the same way as for professional_vocabulary_list. Finally, the matching degrees of the two lists are added to obtain the final guess value, and the unique text or the multiple texts with the highest guess value are taken as the result.
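Combining the two matching degrees into a final guess value can be sketched as follows (a hedged illustration; the function name, default 0.6/0.4 split, and the dictionaries mapping text position to per-list matching degree are assumptions):

```python
def guess(positions, first_degree, second_degree, w1=0.6, w2=0.4):
    """Final guess value per candidate text position:
    w1 * first matching degree + w2 * second matching degree.
    Returns all positions sharing the highest value, and that value."""
    final = {p: w1 * first_degree.get(p, 0.0) + w2 * second_degree.get(p, 0.0)
             for p in positions}
    top = max(final.values())
    return [p for p, v in final.items() if v == top], top
```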
4. Outputting:
After the text with the highest guess value is obtained, its corresponding position in the database can be found and the text can then be displayed to the user.
The present scheme is illustrated by the following cases:
The same database is used in each of the following cases; the raw data in the database are: "C language programming 1A", "C language programming 2A", "discrete structure", "accounting", "engineering economics", "advanced mathematics", "Mercedes-Benz", "Java programming". Figures 2-10 are code run screenshots of the following scenarios, respectively.
Case 1:
inputting a text: c language
Target text: "C language programming 1A" and "C language programming 2A"
First, word segmentation is performed on the input text. When the input is "C language", the word segmentation result is "C language", and the user's segmentation results are stored separately. Since the only segmentation result, "C language", has 3 or more characters, it is stored in userinput_professional_vocabulary_list and compared with the keys of the processed database_professional_vocabulary_dic. Through the data processing in the database, "C language programming 1A" and "C language programming 2A" have been segmented into "C language", "programming", "1A" and "C language", "programming", "2A" respectively. The user-input "C language" can therefore be directly matched with a key in database_professional_vocabulary_dic, and since the user input contains no words of two characters or fewer, "C language programming 1A" and "C language programming 2A" are the final guess result, as shown in fig. 2.
In the guess result shown in fig. 2, [0, 1] indicates the database positions corresponding to "C language programming 1A" and "C language programming 2A". The list above [0, 1] is the inferred matching degree; since the user input has only one word segmentation result and it can be directly matched, the highest guess value is 1.
Case 2:
inputting a text: c language programming
Target text: "C language programming 1A" and "C language programming 2A"
The text input by the user is segmented into "C language" and "programming". The calculation rule is that the first phrase, regarded as the main phrase, takes 0.6 and the second phrase 0.4; if there are more than two phrases, then from the 3rd phrase onwards each phrase takes half of the preceding weight, i.e. the proportions are 0.6/0.2/0.2 for 3 phrases, 0.6/0.2/0.1/0.1 for 4 phrases, and so on. Since both "C language" and "programming" correspond to keys in the processed database that can be directly matched, the text input by the user can be matched directly.
Since "programming" occurs not only in "C language programming", a guess value of 0.4 also appears for other texts; but as those do not contain "C language", only the guessed matching degree of "C language programming 1A" and "C language programming 2A" is 1, and they are output as the highest value, as shown in fig. 3.
Case 3:
inputting a text: c language politics
Target text: "C language programming 1A" and "C language programming 2A"
Because the database has no key corresponding to "politics" (its keys include "engineering economics" and "accounting"), "politics" is processed character by character: each of its characters is matched against the keys of database_professional_vocabulary_dic in the incoming data; if the key at a certain position contains the character, that position is marked 1, otherwise 0. This yields one matching list per character of "politics" over the corresponding positions; the 3 lists are added, and the result is divided to obtain a matching degree. The first two characters of "politics" cannot be matched anywhere in the database, while its final character "-ology" can be matched in "advanced mathematics", "engineering economics" and "accounting". Since only that one character of the user-input "politics" matches, the matching degree at the three database positions "advanced mathematics", "accounting" and "engineering economics" is one third (the matching degree of the text at a position is the number of its characters matched in the user input divided by the total number of characters input by the user), and these keys are added to userinput_professional_vocabulary_list as new data. At this point the weight ratio changes: the proportion of a value that cannot be directly matched must be multiplied by its highest matching degree to serve as its final proportion, and if userinput_professional_vocabulary_list contains directly matchable items, the reduced value is added to the first of them.
Since the matching degree of "politics" against "accounting", "engineering economics" and "advanced mathematics" is 1/3 and its original proportion is 0.4, its final proportion is 0.4 × 1/3; and since "C language" can be directly matched and is the first directly matchable phrase, the weight by which "politics" was reduced (0.4 − 0.4 × 1/3) is added to the original 0.6 weight of "C language" as its final proportion. The final output is shown in fig. 4.
Case 4:
inputting a text: c language 1A
Target text: c language programming 1A
The text input by the user is segmented into "C language" and "1A"; "C language" is stored as a phrase into userinput_professional_vocabulary_list, and "1A" as a word into userinput_keyword_list. The phrase is processed first: "C language programming 1A" and "C language programming 2A" are found through "C language", their positions in the database are returned as the result, and the keys whose values contain the corresponding subscripts are found in database_keyword_dic. These are the keys that userinput_keyword_list needs to match. If userinput_keyword_list contained several words to match, their proportions would be calculated in the same way as before. Since the user input here contains both a phrase and a word, this embodiment sets the phrase proportion to 0.6 and the word proportion to 0.4; and since "C language programming 2A" does not contain the word "1A", the final result is "C language programming 1A", as shown in fig. 5.
Case 5:
inputting a text: c language 2
Target text: c language programming 2A
The text input by the user is segmented into "C language" and "2"; "C language" is stored as a phrase into userinput_professional_vocabulary_list and "2" as a vocabulary into userinput_keyword_list. The phrase is processed first: "C language programming 1A" and "C language programming 2A" are found through "C language", their positions in the database are returned as the result, and the keys whose values contain the corresponding subscripts are found in database_keyword_dic; these are the keys that userinput_keyword_list needs to match. The word "2" in the user input corresponds to the word "2A" in "C language programming 2A": one character of the user's word is matched and the word's length is 1, so its similarity is calculated as 1. Since the user input contains both a phrase and a vocabulary, the phrase proportion is 0.6 and the vocabulary proportion 0.4; "C language programming 1A" matches only the phrase, for a final similarity of 0.6, while "C language programming 2A" also matches the user's vocabulary, so, as shown in fig. 6, its final similarity is 1 and it is output as the final result.
Case 6:
Inputting a text: Benz
Target text: Mercedes-Benz
The text input by the user is processed; the word segmentation result is "Benz", so vocabulary key-value matching is entered. One datum in the database is "Mercedes-Benz", whose word segmentation result is "Mercedes" and "Benz", so "Benz" appears among the vocabulary key-value pairs. Since the text input by the user is the word "Benz" and the match succeeds, the matching degree is calculated as 1, and the database datum corresponding to "Benz", namely "Mercedes-Benz", is output as shown in fig. 7.
Case 7:
Inputting a text: BMW
Target text: none
The text input by the user is processed and the word segmentation result is "BMW"; vocabulary matching is entered, but since the two characters "Bao" and "Ma" are not reflected in any vocabulary key in the database, the matching degree of both characters is 0. As the matching degree of all data in the database is 0, no data is output, as shown in fig. 8.
Case 8:
Inputting a text: local customs BMW
Target text: none
The text input by the user is processed and the word segmentation results are "local customs" and "BMW". Phrase matching is entered first: since the four characters of "local customs" are not reflected in any phrase key in the database, its matching degree is 0, and as the matching degree of all data in the database is 0 the phrase is considered invalid. Vocabulary matching then finds that the two characters "Bao" and "Ma" are likewise not reflected in the vocabulary key-value pairs, so the matching degree is 0. Since the matching degrees of all data in the database are 0, no data is output, as shown in fig. 9.
Case 9:
Inputting a text: local customs Benz
Target text: Mercedes-Benz
The text input by the user is processed and the word segmentation results are "local customs" and "Benz". Phrase matching is entered first: since the four characters of "local customs" are not reflected in any phrase key in the database, its matching degree is 0; as the matching degree of all data in the database is 0, the phrase is considered invalid, the situation is treated as if no phrase had been input, and the phrase is eliminated. In the vocabulary matching, "Benz" in the vocabulary key-value pairs corresponds directly to the word "Benz" of the database text "Mercedes-Benz", so, as shown in fig. 10, the vocabulary similarity result is 1 and the text "Mercedes-Benz" is output.
Case 10:
Inputting a text: C language BMW
Target text: "C language programming 1A" and "C language programming 2A"
The text input by the user is processed and the word segmentation results are "C language" and "BMW". The phrase is processed first, yielding the matched texts "C language programming 1A" and "C language programming 2A". Since the user input contains a phrase, the corresponding words are found through the values of "C language programming 1A" and "C language programming 2A" in the database; but because the two characters "Bao" and "Ma" are not reflected in the vocabulary key-value pairs, the vocabulary matching degree is 0 and the final matching degree is determined by the phrase alone. As shown in fig. 11, the output results are "C language programming 1A" and "C language programming 2A".
The above embodiments show that, by detecting and segmenting the text input by the user, the amount of computation can be reduced to a certain extent compared with existing techniques such as fuzzy matching, avoiding the waste of computation incurred when a short text is matched against a long one; and for text that word segmentation cannot handle, character-by-character matching determines whether each character matches each word, refining and optimizing the calculation of the Jaccard similarity coefficient and providing an alternative to the edit distance algorithm for computing the similarity of two words.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. In addition, the term "connected" used herein may be directly connected or indirectly connected via other components without being particularly described.

Claims (8)

1. The semantic inference method based on text similarity and keywords is characterized by comprising the following steps:
s1, performing word segmentation processing on the text stored in the database to obtain a plurality of irrevocable words, and judging the number N of the characters of each irrevocable word in the database:
if N is more than or equal to 3, storing the irreparable word in a phrase dictionary in a key-value pair mode; defining an irrevocable word in a phrase dictionary as a first phrase;
if N is less than or equal to 2, storing the irreducible segmentation words in a vocabulary dictionary in a key value pair mode; defining an irrevocable word in a vocabulary dictionary as a first vocabulary;
the method also comprises the following key value processing of the first phrase:
defining the first half character and the second half character:
if N is odd, dividing the irrevocable word at its central character into a first half character and a second half character, each including the central character;
if N is even, dividing the irrevocable word equally into a front part and a rear part, which serve as the first half character and the second half character respectively;
respectively taking the first half character and the second half character as keys, taking as the value the value of the irrevocable word's corresponding key in the phrase dictionary, and storing them in the vocabulary dictionary in the form of key-value pairs;
s2, performing word segmentation processing on the input text of the user to obtain a plurality of irrelative words, and judging the number M of characters of each irrelative word in the input text:
if M is more than or equal to 3, storing the irreducible word segmentation in a phrase list in a key value pair mode; defining an irrevocable word in the phrase list as a second phrase;
if M is less than or equal to 2, storing the irreparable word in a vocabulary list in a key-value pair form; defining an irrevocable word in the vocabulary list as a second vocabulary;
s3, matching the second phrase in the phrase dictionary, and calculating a first matching degree:
if the first matching degree is 1, outputting a matching text as a final guess result;
if the first matching degree is less than 1, outputting one or more texts with the highest first matching degree as a conjecture result;
when there are a plurality of texts with the highest first matching degree, the method further includes:
s4, matching the second vocabulary in the vocabulary dictionary to obtain a second matching degree;
S5, calculating a final matching degree = (first matching degree × weight of the first matching degree) + (second matching degree × weight of the second matching degree);
s6, the final matching degree is used as the guess value, and one or more texts with the highest guess value are output as the guess result.
2. The semantic inference method based on text similarity and keywords according to claim 1, wherein in step S3, the method for calculating the first matching degree comprises:
S301, sorting the second phrases according to the text order, and sequentially setting the weight of each second phrase to A1, A2, …, Ai, with A1 the largest; wherein i is the second phrase number; and when i is greater than or equal to 3, Ai ≤ A(i-1);
S302, extracting the second phrases one by one, and matching them with the phrase dictionary to obtain the matching degree n1, n2, …, ni of each second phrase;
S303, calculating the first matching degree: N = A1×n1 + A2×n2 + … + Ai×ni.
3. The semantic inference method based on text similarity and keywords according to claim 2, wherein in step S302:
if the second phrase has a direct matching item in the phrase dictionary, the matching degree of the corresponding second phrase is 1;
if the second phrase does not have a direct matching item in the phrase dictionary, character similarity calculation is adopted to determine the matching degree of the corresponding second phrase.
4. The semantic inference method based on text similarity and keywords according to claim 3, wherein the character similarity calculation comprises the following method:
s3021, splitting the second phrase without the direct matching item into single characters, and matching each character with the single character in each first phrase one by one; if a certain character is successfully matched in a certain first phrase, finishing the matching of the first phrase and entering the matching with the next first phrase;
s3022, dividing the number of successfully matched single characters by the total number of characters in the successfully matched second phrase to obtain the similarity of the second phrase relative to the corresponding first phrase, and extracting the maximum value of the similarity as the highest similarity;
S3023, updating the weight of the second phrase to Ai × highest similarity; wherein Ai is the original weight corresponding to the second phrase.
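One reading of the character-level similarity in claim 4 (steps S3021 to S3023) is sketched below; the identifier names are our own, and the per-character membership test is an interpretation of the one-by-one matching described in S3021:

```python
def phrase_similarity(second_phrase, first_phrase):
    # S3021/S3022: count the second phrase's characters that occur in the
    # first phrase, divided by the total characters in the second phrase.
    matched = sum(1 for ch in second_phrase if ch in first_phrase)
    return matched / len(second_phrase)

def highest_similarity(second_phrase, first_phrases):
    # S3022: take the maximum similarity over all first phrases.
    return max(phrase_similarity(second_phrase, fp) for fp in first_phrases)

def updated_weight(original_weight, second_phrase, first_phrases):
    # S3023: scale the phrase's original weight Ai by the highest similarity.
    return original_weight * highest_similarity(second_phrase, first_phrases)
```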
5. The method of claim 4, wherein the calculating the first matching degree further comprises:
s3024, if all the second phrases do not have direct matching items in the phrase dictionary, directly outputting updated weights of all the second phrases, and calculating a first matching degree;
s3025, if at least one second phrase has a direct matching item in the phrase dictionary, adding the weight difference reduced by all second phrases without the direct matching item to the first second phrase with the direct matching item, outputting the updated weight of all second phrases, and calculating the first matching degree.
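Claim 5's weight redistribution (the weight lost by scaling non-matching phrases is credited to the first directly matched phrase) could look like this; a sketch with illustrative names, taking parallel lists as input:

```python
def redistribute_weights(weights, has_direct_match, similarities):
    # S3024/S3025: a phrase without a direct match keeps weight Ai * similarity;
    # the total weight lost is added to the first directly matched phrase, if any.
    new_weights, lost = [], 0.0
    for w, direct, sim in zip(weights, has_direct_match, similarities):
        if direct:
            new_weights.append(w)
        else:
            new_weights.append(w * sim)
            lost += w * (1.0 - sim)
    if any(has_direct_match):
        new_weights[has_direct_match.index(True)] += lost
    return new_weights
```

When at least one direct match exists, the total weight is conserved; with no direct matches (S3024), the scaled weights are output unchanged.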
6. The semantic inference method based on text similarity and keywords according to claim 5,
when the phrase list is an empty set, the weight of the first matching degree is 0, and the weight of the second matching degree is 1;
when the phrase list is not an empty set and the vocabulary list is an empty set, the weight of the first matching degree is 1, and the weight of the second matching degree is 0;
when the phrase list is not an empty set and the vocabulary list is not an empty set, the weight of the first matching degree is 0.6, and the weight of the second matching degree is 0.4.
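Claim 6's weight selection is a simple three-way rule. As a sketch (function name is our own):

```python
def degree_weights(phrase_list, vocab_list):
    # Claim 6: returns (weight of first matching degree, weight of second matching degree).
    if not phrase_list:
        return 0.0, 1.0   # empty phrase list: rely entirely on the vocabulary match
    if not vocab_list:
        return 1.0, 0.0   # empty vocabulary list: rely entirely on the phrase match
    return 0.6, 0.4       # both present: the phrase match dominates slightly
```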
7. The semantic inference method based on text similarity and keywords according to claim 6,
when at least two second vocabularies exist in the vocabulary list, the second matching degree is calculated by the following method:
if a second vocabulary has a direct matching item in the vocabulary dictionary, the matching degree of the second vocabulary is 1;
if a second vocabulary does not have a direct matching item in the vocabulary dictionary, determining the matching degree of the second vocabulary by adopting character similarity calculation;
computing the second matching degree:

second matching degree = B1×m1 + B2×m2 + … + Bk×mk

wherein k is the number of second vocabularies in the vocabulary list, Bk is the matching degree of the kth second vocabulary, and mk is the weight of the kth second vocabulary.
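Claim 7's second matching degree mirrors the weighted sum of claim 2, now over vocabularies. A sketch with hypothetical names:

```python
def second_matching_degree(degrees, weights):
    # Claim 7: sum over k of Bk * mk, where Bk is the matching degree of the
    # k-th second vocabulary and mk is its weight.
    return sum(b * m for b, m in zip(degrees, weights))
```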
8. The semantic inference method based on text similarity and keywords according to claim 7, wherein the method for determining the matching degree of the second vocabulary through character similarity calculation comprises:
splitting a second vocabulary without direct matching items into single characters, and matching each character with a single character in each first vocabulary one by one; if a certain character is successfully matched in a certain first vocabulary, the matching of the first vocabulary is finished, and the matching with the next first vocabulary is entered;
dividing the number of successfully matched single characters by the total number of characters in the successfully matched second vocabulary to obtain the similarity of the second vocabulary relative to the corresponding first vocabulary;
extracting the highest similarity, and updating the weight of the second vocabulary to mk × highest similarity; wherein mk is the original weight corresponding to the second vocabulary;
if all the second vocabularies do not have direct matching items in the vocabulary dictionary, directly outputting updated weights of all the second vocabularies, and calculating the final matching degree;
if at least one second vocabulary has a direct matching item in the vocabulary dictionary, the weight difference reduced by all the second vocabularies without the direct matching item is added to the first second vocabulary with the direct matching item, and then the updated weights of all the second vocabularies are output, and the final matching degree is calculated.
CN202210069600.8A 2022-01-21 2022-01-21 Semantic speculation method based on text similarity and keywords Active CN114490932B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210069600.8A CN114490932B (en) 2022-01-21 2022-01-21 Semantic speculation method based on text similarity and keywords


Publications (2)

Publication Number Publication Date
CN114490932A (en) 2022-05-13
CN114490932B (en) 2022-08-23

Family

ID=81472352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210069600.8A Active CN114490932B (en) 2022-01-21 2022-01-21 Semantic speculation method based on text similarity and keywords

Country Status (1)

Country Link
CN (1) CN114490932B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2400400A1 (en) * 2010-06-22 2011-12-28 Inbenta Professional Services, S.L. Semantic search engine using lexical functions and meaning-text criteria
CN109597994B (en) * 2018-12-04 2023-06-06 挖财网络技术有限公司 Short text problem semantic matching method and system
CN110110744A (en) * 2019-03-27 2019-08-09 平安国际智慧城市科技股份有限公司 Text matching method, device and computer equipment based on semantic understanding

Also Published As

Publication number Publication date
CN114490932A (en) 2022-05-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant