CN117972025B

CN117972025B - Massive text retrieval matching method based on semantic analysis

Info

Publication number: CN117972025B
Application number: CN202410386961.4A
Authority: CN
Inventors: 董莎; 马成英; 严浩; 郑智剑; 叶名辰; 郑宗波; 徐芬; 李元丽
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2024-04-01
Filing date: 2024-04-01
Publication date: 2024-06-07
Anticipated expiration: 2044-04-01
Also published as: CN117972025A

Abstract

The invention belongs to the technical field of text retrieval matching, and particularly relates to a massive text retrieval matching method based on semantic analysis.

Description

Massive text retrieval matching method based on semantic analysis

Technical Field

Background

In today's society, where internet and digital technologies are developing rapidly, a large amount of information is generated and accumulated in every field, and finding specific, useful information among it becomes difficult. Search platforms emerged in response. However, as users demand ever more information types and topics, a search platform must provide text data that is as diverse and comprehensive as possible, which produces massive text data. Efficiently retrieving the information a user needs from this massive text data has become a core challenge for search platforms.

However, when the prior art addresses efficient information retrieval over massive text data, attention usually falls on the text matching algorithm, with the aim of improving retrieval efficiency by refining how that algorithm processes text. Information retrieval is a multi-step process, though, and text matching is only one important step of it. Retrieval also includes preprocessing before text matching: text preprocessing removes useless information by cleaning, normalizing and converting the text data, which reduces the computation and time consumed by subsequent text matching. If text preprocessing is slow, the subsequent text matching step is held up, especially with massive text data, so optimizing text preprocessing efficiency is equally important. Moreover, if retrieval efficiency is pursued only through algorithm optimization and the efficiency of the other retrieval steps is ignored, a comprehensive improvement of the system may not be achieved.

In addition, when the prior art performs semantic similarity matching on preprocessed text data, it generally selects one general-purpose, high-accuracy matching algorithm for all texts, without considering which text hierarchy type each matching algorithm is suited to. This easily leads to a mismatch between algorithm and text, causing reduced matching efficiency, inaccurate matching results and incomplete matching, which in turn degrades the matching effect.

Disclosure of Invention

Therefore, an object of the embodiments of the application is to provide a massive text retrieval matching method based on semantic analysis, which improves massive text retrieval efficiency by focusing on text preprocessing and selects the text matching algorithm in a targeted manner, thereby effectively solving the problems mentioned in the background art.

The aim of the invention can be achieved by the following technical scheme: a massive text retrieval matching method based on semantic analysis comprises the following steps:

S1, receiving a search instruction, extracting the request question currently input in the search platform, and extracting subject terms from the request question.

S2, comparing the subject marked on each text data stored in the associated knowledge base of the search platform with the subject terms of the request question, and screening out the text data that conform to the subject terms of the request question as candidate text data.

S3, grouping the candidate text data to obtain a plurality of group sets, and carrying out common characteristic identification on each group set.

S4, preprocessing the candidate text data in each group set according to the common characteristic of that group set to obtain the processed candidate text data of each group set.

S5, obtaining the content credibility, uploading time and historical access frequency of each candidate text data in each group set, and thereby determining the text matching order of the group sets.

S6, carrying out hierarchy type analysis on the candidate text data in each group set to obtain the hierarchy type of each candidate text data.

S7, selecting an adapted similarity algorithm based on the hierarchy type of each candidate text data.

S8, sequentially calling the candidate text data in each group set according to the text matching order of the group sets, and matching the called candidate text data against the request question with the adapted similarity algorithm to obtain the semantic similarity of each candidate text data.

S9, selecting the search result for the request question based on the semantic similarity of each candidate text data in each group set, and displaying the search result in the search output box.

As a further optimization of the above scheme, the subject terms of the request question are extracted as follows: the request question is segmented into words, and stop words are removed from the segmented words to obtain a plurality of effective words.

The part of speech of each effective word is acquired, and key words are screened out from the effective words according to part of speech.

The key words are compared to identify whether any are repeated; if so, the repeated key words are de-duplicated to obtain the subject terms.
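The extraction steps described here (segmentation, stop word removal, part-of-speech screening, de-duplication) can be sketched as follows. This is a minimal illustration only: the stop word list, the tiny part-of-speech lexicon and the function name `extract_subject_terms` are hypothetical stand-ins, and a real system would presumably use a proper word segmenter and part-of-speech tagger.

```python
import re

# Hypothetical stop word list and part-of-speech lexicon, for illustration only.
STOP_WORDS = {"how", "do", "i", "a", "the", "to", "in", "of"}
POS_LEXICON = {
    "train": "verb",
    "retrieval": "noun",
    "model": "noun",
    "quickly": "adverb",
}
# Parts of speech assumed to be able to express a subject.
TOPIC_POS = {"noun", "verb", "proper_noun"}


def extract_subject_terms(question):
    """Return de-duplicated key words of a request question, in order of appearance."""
    # Word segmentation (a whitespace/punctuation split stands in for a real segmenter).
    words = re.findall(r"[a-z]+", question.lower())
    # Stop word removal yields the effective words.
    effective = [w for w in words if w not in STOP_WORDS]
    # Part-of-speech screening yields the key words.
    keywords = [w for w in effective if POS_LEXICON.get(w) in TOPIC_POS]
    # De-duplication while preserving first-seen order.
    return list(dict.fromkeys(keywords))


print(extract_subject_terms("How do I train a retrieval model, the model quickly?"))
```

Using `dict.fromkeys` keeps the first occurrence of each repeated key word, which is one simple way to de-duplicate without losing order.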

As a further optimization of the above scheme, the candidate text data are grouped as follows: the uploading time of each candidate text data is acquired, and the candidate text data are numbered in order of uploading time from earliest to latest.

The attribution field and the text length of each candidate text data are extracted.

Each candidate text data is subjected to sentence segmentation, whereby the number of segmented sentences is counted.

Classifying the candidate text data according to the same attribution field, the same text length and the same sentence number to obtain the candidate text data corresponding to the attribution fields, the text lengths and the sentence numbers.

Taking the attribution field, the text length and the sentence number as classification labels, counting the attribution field number, the text length number and the sentence number obtained by classification, and sequentially extracting the attribution field, the text length and the sentence number according to the classification labels to form a joint label set, so as to obtain a plurality of joint label sets.

The numbers of the candidate text data under the labels of the same joint label set are compared, and the candidate text data whose numbers are common to all of those labels are extracted to form a group set.
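One possible reading of this grouping procedure is sketched below: candidates are numbered by uploading time, classified separately under each label, and a group set is the intersection of the numbers found under the three labels of a joint label set. The record keys (`uploaded`, `field`, `length`, `sentences`) and the function name are hypothetical, and treating the text length as a pre-bucketed value is an assumption of this sketch.

```python
from collections import defaultdict
from itertools import product


def group_candidates(candidates):
    """Number candidates by upload time and group them by joint label set.

    Returns (numbering, groups): numbering maps number -> candidate id,
    groups maps (field, length, sentences) -> sorted list of numbers.
    """
    ordered = sorted(candidates, key=lambda c: c["uploaded"])
    numbering = {}
    by_field, by_length, by_sentences = defaultdict(set), defaultdict(set), defaultdict(set)
    for number, cand in enumerate(ordered, start=1):
        numbering[number] = cand["id"]
        by_field[cand["field"]].add(number)
        by_length[cand["length"]].add(number)
        by_sentences[cand["sentences"]].add(number)

    groups = {}
    # Every combination of one field, one length and one sentence count is a joint label set.
    for field, length, sentences in product(by_field, by_length, by_sentences):
        common = by_field[field] & by_length[length] & by_sentences[sentences]
        if common:  # keep only joint label sets that actually contain candidates
            groups[(field, length, sentences)] = sorted(common)
    return numbering, groups


docs = [
    {"id": "d1", "uploaded": "2024-01-03", "field": "news", "length": "short", "sentences": 2},
    {"id": "d2", "uploaded": "2024-01-01", "field": "news", "length": "short", "sentences": 2},
    {"id": "d3", "uploaded": "2024-01-02", "field": "law", "length": "long", "sentences": 5},
]
numbering, groups = group_candidates(docs)
print(groups)
```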

As a further optimization of the above scheme, the common characteristic of each group set is identified as follows: each candidate text data in each group set is segmented into words, the segmented text is traversed against the stop word list, the number of stop words is counted, and the ratio of the number of stop words to the total number of segmented words is calculated to obtain the stop word occupation ratio of each candidate text data.

The text character string of each candidate text data in each group set is traversed, the number of symbols in the text is counted, and the ratio of the number of symbols to the text length is calculated to obtain the symbol occupation ratio of each candidate text data.
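The two ratios just described can be sketched in a few lines. The stop word list is a hypothetical placeholder, the regular expression stands in for a real word segmenter, and counting whitespace toward the text length is an assumption of this sketch rather than something the text specifies.

```python
import re

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "on", "a", "of"}


def stop_word_ratio(text):
    """Number of stop words divided by the total number of segmented words."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    return sum(1 for w in words if w in STOP_WORDS) / len(words)


def symbol_ratio(text):
    """Number of symbol characters divided by the text length in characters."""
    if not text:
        return 0.0
    # A symbol here is any character that is neither alphanumeric nor whitespace.
    symbols = sum(1 for ch in text if not ch.isalnum() and not ch.isspace())
    return symbols / len(text)


sample = "the cat sat on the mat!"
print(stop_word_ratio(sample), symbol_ratio(sample))
```

Both functions return values between 0 and 1, which is what later allows the features to be ranked against each other.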

Part-of-speech tagging and root tagging are carried out on the segmented words of each candidate text data in each group set. The segmented words sharing the same root are classified together to obtain the segmented word set of each root; the parts of speech of the segmented words within each set are then compared and repeated parts of speech are de-duplicated, giving the number of parts of speech in the segmented word set of each root, from which the part-of-speech diversity of each candidate text data is calculated. The calculation uses, for each root, the number of parts of speech and the number of segmented words in its segmented word set; the roots are numbered from 1 up to the number of roots present in the segmented words of the candidate text data; and the expression also involves the natural constant e.

The stop word occupation ratio, symbol occupation ratio and part-of-speech diversity of each candidate text data in the same group set are arranged in descending order to obtain the label ordering of each candidate text data.

The label orderings of the candidate text data in each group set are compared, the candidate text data with the same label ordering are gathered together, and the label ordering that occurs most frequently is taken as the common characteristic of the group set.
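The ranking and majority vote described here can be sketched as follows. The feature names and the helper names `label_ordering` and `common_characteristic` are hypothetical, and the part-of-speech diversity values are simply supplied as inputs because their formula is not reproduced in this text.

```python
from collections import Counter


def label_ordering(features):
    """Rank the feature names of one candidate text from largest to smallest value."""
    return tuple(sorted(features, key=features.get, reverse=True))


def common_characteristic(group):
    """Return the label ordering that occurs most often within a group set."""
    counts = Counter(label_ordering(features) for features in group)
    return counts.most_common(1)[0][0]


group = [
    {"stop_word_ratio": 0.40, "symbol_ratio": 0.05, "pos_diversity": 0.20},
    {"stop_word_ratio": 0.35, "symbol_ratio": 0.10, "pos_diversity": 0.30},
    {"stop_word_ratio": 0.10, "symbol_ratio": 0.25, "pos_diversity": 0.15},
]
print(common_characteristic(group))
```

If two orderings tie for the highest count, `Counter.most_common` returns the one encountered first; how the method itself resolves such ties is not stated in the text.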

As a further optimization of the above scheme, preprocessing the candidate text data in each group set according to the common characteristic of that group set comprises the following operations: the execution order of the preprocessing flow of each group set is determined according to the common characteristic of that group set, wherein the preprocessing flow comprises stop word removal, character removal and stemming.

The candidate text data in each group set are preprocessed according to the execution order of the preprocessing flow.

As a further optimization of the above scheme, the historical access frequency is acquired as follows: the uploading times of the candidate text data in each group set are compared to obtain the earliest uploading time, and the interval from that time to the current time forms the access period of the group set.

The number of accesses of each candidate text data within the access period of its group set is counted and divided by the total number of accesses of all candidate text data in the group set, giving the historical access frequency of each candidate text data.
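The normalisation in this step amounts to dividing each access count by the group total. A minimal sketch, assuming the per-document access counts within the access period have already been collected (the function name and the dictionary layout are hypothetical):

```python
def historical_access_frequency(access_counts):
    """Normalise per-document access counts by the total for the group set."""
    total = sum(access_counts.values())
    if total == 0:
        # No accesses recorded in the access period: every frequency is zero.
        return {doc: 0.0 for doc in access_counts}
    return {doc: count / total for doc, count in access_counts.items()}


print(historical_access_frequency({"d1": 30, "d2": 10}))
```

Because the frequencies within one group set sum to 1, they are comparable inside a group set but not directly across group sets of different sizes.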

As a further optimization of the above scheme, determining the text matching order of the group sets comprises the following steps: the attribution field of each candidate text data in each group set is matched against the preset content credibility of each attribution field to obtain the content credibility of each candidate text data.

The uploading time of each candidate text data in each group set is compared with the current time, and the content timeliness of each candidate text data is calculated by a specific expression.

The content credibility, content timeliness and historical access frequency of each candidate text data in each group set are imported into an evaluation expression to obtain the matching value degree of each candidate text data in each group set.

The matching value degrees of the candidate text data in each group set are substituted into a formula to calculate the matching dominance index of each group set. In the formula, the group sets are numbered consecutively from 1, as are the candidate text data within each group set; the quantities involved are the matching value degree of each candidate text data in the group set, the number of candidate text data in the group set, and the maximum and minimum matching value degrees in the group set.

The group sets are arranged in descending order of matching dominance index to obtain the text matching order of the group sets.

As a further optimization of the above scheme, the hierarchy type analysis comprises the following steps: the number of segmented words of each candidate text data in each group set is counted, and the word segmentation coverage is calculated.

The parts of speech of consecutive segmented words are compared in the order in which the words appear, thereby identifying sentence pattern structures; the number of identified sentence pattern structures is counted, and the sentence pattern coverage is calculated.

The typesetting format features of each candidate text data in each group set are extracted, the number of typeset paragraphs is counted accordingly, and the paragraph coverage is calculated.

The word segmentation coverage, sentence pattern coverage and paragraph coverage of the candidate text data are passed through an analysis model to obtain the hierarchy type of the candidate text data. The model is built on three constraint conditions, which are defined on the word segmentation coverage, sentence pattern coverage and paragraph coverage of the candidate text data and on the pre-configured effective word segmentation coverage, effective sentence pattern coverage and effective paragraph coverage, and it combines these conditions with the logical operators and, or, and not.

As a further optimization of the above scheme, the adapted similarity algorithm is selected as follows: when the hierarchy type of a candidate text data is word level, sentence level or document level, the similarity algorithm corresponding to that hierarchy type is selected as the adapted similarity algorithm of the candidate text data.

When the hierarchy type of a candidate text data is other, the similarity algorithms applicable to the word level, the sentence level and the document level are compared, and the similarity algorithm common to all three is selected as the adapted similarity algorithm of the candidate text data.
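The selection rule can be sketched as a lookup with a set intersection as the fallback for the "other" type. The algorithm names registered below are illustrative assumptions; in particular the sentence-level and document-level entries are hypothetical placeholders, not a list taken from the text.

```python
# Hypothetical registry of similarity algorithms applicable to each hierarchy type.
ALGORITHMS_BY_LEVEL = {
    "word": {"cosine", "edit_distance", "lexical"},
    "sentence": {"cosine", "sentence_embedding"},
    "document": {"cosine", "topic_model"},
}


def select_algorithms(level):
    """Return the similarity algorithms adapted to a hierarchy type."""
    if level in ALGORITHMS_BY_LEVEL:
        return set(ALGORITHMS_BY_LEVEL[level])
    # Hierarchy type "other": keep only algorithms applicable to all three levels.
    return set.intersection(*ALGORITHMS_BY_LEVEL.values())


print(select_algorithms("sentence"))
print(select_algorithms("other"))
```

With this registry the fallback yields a single algorithm; if the overlap contained several, a further tie-breaking rule would be needed, which the text does not describe.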

As a further optimization of the above scheme, the search result for the request question is selected as follows: following the text matching order of the group sets, the semantic similarity of each candidate text data in each group set is compared with a preset qualifying semantic similarity, and as soon as the semantic similarity of a candidate text data is greater than or equal to the qualifying semantic similarity, that candidate text data is extracted as the search result for the request question.
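The early-exit behaviour of this selection step is sketched below: group sets are visited in matching order and the search stops at the first candidate that reaches the threshold, so later candidates are never scored. The function name, the callable `similarity` and the example scores are hypothetical.

```python
def select_result(ordered_groups, similarity, threshold):
    """Return the first candidate whose similarity reaches the threshold, else None."""
    for group in ordered_groups:          # group sets in text matching order
        for doc in group:                 # candidates within the group set
            if similarity(doc) >= threshold:
                return doc                # stop as soon as one candidate qualifies
    return None


scores = {"d1": 0.40, "d2": 0.85, "d3": 0.99}
scored = []


def lookup(doc):
    scored.append(doc)  # record which candidates were actually scored
    return scores[doc]


result = select_result([["d1", "d2"], ["d3"]], lookup, threshold=0.8)
print(result, scored)
```

Note that the first qualifying candidate is returned even if a later one (here `d3`) would score higher; that trade of optimality for speed follows from the "as soon as" wording of the step.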

By combining all the above technical schemes, the invention has the following positive effects:

1. Based on the subject terms of the request question input on the search platform, the invention screens candidate text data that conform to those subject terms out of the associated knowledge base, groups the candidate text data by commonality, and preprocesses the text according to that commonality. The improvement in massive text retrieval efficiency is thereby anchored in text preprocessing: information retrieval efficiency is raised by raising text preprocessing efficiency, which avoids the improvement bottleneck and storage burden caused by the prior art's excessive emphasis on algorithm optimization. Optimizing the other components of the information retrieval system also reduces the computation and response time of the whole system and improves its concurrent processing capacity and throughput, so that information retrieval is achieved more quickly and efficiently.

2. The invention performs hierarchy type analysis on the preprocessed text data and then selects an adapted similarity algorithm according to the hierarchy type, so that the text data are semantically matched with the adapted similarity algorithm. Text matching thus becomes a targeted operation: a better-performing algorithm can be selected for texts of different hierarchy types, which improves matching accuracy and adaptability, improves matching efficiency and performance, and reduces the consumption of computing resources.

Drawings

The invention will be further described with reference to the accompanying drawing. The embodiments shown in the drawing do not limit the invention in any way, and a person of ordinary skill in the art can obtain other drawings from the following drawing without inventive effort.

FIG. 1 is a flow chart of the steps of the method of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIG. 1, the invention provides a massive text retrieval matching method based on semantic analysis, which comprises the following steps:

S1, receiving a search instruction, extracting the request question currently input in the search platform, and extracting subject terms from the request question.

The subject terms are extracted as follows: the request question is segmented into words, and stop words are removed from the segmented words to obtain a plurality of effective words.

It should be noted that a key word is a word capable of expressing a subject. The parts of speech commonly used to express a subject are nouns, verbs, proper nouns and the like, so key words can be selected from the effective words by acquiring the part of speech of each effective word and matching it against these common parts of speech.

In order to facilitate text matching, the text data meeting the corresponding theme can be rapidly screened out by performing theme index marking on the text data stored in the knowledge base.

In the above embodiment, the candidate text data are grouped as follows: the uploading time of each candidate text data is acquired, and the candidate text data are numbered in order of uploading time from earliest to latest.

The attribution field refers to a specific field or application scene related to the text, and common attribution fields include news reports, social media, academic research, laws and regulations and the like.

The text length refers to the number of characters or words contained in the text.

For stop word removal, the frequency and order in which stop words are used in text data of the same attribution field may be characteristic of that field. For example, stop words may be used less frequently in news reports and more frequently in social media texts. Classifying the candidate text data by attribution field therefore keeps the stop word usage characteristics of the text data as uniform as possible, which greatly facilitates stop word removal.

For character removal, punctuation use in text data of the same attribution field is characteristic of that field. Combining the attribution field with the sentence number, so that candidate text data are classified by the same attribution field and the same sentence number, keeps the character usage characteristics of the text data as uniform as possible.

For stemming, the word form usage of text data in the same attribution field may be characteristic of that field, and changes in word form may lengthen or shorten the text. Combining the attribution field with the text length, so that candidate text data are classified by the same attribution field and the same text length, keeps the word form usage characteristics of the text data as uniform as possible.

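In a preferred implementation, let the numbers of attribution fields, text lengths and sentence numbers obtained by classification be $A$, $B$ and $C$ respectively; the number of joint label sets is then $A \times B \times C$. For example, with $A=3$, $B=2$ and $C=1$, the joint label sets are $\{a_1, b_1, c_1\}$, $\{a_1, b_2, c_1\}$, $\{a_2, b_1, c_1\}$, $\{a_2, b_2, c_1\}$, $\{a_3, b_1, c_1\}$ and $\{a_3, b_2, c_1\}$, where $a_1$, $a_2$, $a_3$ are the attribution fields obtained by classification, $b_1$, $b_2$ are the text lengths obtained by classification, and $c_1$ is the sentence number obtained by classification.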
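The enumeration in this example is a Cartesian product of the three label lists. A short sketch, in which the label names `a1`, `b1`, `c1` and so on are placeholder strings chosen for illustration:

```python
from itertools import product


def joint_label_sets(fields, lengths, sentence_counts):
    """Enumerate every combination of one field, one length and one sentence count."""
    return list(product(fields, lengths, sentence_counts))


sets_ = joint_label_sets(["a1", "a2", "a3"], ["b1", "b2"], ["c1"])
print(len(sets_))  # 3 * 2 * 1 combinations
for joint in sets_:
    print(joint)
```

Some combinations may contain no candidate text data at all, so in practice empty joint label sets would presumably be skipped when forming group sets.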

The candidate text data in each resulting group set share the same attribution field, text length and sentence number. Preprocessing by group reduces the scale of the data to be preprocessed at one time and ensures that similar texts share the same preprocessing flow, which avoids repeated processing operations and speeds up the whole preprocessing process. In addition, because the candidate text data stored in each group set share common characteristics, those characteristics can be extracted and stored only once rather than separately for each candidate text data, which greatly reduces memory occupation.

Based on the subject terms of the request question input on the search platform, the invention screens candidate text data that conform to those subject terms out of the associated knowledge base, groups the candidate text data by commonality, and preprocesses the text according to that commonality. The improvement in massive text retrieval efficiency is thereby anchored in text preprocessing: information retrieval efficiency is raised by raising text preprocessing efficiency, which avoids the improvement bottleneck and storage burden caused by the prior art's excessive emphasis on algorithm optimization. Optimizing the other components of the information retrieval system also reduces the computation and response time of the whole system and improves its concurrent processing capacity and throughput, so that information retrieval is achieved more quickly and efficiently.

Further to the above embodiment, the common characteristic of each group set is identified as follows: each candidate text data in each group set is segmented into words, the segmented text is traversed against the stop word list, the number of stop words is counted, and the ratio of the number of stop words to the total number of segmented words is calculated to obtain the stop word occupation ratio of each candidate text data.

The text character string of each candidate text data in each group set is traversed, the number of symbols in the text is counted, and the ratio of the number of symbols to the text length is calculated to obtain the symbol occupation ratio of each candidate text data. The symbols referred to here include, but are not limited to, punctuation marks and emoticons.

It should be noted that a root is the basic part of a word that cannot be subdivided further; it generally carries the main meaning and core concept of the word. In word formation, a root can be combined with affixes (prefixes, suffixes) to form a complete word.

Generally, a word takes different forms in different parts of speech, such as adjective, adverb and noun. The more parts of speech correspond to the same root, the higher the degree of word form variation is likely to be, and the greater the need for stemming.

It should be noted that, since the stop word occupation ratio, the symbol occupation ratio and the part-of-speech diversity are all between 0 and 1, comparison can be performed.

Specifically, preprocessing is performed as follows: the execution order of the preprocessing flow of each group set is determined according to the common characteristic of that group set, wherein the preprocessing flow comprises stop word removal, character removal and stemming. Stemming normalizes the different forms or variants of a word to its base form. Its purpose is to map words that are semantically similar but differ in form onto the same stem, thereby reducing vocabulary variants and simplifying text processing and analysis.

In a specific example of the above scheme, assuming that the common characteristic of a group set is stop word occupation ratio > part-of-speech diversity > symbol occupation ratio, the preprocessing flow is stop word removal, then stemming, then character removal.
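The mapping from a group set's common characteristic to an execution order can be sketched as a simple lookup. The feature keys, the step names used as dictionary values and both function names are hypothetical, and the actual step implementations are passed in by the caller rather than defined here.

```python
# Each ranked feature triggers the preprocessing step that addresses it.
STEP_FOR_FEATURE = {
    "stop_word_ratio": "stop word removal",
    "symbol_ratio": "character removal",
    "pos_diversity": "stemming",
}


def preprocessing_order(label_ordering):
    """Translate a label ordering (largest feature first) into an ordered list of steps."""
    return [STEP_FOR_FEATURE[feature] for feature in label_ordering]


def run_pipeline(tokens, label_ordering, steps):
    """Apply caller-supplied step functions to the tokens in the derived order."""
    for name in preprocessing_order(label_ordering):
        tokens = steps[name](tokens)
    return tokens


order = preprocessing_order(("stop_word_ratio", "pos_diversity", "symbol_ratio"))
print(order)
```

Because the order is computed once per group set, every candidate text data in the group set can reuse it without recomputing its own feature ranking.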

After screening out the candidate text data, the invention determines the order of the preprocessing flow according to the common characteristic of each group set, so that the most prevalent kind of useless information is processed first. This improves processing efficiency, reduces noise interference and simplifies subsequent processing tasks; unnecessary information is removed effectively and attention is focused on meaningful features, which improves the accuracy and efficiency of text processing.

In an improved implementation of the above scheme, the historical access frequency is acquired as follows: the uploading times of the candidate text data in each group set are compared to obtain the earliest uploading time, and the interval from that time to the current time forms the access period of the group set.

It should be noted that the more frequent historical access of text may mean that this text has attracted much attention and interest in the past, possibly because the text provides valuable information, a unique perspective, new findings, or reports of hot topics of public interest, possibly suggesting that it has some value.

In a further refinement, the text matching order of the group sets is determined as follows: the attribution field of each candidate text data in each group set is matched against the preset content credibility of each attribution field to obtain the content credibility of each candidate text data.

It should be appreciated that documents belonging to different fields may involve different expertise and concepts. Content in some fields, such as science, may be based on extensive research and reliable data; content in these fields is often subject to peer review and scientific scrutiny and therefore has higher credibility. In other fields, such as entertainment and social media, content may be more personal and subjective, with relatively low credibility.

The uploading time of each candidate text data in each group set is compared with the current time, and the content timeliness of each candidate text data is calculated by an expression that involves a reference time period, which may be 2 days. The closer the uploading time of a candidate text data is to the current time, the greater its content timeliness.

The matching value degrees of the candidate text data in each group set are substituted into a formula to calculate the matching dominance index of each group set. In the formula, the group sets are numbered consecutively from 1, as are the candidate text data within each group set; the quantities involved are the matching value degree of each candidate text data in the group set, the number of candidate text data in the group set, and the maximum and minimum matching value degrees in the group set. The smaller the difference among the matching value degrees of the candidate text data in a group set and the larger those matching value degrees are, the more the candidate text data in the group set combine high matching value with a stable distribution of matching value, and the greater the matching advantage of the group set.

The group sets are arranged in descending order of matching dominance index to obtain the text matching order of the group sets. The candidate text data in a group set with a larger matching advantage tend to be more reliable, more relevant and more valuable, so similarity matching is performed on them first. This reduces the number of texts that need to be matched, improves matching efficiency and reduces the consumption of time and resources; it also allows the most similar text data to be found more quickly, improving matching precision and accuracy.

S6, carrying out hierarchical type analysis on the corresponding candidate text data in each group of sets to obtain hierarchical types corresponding to each candidate text data, wherein the hierarchical types comprise word levels, sentence levels, document levels and others, and the specific analysis is as follows: counting the number of the word segments corresponding to each candidate text data in each group set, wherein the word segments refer to word segments after stop words are removed, and performing word segment coverage calculation, specifically, the maximum number of the effective word segments divided by each text data in a search platform association knowledge base can be taken as the number of reference word segments, and the ratio calculation is performed on the number of the word segments corresponding to the candidate text data and the number of the reference word segments to obtain the word segment coverage.

It is to be added that when the sentence pattern structure in the text data is identified, whether the sentence mark, question mark, exclamation mark and other common sentence terminal symbols exist in the text is not simply observed as the identification basis, but some contents are considered to be not provided with sentence terminal symbols but possibly become a sentence pattern, so that according to the grammar rules followed by the sentence pattern structure, the main predicate structure and the use of modifier are included, and the main predicate generally has corresponding part of speech representation, such as name and pronoun corresponding to the main predicate, and the part of speech corresponding to the predicate is usually a verb, thereby observing the part of speech relationship between the words in the text, judging whether the grammar rules are met, realizing the comprehensive identification of the sentence pattern structure, and greatly avoiding omission. Wherein the sentence pattern coverage is calculated by the same principle as the word segmentation coverage.

Extracting the typesetting format features corresponding to each candidate text data in each group set. Since first-line indentation is generally the typesetting feature of a paragraph, the number of typeset paragraphs is counted accordingly and the paragraph coverage is calculated, in a manner similar to the word segmentation coverage.

In the above, the word segmentation coverage, sentence pattern coverage and paragraph coverage are each calculated against the corresponding maximum value, so their values lie between 0 and 1. In this case, the preconfigured effective word segmentation coverage, effective sentence pattern coverage and effective paragraph coverage may each take the value 0.8.

It should be added that word-level text refers to a word sequence obtained by word segmentation of text. Word-level text is typically composed of individual words or phrases, without a complete sentence structure.

Sentence-level text refers to a sentence sequence obtained by sentence-dividing the text. At this level, the text is divided into complete sentences, each representing a semantically complete meaning unit. Sentence-level text has a definite grammatical structure and logical relationships, and sentences are interrelated to form a complete paragraph.

Document-level text generally has a complete paragraph structure, and its content has greater breadth and coherence.
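One way to turn the three coverages and their 0.8 thresholds into a hierarchy type is sketched below. The decision rule is an assumption made purely for illustration (the highest level whose coverage reaches its threshold wins); the exact constraint conditions of the analysis model are not reproduced in this text, so this should not be read as the patented model.

```python
def hierarchy_type(word_cov, sentence_cov, paragraph_cov,
                   thresholds=(0.8, 0.8, 0.8)):
    """Classify a candidate text as 'word', 'sentence', 'document' or 'other'.

    ASSUMED rule, not taken from the source: document level takes priority
    over sentence level, which takes priority over word level.
    """
    word_ok = word_cov >= thresholds[0]
    sentence_ok = sentence_cov >= thresholds[1]
    paragraph_ok = paragraph_cov >= thresholds[2]

    if paragraph_ok:
        return "document"
    if sentence_ok and not paragraph_ok:
        return "sentence"
    if word_ok and not (sentence_ok or paragraph_ok):
        return "word"
    return "other"
```

For example, under this assumed rule a text with coverages (0.9, 0.85, 0.2) would be treated as sentence level, while one with (0.3, 0.1, 0.0) would fall into the other category.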

S7, selecting an adapted similarity algorithm based on the hierarchy type corresponding to each candidate text data, implemented as follows: when the hierarchy type corresponding to the candidate text data is word level, sentence level or document level, the similarity algorithm corresponding to that hierarchy type is selected as the adapted similarity algorithm for the candidate text data.

In one example, algorithms such as cosine similarity, edit distance and lexical similarity may be used to calculate semantic similarity for word-level text.

For sentence-level text, semantic similarity may be calculated using cosine similarity, Jaccard similarity or other algorithms.

For document-level text, semantic similarity may be calculated using algorithms based on the bag-of-words model, TF-IDF and the like.
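The selection in S7 can be pictured as a lookup from hierarchy type to similarity function. The sketch below picks one algorithm per level from the examples listed above (edit distance for word level, Jaccard for sentence level, bag-of-words cosine for document level); this particular pairing, the function names and the choice to return `None` for the other category are assumptions for illustration.

```python
import math
from collections import Counter


def edit_similarity(a, b):
    """1 minus the Levenshtein distance normalised by the longer string."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def jaccard_similarity(a_tokens, b_tokens):
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(a_tokens), set(b_tokens)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine_similarity(a_tokens, b_tokens):
    """Cosine of the angle between two bag-of-words count vectors."""
    ca, cb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Assumed pairing of hierarchy type and algorithm; each entry takes two token lists.
SIMILARITY_BY_LEVEL = {
    "word": lambda q, t: edit_similarity(" ".join(q), " ".join(t)),
    "sentence": jaccard_similarity,
    "document": cosine_similarity,
}


def select_adapted_similarity(level):
    """Return the similarity function for a hierarchy type, or None for 'other'."""
    return SIMILARITY_BY_LEVEL.get(level)
```

A TF-IDF weighting could replace the raw counts in `cosine_similarity` for document-level text without changing the dispatch structure.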

S9, selecting the search result corresponding to the request problem based on the semantic similarity of each candidate text data in each group set, and displaying the search result in the search output box. Specifically, the search result is selected as follows: following the text matching order of the group sets, the semantic similarity of each candidate text data in each group set is compared with the preset semantic similarity threshold, and as soon as the semantic similarity of a candidate text data is greater than or equal to the threshold, that candidate text data is extracted as the search result corresponding to the request problem.
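The early-exit selection in S9 can be sketched as follows. The data layout (a list of group sets already sorted by text matching order, each holding `(text, similarity)` pairs) and the function name are assumptions for the example.

```python
def select_result(ordered_groups, required_similarity):
    """Return the first candidate whose similarity reaches the threshold.

    ordered_groups: group sets in text matching order, each a list of
    (candidate_text, semantic_similarity) pairs.
    Returns None when no candidate reaches the threshold.
    """
    for group in ordered_groups:
        for text, similarity in group:
            if similarity >= required_similarity:
                return text  # stop at the first qualifying candidate
    return None
```

Because the scan stops at the first qualifying candidate, group sets placed earlier in the matching order are favoured, and later group sets need not be examined at all.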

In this text matching method, hierarchy type analysis is performed on the preprocessed text data, and an adapted similarity algorithm is then selected according to the hierarchy type and used for semantic matching of the text data. Text matching thereby becomes a targeted operation: an algorithm with better computational performance can be chosen for texts of each hierarchy type, which improves matching accuracy and adaptability, improves matching efficiency and performance, and reduces the consumption of computing resources.

The foregoing is merely illustrative of the structure of this invention. Those skilled in the art may make various modifications, additions and substitutions to the specific embodiments described without departing from the structure of the invention or exceeding the scope of the invention as defined by the claims.

Claims

1. The massive text retrieval matching method based on semantic analysis is characterized by comprising the following steps of:

S1, receiving a search instruction, extracting the request problem currently input in the search platform, and extracting a subject term from the request problem;

S2, comparing the subject marked on each text data stored in the knowledge base associated with the search platform with the subject term of the request problem, and screening out the text data conforming to the subject term of the request problem as candidate text data;

S3, grouping the candidate text data to obtain a plurality of group sets, and carrying out common characteristic identification on each group set;

S4, preprocessing the candidate text data in the corresponding group set according to the common characteristic corresponding to each group set to obtain the processed candidate text data corresponding to each group set;

S5, obtaining the content credibility, uploading time and historical access frequency corresponding to each candidate text data in each group set, and thereby determining the text matching order of the group sets;

S6, carrying out hierarchy type analysis on the candidate text data in each group set to obtain the hierarchy type corresponding to each candidate text data;

S7, selecting an adapted similarity algorithm based on the hierarchy type corresponding to each candidate text data;

S8, sequentially calling the candidate text data in the corresponding group sets according to the text matching order of the group sets, and carrying out text matching between the called candidate text data and the request problem by using the adapted similarity algorithm to obtain the semantic similarity of each candidate text data;

S9, selecting the search result corresponding to the request problem based on the semantic similarity of each candidate text data in each group set, and displaying the search result in the search output box;

the grouping of the candidate text data is performed as follows:

acquiring the uploading time corresponding to each candidate text data, and numbering the candidate text data in order of uploading time from earliest to latest;

Extracting the attribution field and the text length corresponding to each candidate text data;

performing sentence segmentation on each candidate text data, and counting the number of segmented sentences;

classifying the candidate text data according to the same attribution field, the same text length and the same sentence number to obtain candidate text data corresponding to a plurality of attribution fields, a plurality of text lengths and a plurality of sentence numbers;

Taking the attribution field, the text length and the sentence number as classification labels, counting the attribution field number, the text length number and the sentence number obtained by classification, and sequentially extracting the attribution field, the text length and the sentence number according to the classification labels to form a joint label set, so as to obtain a plurality of joint label sets;

Comparing the numbers of the candidate text data in the same joint label set, and extracting the candidate text data corresponding to the same numbers from the numbers to form a group set;

the implementation process of the common characteristic identification for each group of sets is as follows:

Dividing each candidate text data in each group of sets into words, traversing the divided text by using a stop word list, counting the number of stop words, and further carrying out proportional calculation on the number of stop words and the divided total word number to obtain a stop word occupation ratio corresponding to each candidate text data;

traversing the text character string of each candidate text data in each group set, counting the number of symbols in the text, and calculating the proportion of the number of symbols to the length of the text to obtain the symbol occupation ratio corresponding to each candidate text data;

performing part-of-speech tagging and root tagging on the segmented words obtained from the candidate text data in each group set, classifying the segmented words corresponding to the same root to obtain a segmented word set corresponding to each root, comparing the parts of speech of the segmented words in each segmented word set and de-duplicating repeated parts of speech to obtain the number of parts of speech in the segmented word set corresponding to each root, and calculating therefrom the part-of-speech diversity corresponding to each candidate text data, wherein the calculation uses the number of parts of speech and the number of segmented words in the segmented word set corresponding to each root in the candidate text data, the roots being numbered from 1 up to the number of roots present in the segmented words obtained from the candidate text data, together with the natural constant;

arranging the stop word occupation ratio, the symbol occupation ratio and the part-of-speech diversity corresponding to each candidate text data in the same group of sets according to the sequence from big to small to obtain label ordering corresponding to each candidate text data;

comparing the label orderings corresponding to the candidate text data in each group set, gathering together the candidate text data having the same label ordering, and taking the label ordering with the highest occurrence frequency as the common characteristic corresponding to each group set;

the preprocessing of the candidate text data in the corresponding group set according to the common characteristic corresponding to each group set is performed as follows:

determining the execution order of the preprocessing flow corresponding to each group set according to the common characteristic corresponding to each group set, wherein the preprocessing flow comprises stop word removal, symbol removal and stemming;

preprocessing the candidate text data in each group set according to the execution order of the preprocessing flow;

The hierarchical type parsing is described in the following steps:

Counting the number of word segmentation corresponding to each candidate text data in each group of sets, and calculating word segmentation coverage;

Comparing the parts of speech of the continuous word segmentation according to the arrangement sequence of the word segmentation, thereby identifying sentence pattern structures, further counting the number of the identified sentence pattern structures, and carrying out sentence pattern coverage calculation;

Extracting typesetting format characteristics corresponding to each candidate text data in each group of sets, thereby counting the number of typeset paragraphs and performing paragraph coverage calculation;

passing the word segmentation coverage, sentence pattern coverage and paragraph coverage corresponding to the candidate text data through an analysis model to obtain the hierarchy type corresponding to the candidate text data, wherein the model is built from three constraint conditions defined on the word segmentation coverage, sentence pattern coverage and paragraph coverage corresponding to the candidate text data and on the preconfigured effective word segmentation coverage, effective sentence pattern coverage and effective paragraph coverage, the constraint conditions being combined by the logical operators and, or, not.

2. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the implementation process of extracting the subject term of the request problem is as follows:

dividing the request problem into words, and removing stop words from the divided words to obtain a plurality of effective words;

acquiring the part of speech of each effective word, and screening out key word from each effective word according to the part of speech;

3. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the history access frequency obtaining process is as follows:

comparing the uploading times of the candidate text data in each group set, and taking the earliest uploading time together with the current time to form the access period corresponding to each group set;

4. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the text matching sequence of each group set is determined as follows:

matching the attribution field corresponding to each candidate text data in each group of sets with the preset content credibility corresponding to each attribution field to obtain the content credibility corresponding to each candidate text data;

comparing the uploading time corresponding to each candidate text data in each group of sets with the current time, and calculating the content timeliness corresponding to each candidate text data, wherein the specific expression is as follows ；

importing the content credibility, the content timeliness and the historical access frequency corresponding to each candidate text data in each group set into an evaluation expression to obtain the matching value degree corresponding to each candidate text data in each group set;

substituting the matching value degree corresponding to each candidate text data in each group set into a formula to calculate the matching dominance index corresponding to each group set, wherein the formula is expressed in terms of the group set number, the matching value degree corresponding to each numbered candidate text data within the group set, the number of candidate text data present within the group set, and the maximum and minimum matching value degrees corresponding to the group set;

5. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the selection adaptation similarity algorithm is implemented as follows:

when the hierarchy type corresponding to the candidate text data is word level, sentence level or document level, selecting a similarity algorithm corresponding to the corresponding hierarchy type as an adaptation similarity algorithm corresponding to the candidate text data;

6. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the search result corresponding to the selection request question is found in the following process:

following the text matching order of the group sets, comparing the semantic similarity corresponding to each candidate text data in each group set with the preset semantic similarity threshold, and as soon as the semantic similarity corresponding to a candidate text data is greater than or equal to the threshold, extracting that candidate text data as the search result corresponding to the request problem.