CN117972025B - Massive text retrieval matching method based on semantic analysis - Google Patents
Massive text retrieval matching method based on semantic analysis
- Publication number
- CN117972025B (Application CN202410386961.4A)
- Authority
- CN
- China
- Prior art keywords
- text data
- group
- sets
- candidate text
- word
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention belongs to the technical field of text retrieval matching, and particularly relates to a massive text retrieval matching method based on semantic analysis.
Description
Technical Field
The invention belongs to the technical field of text retrieval matching, and particularly relates to a massive text retrieval matching method based on semantic analysis.
Background
With the development of the internet and digitization, a large amount of information is generated and accumulated in every field, and finding specific, useful information within it has become difficult. Search platforms emerged in response. However, as user demands for information types and topics grow, a search platform must provide text data that is as diverse and comprehensive as possible, which produces massive text data; efficiently retrieving the information a user needs from this massive text data has become a core challenge for search platforms.
In the prior art, efforts to retrieve information efficiently from massive text data usually focus on the text matching algorithm, aiming to improve retrieval efficiency by refining and optimizing how the matching algorithm processes text. However, information retrieval is an end-to-end process, and the matching algorithm is only one important step of it. Retrieval also includes preprocessing before text matching: by cleaning, normalizing and converting the text data, preprocessing removes useless information and reduces the computation and time consumed by subsequent matching. If preprocessing is slow, the subsequent matching step is dragged down, especially on massive text data, so improving the efficiency of text preprocessing also deserves attention. Moreover, focusing only on algorithm optimization while ignoring the efficiency of the other retrieval steps cannot deliver a comprehensive, system-level improvement.
In addition, when the prior art performs semantic similarity matching on preprocessed text data, it usually applies a single general-purpose, high-accuracy matching algorithm uniformly, without considering which text hierarchy levels each matching algorithm suits. This easily causes a mismatch between algorithm and text, leading to reduced matching efficiency, inaccurate matching results and incomplete matches, which in turn degrades the matching effect.
Disclosure of Invention
Therefore, an object of the embodiments of the present application is to provide a massive text retrieval matching method based on semantic analysis, which improves massive text retrieval efficiency by focusing on text preprocessing and by selecting a text matching algorithm in a targeted way, thereby effectively solving the problems mentioned in the background.
The aim of the invention is achieved by the following technical scheme. A massive text retrieval matching method based on semantic analysis comprises the following steps: S1, receiving a search instruction, extracting the request question currently entered on the search platform, and extracting subject terms from the request question.
S2, comparing the subject marked on each piece of text data stored in the knowledge base associated with the search platform against the subject terms of the request question, and screening out the text data whose subject matches the subject terms as candidate text data.
S3, grouping the candidate text data into a number of group sets, and identifying the common characteristic of each group set.
S4, preprocessing the candidate text data in each group set according to that group set's common characteristic, obtaining the processed candidate text data for each group set.
S5, obtaining the content credibility, upload time and historical access frequency of each piece of candidate text data in each group set, and from these determining the text matching order of the group sets.
S6, performing hierarchy-type parsing on the candidate text data in each group set to obtain the hierarchy type of each piece of candidate text data.
S7, selecting an adapted similarity algorithm based on the hierarchy type of each piece of candidate text data.
S8, calling up the candidate text data of each group set in turn according to the text matching order of the group sets, and matching the called candidate text data against the request question with the adapted similarity algorithm to obtain the semantic similarity of each piece of candidate text data.
S9, selecting the retrieval result for the request question based on the semantic similarity of the candidate text data in each group set, and outputting and displaying the retrieval result in the search output box.
As a further optimization of the above scheme, the subject terms of the request question are extracted as follows: the request question is segmented into words, and stop words are removed from the segmented words to obtain a number of effective words.
The part of speech of each effective word is obtained, and the key words are screened out from the effective words according to part of speech.
The key words are compared with one another to identify whether duplicates exist; if so, the duplicated key words are de-duplicated to obtain the subject terms.
As a further optimization of the above scheme, the candidate text data are grouped by the following procedure: the upload time of each piece of candidate text data is obtained, and the candidate text data are numbered in order of upload time from earliest to latest.
The attribution domain and text length of each piece of candidate text data are extracted.
Each piece of candidate text data is segmented into sentences, and the number of sentences is counted.
The candidate text data are classified by identical attribution domain, identical text length and identical sentence count, obtaining the candidate text data corresponding to each attribution domain, each text length and each sentence count.
Taking attribution domain, text length and sentence count as classification labels, the numbers of attribution domains, text lengths and sentence counts obtained by classification are counted, and one attribution domain, one text length and one sentence count are drawn in turn to form a joint label set, yielding a number of joint label sets.
The candidate text data falling under the same joint label set are identified by their numbers and extracted to form a group set.
As a further optimization of the above scheme, the common characteristic of each group set is identified as follows: each piece of candidate text data in each group set is segmented into words, the segmented text is traversed with a stop-word list, the number of stop words is counted, and this count is divided by the total number of segmented words to obtain the stop-word ratio of each piece of candidate text data.
The text string of each piece of candidate text data in each group set is traversed, the number of symbols in the text is counted, and this count is divided by the text length to obtain the symbol ratio of each piece of candidate text data.
Part-of-speech tagging and root tagging are carried out on the words segmented from the candidate text data in each group set; the segmented words sharing the same root are gathered to obtain the word set of each root; the parts of speech of the words in each word set are compared and duplicates removed to obtain the number of distinct parts of speech in the word set of each root; from these, the part-of-speech diversity of each piece of candidate text data is calculated, the calculation taking, for each root, the number of distinct parts of speech and the number of segmented words in that root's word set, with the root index running from 1 up to the number of roots present among the segmented words of the candidate text data, together with the natural constant.
The stop-word ratio, symbol ratio and part-of-speech diversity of each piece of candidate text data in the same group set are arranged in descending order to obtain the label ordering of each piece of candidate text data.
The label orderings of the candidate text data in each group set are compared, the candidate text data sharing the same label ordering are gathered, and the label ordering that occurs most frequently is taken as the common characteristic of that group set.
As a further optimization of the above scheme, preprocessing the candidate text data in each group set according to that group set's common characteristic involves the following operations: the execution order of the preprocessing flow for each group set is determined by its common characteristic, the preprocessing flow comprising stop-word removal, character removal and stemming.
The candidate text data in each group set are then preprocessed according to that execution order.
As a further optimization of the above scheme, the historical access frequency is obtained as follows: the upload times of the candidate text data in each group set are compared, and the earliest upload time together with the current time forms the access period of that group set.
The number of accesses to each piece of candidate text data during the access period of its group set is counted and divided by the sum of the access counts of all candidate text data in that group set, giving the historical access frequency of each piece of candidate text data.
As a further optimization of the above scheme, the text matching order of the group sets is determined by the following steps: the attribution domain of each piece of candidate text data in each group set is matched against the preset content credibility of each attribution domain to obtain the content credibility of each piece of candidate text data.
The upload time of each piece of candidate text data in each group set is compared with the current time, and the content timeliness of each piece of candidate text data is calculated from the gap between the two.
The content credibility, content timeliness and historical access frequency of each piece of candidate text data in each group set are imported into an evaluation expression to obtain the matching value of each piece of candidate text data in each group set.
The matching value of each piece of candidate text data in each group set is substituted into a formula to calculate the matching dominance index of each group set; the formula involves, for each group set, the matching values of its candidate text data, with the candidate index running over the number of candidate text data present in the group set, together with the maximum and minimum matching values within that group set.
The group sets are arranged in descending order of matching dominance index to obtain the text matching order of the group sets.
As a further optimization of the above scheme, the hierarchy-type parsing proceeds by the following steps: the number of segmented words of each piece of candidate text data in each group set is counted, and the word coverage is calculated.
The parts of speech of consecutive segmented words are compared in their order of arrangement to identify sentence-pattern structures; the number of identified sentence-pattern structures is counted and the sentence-pattern coverage is calculated.
The typesetting-format features of each piece of candidate text data in each group set are extracted, the number of typeset paragraphs is counted, and the paragraph coverage is calculated.
The word coverage, sentence-pattern coverage and paragraph coverage of the candidate text data are passed through a parsing model to obtain the hierarchy type of the candidate text data. The model comprises three constraint conditions that compare the word coverage, sentence-pattern coverage and paragraph coverage of the candidate text data with the pre-configured effective word coverage, effective sentence-pattern coverage and effective paragraph coverage, combined by the logical operators AND, OR and NOT respectively.
As a further optimization of the above scheme, the adapted similarity algorithm is selected as follows: when the hierarchy type of the candidate text data is word level, sentence level or document level, the similarity algorithm corresponding to that hierarchy type is selected as the adapted similarity algorithm for the candidate text data.
When the hierarchy type of the candidate text data is 'other', the similarity algorithms applicable to the word level, the sentence level and the document level are compared, and an algorithm common to all of them is selected as the adapted similarity algorithm for the candidate text data.
As a further optimization of the above scheme, the retrieval result for the request question is selected by the following process: the semantic similarity of each piece of candidate text data in each group set is compared against the preset qualifying semantic similarity, following the text matching order of the group sets; as soon as the semantic similarity of a piece of candidate text data is greater than or equal to the qualifying semantic similarity, that candidate text data is extracted as the retrieval result for the request question.
Combining all of the above technical schemes, the invention has the following positive effects: 1. The invention screens out from the associated knowledge base the candidate text data matching the subject terms of the request question entered on the search platform, groups the candidate text data by their commonalities, and preprocesses them according to those commonalities. The improvement of massive-text retrieval efficiency thus falls on text preprocessing: retrieval efficiency is raised by raising preprocessing efficiency, avoiding the improvement bottleneck and storage burden caused by the prior art's over-emphasis on algorithm optimization. By optimizing other components of the retrieval system, the overall computation and response time is reduced and the system's concurrency and throughput are increased, so that information retrieval is carried out more quickly and efficiently.
2. The invention parses the hierarchy type of the preprocessed text data and then selects the adapted similarity algorithm according to that hierarchy type, so that semantic matching is performed with an algorithm suited to the text. This makes text matching targeted: an algorithm with better computational effect can be chosen for texts of different hierarchy types, which improves matching accuracy and adaptability, raises matching efficiency and performance, and reduces the consumption of computing resources.
Drawings
The invention is further described below with reference to the accompanying drawings. The embodiments do not limit the invention in any way, and a person of ordinary skill in the art can derive other drawings from the following drawings without inventive effort.
FIG. 1 is a flow chart of the steps of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
Referring to fig. 1, the invention provides a massive text retrieval matching method based on semantic analysis, comprising the following steps: S1, receiving a search instruction, extracting the request question currently entered on the search platform, and extracting subject terms from the request question.
The subject terms are extracted as follows: the request question is segmented into words, and stop words are removed from the segmented words to obtain a number of effective words.
The part of speech of each effective word is obtained, and the key words are screened out from the effective words according to part of speech.
It should be noted that the key words are words capable of expressing a subject. The parts of speech commonly used to express a subject are nouns, verbs, proper nouns and the like; by obtaining the part of speech of each effective word and matching it against these subject-expressing parts of speech, the key words can be selected from the effective words.
The key words are compared with one another to identify whether duplicates exist; if so, the duplicated key words are de-duplicated to obtain the subject terms.
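As an illustration of step S1, the sketch below walks through subject-term extraction under the assumption that the jieba tokenizer and its part-of-speech tags are used; the patent does not name a particular tokenizer, stop-word list or POS tag set, so all of those are placeholders here.

```python
# Minimal sketch of step S1, assuming the jieba tokenizer and its part-of-speech
# tags; the patent itself does not name a specific tokenizer or POS tagger.
import jieba.posseg as pseg

STOP_WORDS = {"的", "了", "是", "在", "吗", "请问"}          # illustrative stop-word list
SUBJECT_POS_PREFIXES = ("n", "v", "nr", "ns", "nz")          # nouns, verbs, proper nouns

def extract_subject_terms(request_question: str) -> list[str]:
    # 1) segment the request question and drop stop words -> effective words
    pairs = [(w.word, w.flag) for w in pseg.cut(request_question)]
    effective = [(word, flag) for word, flag in pairs
                 if word.strip() and word not in STOP_WORDS]
    # 2) keep only words whose part of speech commonly expresses a subject
    key_words = [word for word, flag in effective
                 if flag.startswith(SUBJECT_POS_PREFIXES)]
    # 3) de-duplicate repeated key words while preserving order -> subject terms
    seen, subject_terms = set(), []
    for word in key_words:
        if word not in seen:
            seen.add(word)
            subject_terms.append(word)
    return subject_terms
```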
S2, comparing the subject marked on each piece of text data stored in the knowledge base associated with the search platform against the subject terms of the request question, and screening out the text data whose subject matches the subject terms as candidate text data.
It should be explained that, to facilitate text matching, a subject index mark is applied to the text data stored in the knowledge base, so that the text data matching a given subject can be screened out quickly.
S3, grouping the candidate text data into a number of group sets, and identifying the common characteristic of each group set.
Applied to the above embodiment, the candidate text data are grouped as follows: the upload time of each piece of candidate text data is obtained, and the candidate text data are numbered in order of upload time from earliest to latest.
The attribution domain and text length of each piece of candidate text data are extracted.
The attribution domain refers to the specific field or application scenario the text relates to; common attribution domains include news reports, social media, academic research, laws and regulations, and the like.
The text length refers to the number of characters or words the text contains.
Each piece of candidate text data is segmented into sentences, and the number of sentences is counted.
The candidate text data are classified by identical attribution domain, identical text length and identical sentence count, obtaining the candidate text data corresponding to each attribution domain, each text length and each sentence count.
The reason is that, for stop words, the frequency and pattern of stop-word use in text data from the same attribution domain carry the characteristics of that domain: in news reports stop words may be used less frequently, while in social media texts they may be used more often. Classifying candidate text data by the same attribution domain therefore ensures, as far as possible, uniform stop-word usage within a class, which greatly facilitates the subsequent stop-word processing.
For character removal, the punctuation usage of text data in the same attribution domain likewise carries the characteristics of that domain; combining the attribution domain with the sentence count and classifying by the same attribution domain and the same sentence count ensures, as far as possible, uniform character usage within a class.
For stemming, the word-form usage of text data in the same attribution domain may carry the characteristics of that domain, and changes of word form may lengthen or shorten the text; combining the attribution domain with the text length and classifying by the same attribution domain and the same text length ensures, as far as possible, uniform word-form usage within a class.
Taking attribution domain, text length and sentence count as classification labels, the numbers of attribution domains, text lengths and sentence counts obtained by classification are counted, and one attribution domain, one text length and one sentence count are drawn in turn to form a joint label set, yielding a number of joint label sets.
In a preferred implementation, let the numbers of attribution domains, text lengths and sentence counts obtained by classification be A, B and C respectively; the number of joint label sets is then the number of their combinations. With A=3, B=2 and C=1, the joint label sets in this example are the six combinations of one of the three classified attribution domains, one of the two classified text lengths and the single sentence count.
The candidate text data falling under the same joint label set are identified by their numbers and extracted to form a group set.
The candidate data in the resulting group set share the same attribution domain, text length and sentence count. Preprocessing them group by group reduces the scale of the data handled at once and ensures that similar texts share the same preprocessing flow, avoiding repeated processing operations and accelerating the whole preprocessing stage. In addition, since the candidate text data stored in each group set share common characteristics, these characteristics can be extracted once and shared instead of being stored separately for every piece of candidate text data, which greatly reduces memory usage.
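A minimal sketch of the grouping in step S3, assuming the attribution domain is already known for each candidate and using a crude terminator-based sentence count; the Candidate structure and its field names are illustrative, not taken from the patent.

```python
# Minimal sketch of step S3 grouping: the joint label (attribution domain,
# text length, sentence count) defines one group set.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Candidate:
    number: int          # assigned by upload-time order (earliest first)
    domain: str          # attribution domain, e.g. "news", "social media"
    text: str

def sentence_count(text: str) -> int:
    # crude sentence segmentation on common Chinese/Latin terminators
    return max(1, sum(text.count(t) for t in "。！？.!?"))

def group_candidates(candidates: list[Candidate]) -> dict[tuple, list[Candidate]]:
    groups: dict[tuple, list[Candidate]] = defaultdict(list)
    for c in candidates:
        label = (c.domain, len(c.text), sentence_count(c.text))
        groups[label].append(c)
    return groups
```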
The invention screens out from the associated knowledge base the candidate text data matching the subject terms of the request question entered on the search platform, groups the candidate text data by their commonalities, and preprocesses them according to those commonalities. The improvement of massive-text retrieval efficiency thus falls on text preprocessing: retrieval efficiency is raised by raising preprocessing efficiency, avoiding the improvement bottleneck and storage burden caused by the prior art's over-emphasis on algorithm optimization. By optimizing other components of the retrieval system, the overall computation and response time is reduced and the system's concurrency and throughput are increased, so that information retrieval is carried out more quickly and efficiently.
Further applied to the above embodiment, the common characteristic of each group set is identified as follows: each piece of candidate text data in each group set is segmented into words, the segmented text is traversed with a stop-word list, the number of stop words is counted, and this count is divided by the total number of segmented words to obtain the stop-word ratio of each piece of candidate text data.
The text string of each piece of candidate text data in each group set is traversed, the number of symbols in the text is counted, and this count is divided by the text length to obtain the symbol ratio of each piece of candidate text data. Symbols here include, but are not limited to, punctuation marks, emoticons and the like.
Part-of-speech tagging and root tagging are carried out on the words segmented from the candidate text data in each group set; the segmented words sharing the same root are gathered to obtain the word set of each root; the parts of speech of the words in each word set are compared and duplicates removed to obtain the number of distinct parts of speech in the word set of each root; from these, the part-of-speech diversity of each piece of candidate text data is calculated, the calculation taking, for each root, the number of distinct parts of speech and the number of segmented words in that root's word set, with the root index running from 1 up to the number of roots present among the segmented words of the candidate text data, together with the natural constant.
It should be noted that a word root is the basic, indivisible part of a word, which usually carries its main meaning and core concept. When words are formed, the root can combine with affixes (prefixes, suffixes) to make a complete word.
Generally, the different parts of speech built on a word show different morphological variation, as with adjectives, adverbs and nouns; the more parts of speech a single root corresponds to, the higher its degree of morphological variation is likely to be, and the higher the demand placed on stemming.
The stop-word ratio, symbol ratio and part-of-speech diversity of each piece of candidate text data in the same group set are arranged in descending order to obtain the label ordering of each piece of candidate text data.
It should be noted that the stop-word ratio, symbol ratio and part-of-speech diversity all lie between 0 and 1, so they can be compared directly.
The label orderings of the candidate text data in each group set are compared, the candidate text data sharing the same label ordering are gathered, and the label ordering that occurs most frequently is taken as the common characteristic of that group set.
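The following sketch puts the three indicators and the label ordering together. Since the patent's part-of-speech diversity formula is not reproduced in the text available here, pos_diversity below is an assumed stand-in that also yields a value in [0, 1]; the stop-word list and symbol set are likewise illustrative.

```python
# Minimal sketch of the common-characteristic identification per group set.
from collections import Counter
import string

STOP_WORDS = {"的", "了", "是", "在"}                     # illustrative
SYMBOLS = set(string.punctuation) | set("，。！？；：、“”")

def stop_word_ratio(tokens: list[str]) -> float:
    return sum(t in STOP_WORDS for t in tokens) / max(1, len(tokens))

def symbol_ratio(text: str) -> float:
    return sum(ch in SYMBOLS for ch in text) / max(1, len(text))

def pos_diversity(root_to_pos: dict[str, set[str]], root_to_words: dict[str, list[str]]) -> float:
    # assumed surrogate: average (distinct POS count / word count) per root
    ratios = [len(root_to_pos[r]) / max(1, len(root_to_words[r])) for r in root_to_pos]
    return sum(ratios) / max(1, len(ratios))

def label_ordering(swr: float, sr: float, pd: float) -> tuple[str, ...]:
    # descending order of the three ratios, e.g. ("stop_word", "pos_diversity", "symbol")
    named = [("stop_word", swr), ("symbol", sr), ("pos_diversity", pd)]
    return tuple(name for name, _ in sorted(named, key=lambda x: -x[1]))

def common_characteristic(orderings: list[tuple[str, ...]]) -> tuple[str, ...]:
    # most frequent label ordering within the group set
    return Counter(orderings).most_common(1)[0][0]
```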
S4, preprocessing the candidate text data in each group set according to that group set's common characteristic, obtaining the processed candidate text data for each group set.
Specifically, the preprocessing proceeds as follows: the execution order of the preprocessing flow for each group set is determined by its common characteristic, the preprocessing flow comprising stop-word removal, character removal and stemming. Stemming normalizes the different forms or variants of a word to its basic form; its purpose is to map words of similar meaning but different form onto the same stem, reducing vocabulary variants and simplifying the complexity of text processing and analysis.
In a specific example of the above scheme, assuming the common characteristic of a group set is stop-word ratio > part-of-speech diversity > symbol ratio, the preprocessing flow is: stop-word removal, then stemming, then character removal.
The candidate text data in each group set are then preprocessed according to that execution order.
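A sketch of how the common characteristic could drive the preprocessing order in step S4; the three step functions are deliberately simple placeholders for real stop-word removal, character removal and stemming.

```python
# Minimal sketch of step S4: the common characteristic (a descending ordering of
# the three ratios) dictates the order in which the preprocessing steps run.
def remove_stop_words(text: str) -> str:
    return " ".join(t for t in text.split() if t not in {"的", "了", "是"})  # illustrative

def remove_characters(text: str) -> str:
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

def stem_words(text: str) -> str:
    return text  # placeholder: a real stemmer/lemmatizer would go here

# each label maps to the preprocessing step that addresses it
STEP_FOR_LABEL = {
    "stop_word": remove_stop_words,       # high stop-word ratio -> remove stop words first
    "symbol": remove_characters,          # high symbol ratio -> strip characters first
    "pos_diversity": stem_words,          # high part-of-speech diversity -> stem first
}

def preprocess_group(texts: list[str], common_characteristic: tuple[str, ...]) -> list[str]:
    pipeline = [STEP_FOR_LABEL[label] for label in common_characteristic]
    processed = []
    for text in texts:
        for step in pipeline:
            text = step(text)
        processed.append(text)
    return processed
```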
After the candidate text data are screened, the invention determines the order of the preprocessing flow by the common characteristic of the group set, so that the most prevalent useless information is processed first. This raises processing efficiency, reduces noise interference and simplifies the subsequent processing tasks; unnecessary information is removed effectively and attention is focused on the meaningful features, which improves the accuracy and efficiency of text processing.
S5, obtaining the content credibility, upload time and historical access frequency of each piece of candidate text data in each group set, and from these determining the text matching order of the group sets.
In an improved implementation of the above scheme, the historical access frequency is obtained as follows: the upload times of the candidate text data in each group set are compared, and the earliest upload time together with the current time forms the access period of that group set.
The number of accesses to each piece of candidate text data during the access period of its group set is counted and divided by the sum of the access counts of all candidate text data in that group set, giving the historical access frequency of each piece of candidate text data.
It should be noted that more frequent historical access to a text may mean it attracted considerable attention and interest in the past, perhaps because it provides valuable information, a unique perspective, new findings, or reports on hot topics of public concern; this in turn suggests that it carries some value.
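A small sketch of the access period and historical access frequency described above; the access counts are assumed to come from the platform's access logs, which the patent does not detail.

```python
# Minimal sketch: each candidate's access count over the group's access period,
# normalised by the group total.
from datetime import datetime

def access_period(upload_times: list[datetime], now: datetime) -> tuple[datetime, datetime]:
    # the period runs from the earliest upload time in the group set to now
    return min(upload_times), now

def historical_access_frequency(access_counts: dict[int, int]) -> dict[int, float]:
    # access_counts: candidate number -> accesses within the group's access period
    total = sum(access_counts.values())
    return {num: (cnt / total if total else 0.0) for num, cnt in access_counts.items()}
```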
In a further refinement, the text matching order of the group sets is determined as follows: the attribution domain of each piece of candidate text data in each group set is matched against the preset content credibility of each attribution domain to obtain the content credibility of each piece of candidate text data.
It should be appreciated that texts belonging to different domains involve different expertise and concepts. Content in some domains, such as science, rests on extensive research and reliable data and is typically subject to peer review, so its credibility is higher; content in other domains, such as entertainment and social media, tends to be more personal and subjective, so its credibility is relatively low.
The upload time of each piece of candidate text data in each group set is compared with the current time, and the content timeliness of each piece of candidate text data is calculated from the gap between the two relative to a reference period; the reference period may be taken as 2 days, and the closer the upload time of a piece of candidate text data is to the current time, the greater its content timeliness.
The content credibility, content timeliness and historical access frequency of each piece of candidate text data in each group set are imported into an evaluation expression to obtain the matching value of each piece of candidate text data in each group set.
The matching value of each piece of candidate text data in each group set is substituted into a formula to calculate the matching dominance index of each group set; the formula involves, for each group set, the matching values of its candidate text data, with the candidate index running over the number of candidate text data present in the group set, together with the maximum and minimum matching values within that group set. The smaller the spread of matching values among the candidate text data of a group set, the larger the matching dominance index: the candidate text data of that group set not only have high matching value but also a more stable distribution of matching value, and therefore a greater matching advantage.
The group sets are arranged in descending order of matching dominance index to obtain the text matching order. The candidate text data in group sets with a larger matching advantage tend to be more reliable, more relevant and more valuable, so matching them first reduces the number of texts that need to be matched, raises matching efficiency, cuts time and resource consumption, and at the same time focuses the search more quickly on the more similar text data, improving matching precision and accuracy.
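Because the patent's evaluation expression and dominance-index formula are not given in the text available here, the sketch below uses stand-ins that respect the stated behaviour: higher credibility, timeliness and access frequency raise the matching value, and a smaller spread of matching values within a group raises its dominance index. Treat both functions as assumptions.

```python
# Minimal sketch of the S5 ordering with assumed scoring functions.
def matching_value(credibility: float, timeliness: float, access_freq: float) -> float:
    # assumed equal-weight combination of the three indicators
    return (credibility + timeliness + access_freq) / 3.0

def dominance_index(values: list[float]) -> float:
    mean = sum(values) / len(values)
    spread = max(values) - min(values)
    return mean / (1.0 + spread)          # assumed: mean value damped by the spread

def matching_order(group_values: dict[tuple, list[float]]) -> list[tuple]:
    # group label -> matching values of its candidates; sort groups by index, descending
    return sorted(group_values, key=lambda g: dominance_index(group_values[g]), reverse=True)
```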
S6, performing hierarchy-type parsing on the candidate text data in each group set to obtain the hierarchy type of each piece of candidate text data, the hierarchy types being word level, sentence level, document level and other. The parsing proceeds as follows: the number of segmented words of each piece of candidate text data in each group set is counted, where the segmented words are those remaining after stop-word removal, and the word coverage is calculated. Specifically, the maximum number of effective segmented words among the text data in the knowledge base associated with the search platform may be taken as the reference word count, and the word coverage is the ratio of the candidate text data's word count to this reference count.
The parts of speech of consecutive segmented words are compared in their order of arrangement to identify sentence-pattern structures; the number of identified sentence-pattern structures is counted and the sentence-pattern coverage is calculated.
It should be added that, when identifying sentence-pattern structures in the text data, the method does not simply look for common sentence terminators such as the full stop, question mark or exclamation mark, because some content may form a sentence pattern without carrying a terminator. Instead, it relies on the grammar rules a sentence pattern follows, including the subject-predicate structure and the use of modifiers: the subject and predicate usually have characteristic parts of speech, for example the subject corresponds to a noun or pronoun and the predicate is usually a verb. By observing the part-of-speech relationships between the words of the text and judging whether the grammar rules are satisfied, sentence-pattern structures are identified comprehensively and omissions are largely avoided. The sentence-pattern coverage is calculated on the same principle as the word coverage.
The typesetting-format features of each piece of candidate text data in each group set are extracted; since first-line indentation is generally the typesetting feature of a paragraph, the number of typeset paragraphs is counted and the paragraph coverage is calculated, again on the same principle as the word coverage.
The word coverage, sentence-pattern coverage and paragraph coverage of the candidate text data are passed through a parsing model to obtain the hierarchy type of the candidate text data. The model comprises three constraint conditions that compare the word coverage, sentence-pattern coverage and paragraph coverage of the candidate text data with the pre-configured effective word coverage, effective sentence-pattern coverage and effective paragraph coverage, combined by the logical operators AND, OR and NOT respectively.
In the above, the word coverage, sentence-pattern coverage and paragraph coverage are each computed against the corresponding maximum value, so their values lie between 0 and 1; in this case the three effective-coverage thresholds may each be taken as 0.8.
It should be added that word-level text refers to the word sequence obtained by segmenting a text into words. Word-level text is generally composed of individual words or phrases and lacks a complete sentence structure.
Sentence-level text refers to the sentence sequence obtained by segmenting a text into sentences. At this level the text is divided into complete sentences, each expressing a semantically complete unit of meaning. Sentence-level text has a definite grammatical structure and logical relationships, and the sentences interrelate to form a complete paragraph.
Document-level text generally has a complete paragraph structure and greater breadth and coherence of content.
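The parsing model itself is not spelled out in the text available here, so the following decision rule is only one assumed reading of it: a level is assigned when its coverage reaches the pre-configured effective threshold, checked from the document level downward, with 'other' as the fallback.

```python
# Minimal sketch of the hierarchy-type decision under assumed thresholds.
EFFECTIVE_WORD_COV = 0.8        # pre-configured thresholds (the text suggests 0.8)
EFFECTIVE_SENT_COV = 0.8
EFFECTIVE_PARA_COV = 0.8

def hierarchy_type(word_cov: float, sent_cov: float, para_cov: float) -> str:
    if para_cov >= EFFECTIVE_PARA_COV and sent_cov >= EFFECTIVE_SENT_COV:
        return "document"                     # full paragraphs built from full sentences
    if sent_cov >= EFFECTIVE_SENT_COV and para_cov < EFFECTIVE_PARA_COV:
        return "sentence"                     # complete sentences, no paragraph structure
    if word_cov >= EFFECTIVE_WORD_COV and sent_cov < EFFECTIVE_SENT_COV:
        return "word"                         # words/phrases without sentence structure
    return "other"
```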
S7, selecting an adapted similarity algorithm based on the hierarchy type of each piece of candidate text data, implemented as follows: when the hierarchy type of the candidate text data is word level, sentence level or document level, the similarity algorithm corresponding to that hierarchy type is selected as the adapted similarity algorithm for the candidate text data.
In one example, algorithms such as cosine similarity, edit distance and lexical similarity may be used to compute semantic similarity for word-level text.
For sentence-level text, semantic similarity may be computed with cosine similarity, Jaccard similarity or other algorithms.
For document-level text, semantic similarity may be computed with algorithms based on the bag-of-words model, TF-IDF and the like.
When the hierarchy type of the candidate text data is 'other', the similarity algorithms applicable to the word level, the sentence level and the document level are compared, and an algorithm common to all of them is selected as the adapted similarity algorithm for the candidate text data.
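A sketch of the algorithm dispatch in step S7. The concrete choices mirror the examples above; scikit-learn's TfidfVectorizer and cosine_similarity are used as one possible realisation of the cosine/TF-IDF options, not as the method mandated by the patent, and the Jaccard variant assumes whitespace-separated tokens.

```python
# Minimal sketch of the S7 dispatch from hierarchy type to similarity algorithm.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_cosine(query: str, text: str) -> float:
    matrix = TfidfVectorizer().fit_transform([query, text])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0])

def jaccard(query: str, text: str) -> float:
    a, b = set(query.split()), set(text.split())
    return len(a & b) / len(a | b) if a | b else 0.0

ALGORITHMS = {
    "word": tfidf_cosine,        # e.g. cosine similarity for word-level text
    "sentence": jaccard,         # e.g. Jaccard similarity for sentence-level text
    "document": tfidf_cosine,    # e.g. TF-IDF-based similarity for document-level text
}

def adapted_algorithm(hierarchy: str):
    if hierarchy in ALGORITHMS:
        return ALGORITHMS[hierarchy]
    # "other": fall back to an algorithm shared by all three levels
    return tfidf_cosine
```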
S8, calling up the candidate text data of each group set in turn according to the text matching order of the group sets, and matching the called candidate text data against the request question with the adapted similarity algorithm to obtain the semantic similarity of each piece of candidate text data.
S9, selecting the retrieval result for the request question based on the semantic similarity of the candidate text data in each group set, and outputting and displaying the retrieval result in the search output box. Specifically, the retrieval result is selected by the following process: the semantic similarity of each piece of candidate text data in each group set is compared against the preset qualifying semantic similarity, following the text matching order of the group sets; as soon as the semantic similarity of a piece of candidate text data is greater than or equal to the qualifying semantic similarity, that candidate text data is extracted as the retrieval result for the request question.
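Steps S8 and S9 can then be tied together as below, reusing the adapted_algorithm dispatch sketched above; the 0.8 qualifying similarity is illustrative, since the patent leaves the threshold as a preset value.

```python
# Minimal sketch of steps S8 and S9: walk the group sets in matching order,
# score each candidate with its adapted algorithm, and stop as soon as one
# candidate reaches the qualifying similarity.
QUALIFYING_SIMILARITY = 0.8

def retrieve(request_question: str,
             ordered_groups: list[list[str]],
             hierarchy_of: dict[str, str]) -> str | None:
    for group in ordered_groups:                      # groups already in matching order
        for candidate in group:
            algorithm = adapted_algorithm(hierarchy_of[candidate])
            similarity = algorithm(request_question, candidate)
            if similarity >= QUALIFYING_SIMILARITY:   # early exit on the first qualifying text
                return candidate
    return None
```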
The invention parses the hierarchy type of the preprocessed text data and then selects the adapted similarity algorithm according to that hierarchy type, so that semantic matching is performed with an algorithm suited to the text. This makes text matching targeted: an algorithm with better computational effect can be chosen for texts of different hierarchy types, which improves matching accuracy and adaptability, raises matching efficiency and performance, and reduces the consumption of computing resources.
The foregoing merely illustrates the invention; various modifications, additions and substitutions to the described embodiments may be made by those skilled in the art without departing from the invention or exceeding the scope defined by the claims.
Claims (6)
1. The massive text retrieval matching method based on semantic analysis is characterized by comprising the following steps of:
s1, receiving a search instruction, extracting a request problem currently input in a search platform, and extracting a subject term from the request problem;
S2, comparing the subjects of each text data mark stored in the associated knowledge base of the search platform with the subject words of the request questions, and screening text data conforming to the subject words of the request questions from the subjects as candidate text data;
S3, grouping the alternative text data to obtain a plurality of groups of sets, and carrying out common characteristic identification on each group of sets;
S4, preprocessing the alternative text data in the corresponding group sets according to the common characteristics corresponding to the group sets to obtain the processed alternative text data corresponding to the group sets;
S5, obtaining content credibility, uploading time and historical access frequency corresponding to each candidate text data in each group of sets, and thus determining the text matching sequence of each group of sets;
S6, carrying out hierarchical type analysis on the corresponding alternative text data in each group of sets to obtain hierarchical types corresponding to each piece of alternative text data;
s7, selecting an adaptation similarity algorithm based on the level type corresponding to each piece of alternative text data;
S8, sequentially calling the candidate text data in the corresponding group sets according to the text matching sequence of the group sets, and carrying out text matching on the called candidate text data with the request problem by utilizing an adaptation similarity algorithm to obtain the semantic similarity of each candidate text data;
s9, selecting a search result corresponding to the request problem based on the semantic similarity of each candidate text data in each group of sets, and outputting and displaying the search result in a search output frame;
the grouping of the alternative text data is described in the following procedure:
acquiring uploading time corresponding to each piece of alternative text data, and numbering the alternative text data according to the sequence of the uploading time from first to last;
Extracting the attribution field and the text length corresponding to each candidate text data;
performing sentence segmentation on each candidate text data, and counting the number of segmented sentences;
classifying the candidate text data according to the same attribution field, the same text length and the same sentence number to obtain candidate text data corresponding to a plurality of attribution fields, a plurality of text lengths and a plurality of sentence numbers;
Taking the attribution field, the text length and the sentence number as classification labels, counting the attribution field number, the text length number and the sentence number obtained by classification, and sequentially extracting the attribution field, the text length and the sentence number according to the classification labels to form a joint label set, so as to obtain a plurality of joint label sets;
Comparing the numbers of the candidate text data in the same joint label set, and extracting the candidate text data corresponding to the same numbers from the numbers to form a group set;
the implementation process of the common characteristic identification for each group of sets is as follows:
Dividing each candidate text data in each group of sets into words, traversing the divided text by using a stop word list, counting the number of stop words, and further carrying out proportional calculation on the number of stop words and the divided total word number to obtain a stop word occupation ratio corresponding to each candidate text data;
traversing text character strings of each candidate text data in each group of sets, calculating the number of symbols in the text, and calculating the proportion of the number of symbols to the length of the text to obtain symbol occupation values corresponding to each candidate text data;
part-of-speech tagging and root tagging are carried out on the words segmented from the candidate text data in each group set, the segmented words sharing the same root are gathered to obtain the word set of each root, the parts of speech of the words in each word set are compared and duplicates removed to obtain the number of distinct parts of speech in the word set of each root, and from these the part-of-speech diversity of each piece of candidate text data is calculated, the calculation taking, for each root, the number of distinct parts of speech and the number of segmented words in that root's word set, with the root index running from 1 up to the number of roots present among the segmented words of the candidate text data, together with the natural constant;
arranging the stop word occupation ratio, the symbol occupation ratio and the part-of-speech diversity corresponding to each candidate text data in the same group of sets according to the sequence from big to small to obtain label ordering corresponding to each candidate text data;
Comparing the label sequences corresponding to the alternative text data in each group of sets, summarizing the alternative text data corresponding to the same label sequences, and further taking the label sequence with the highest occurrence frequency as a common characteristic corresponding to each group of sets;
The following operation is carried out on the alternative text data in the corresponding group set according to the common characteristics corresponding to the group set:
determining the execution order of the preprocessing flow for each group set according to the common characteristic of that group set, the preprocessing flow comprising stop-word removal, character removal and stemming;
preprocessing the alternative text data in each group set according to the execution sequence of the preprocessing flow;
The hierarchical type parsing is described in the following steps:
Counting the number of word segmentation corresponding to each candidate text data in each group of sets, and calculating word segmentation coverage;
Comparing the parts of speech of the continuous word segmentation according to the arrangement sequence of the word segmentation, thereby identifying sentence pattern structures, further counting the number of the identified sentence pattern structures, and carrying out sentence pattern coverage calculation;
Extracting typesetting format characteristics corresponding to each candidate text data in each group of sets, thereby counting the number of typeset paragraphs and performing paragraph coverage calculation;
the word coverage, sentence-pattern coverage and paragraph coverage of the candidate text data are passed through a parsing model to obtain the hierarchy type of the candidate text data, the model comprising three constraint conditions that compare the word coverage, sentence-pattern coverage and paragraph coverage of the candidate text data with the pre-configured effective word coverage, effective sentence-pattern coverage and effective paragraph coverage, combined by the logical operators AND, OR and NOT respectively.
2. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the implementation process of extracting the subject term of the request problem is as follows:
dividing the request problem into words, and removing stop words from the divided words to obtain a plurality of effective words;
acquiring the part of speech of each effective word, and screening out key word from each effective word according to the part of speech;
and comparing the key word segments, identifying whether repeated key word segments exist, and if so, performing de-duplication processing on the repeated key word segments to obtain the subject word.
3. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the history access frequency obtaining process is as follows:
Comparing the uploading time of each alternative text data in each group of sets, and further taking the furthest uploading time and the current time to form an access period corresponding to each group of sets;
And counting the access frequency corresponding to each candidate text data in the access period corresponding to each group of sets, dividing the access frequency by the sum of the access frequencies corresponding to all the candidate text data in the group of sets, and obtaining the historical access frequency corresponding to each candidate text data.
4. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the text matching sequence of each group set is determined as follows:
matching the field to which each candidate text data in each group set belongs against the preset content credibility configured for each field, so as to obtain the content credibility corresponding to each candidate text data;
comparing the uploading time corresponding to each candidate text data in each group set with the current time, and calculating the content timeliness corresponding to each candidate text data according to a preset expression;
importing the content credibility, content timeliness and historical access frequency corresponding to each candidate text data in each group set into an evaluation expression to obtain the matching value degree corresponding to each candidate text data;
substituting the matching value degrees corresponding to the candidate text data in each group set into a formula to calculate the matching dominance index $\varphi_i$ corresponding to each group set, where $i$ denotes the group set number, $x_{ij}$ denotes the matching value degree corresponding to the $j$-th candidate text data within the $i$-th group set, $j$ denotes the candidate text data number, $j = 1, 2, \ldots, m$, $m$ denotes the number of candidate text data present within the group set, and $x_i^{\max}$ and $x_i^{\min}$ respectively denote the maximum matching value degree and the minimum matching value degree corresponding to the $i$-th group set;
and arranging the group sets in descending order of their matching dominance indexes to obtain the text matching order of the group sets.
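The dominance-index formula in claim 4 is reproduced only as an image in the source, so the sketch below substitutes an assumed index (the group's mean matching value degree rescaled by its max-min spread) purely to show how the group sets would then be ordered.

```python
def matching_dominance_index(match_values):
    """Assumed stand-in for the patented dominance-index formula: the mean
    matching value degree of the group set, rescaled into [0, 1] by the
    spread between its maximum and minimum matching value degrees."""
    x_max, x_min = max(match_values), min(match_values)
    mean = sum(match_values) / len(match_values)
    spread = x_max - x_min
    return (mean - x_min) / spread if spread else mean

def text_matching_order(groups):
    """Arrange group sets in descending order of their dominance index.
    `groups` maps a group-set id to its list of matching value degrees."""
    return sorted(groups, key=lambda gid: matching_dominance_index(groups[gid]),
                  reverse=True)
```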
5. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the adaptation similarity algorithm is selected as follows:
when the hierarchy type corresponding to the candidate text data is word level, sentence level or document level, selecting the similarity algorithm corresponding to that hierarchy type as the adaptation similarity algorithm corresponding to the candidate text data;
and when the hierarchy type corresponding to the candidate text data is other, comparing the similarity algorithms applicable to the word level, the sentence level and the document level, and selecting an overlapping similarity algorithm among them as the adaptation similarity algorithm corresponding to the candidate text data.
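A sketch of the adaptation-algorithm selection in claim 5; the concrete algorithms listed per level are assumptions, since the claim only requires one applicable algorithm per level and, for the "other" type, an algorithm that overlaps all three levels.

```python
# Assumed sets of similarity algorithms applicable to each hierarchy level.
ALGORITHMS_BY_LEVEL = {
    "word level":     {"edit_distance", "jaccard"},
    "sentence level": {"cosine_tfidf", "jaccard"},
    "document level": {"cosine_tfidf", "simhash", "jaccard"},
}

def select_adaptation_algorithm(hierarchy_type: str) -> str:
    if hierarchy_type in ALGORITHMS_BY_LEVEL:
        # any algorithm applicable to that level; pick one deterministically
        return sorted(ALGORITHMS_BY_LEVEL[hierarchy_type])[0]
    # hierarchy type "other": choose an algorithm shared by all three levels
    overlap = set.intersection(*ALGORITHMS_BY_LEVEL.values())
    return sorted(overlap)[0]
```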
6. The semantic analysis-based massive text retrieval matching method as claimed in claim 1, wherein: the retrieval result corresponding to the request question is selected as follows:
comparing, in the text matching order of the group sets, the semantic similarity corresponding to each candidate text data in each group set with the set qualifying semantic similarity, and extracting a candidate text data as the retrieval result corresponding to the request question as soon as its semantic similarity is greater than or equal to the qualifying semantic similarity.
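To illustrate claim 6, the following sketch scans the group sets in their text matching order and returns the first candidate whose semantic similarity reaches the threshold; the 0.8 threshold and the dictionary of precomputed similarities are assumptions.

```python
def retrieve(group_sets_in_order, semantic_similarity, qualifying=0.8):
    """Return the first candidate text whose semantic similarity with the
    request question reaches the qualifying value, scanning group sets in
    their text matching order; returns None if no candidate qualifies."""
    for group_set in group_sets_in_order:
        for candidate in group_set:
            if semantic_similarity[candidate] >= qualifying:
                return candidate        # retrieval result for the request question
    return None
```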
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410386961.4A CN117972025B (en) | 2024-04-01 | 2024-04-01 | Massive text retrieval matching method based on semantic analysis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410386961.4A CN117972025B (en) | 2024-04-01 | 2024-04-01 | Massive text retrieval matching method based on semantic analysis |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117972025A CN117972025A (en) | 2024-05-03 |
CN117972025B true CN117972025B (en) | 2024-06-07 |
Family
ID=90859946
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410386961.4A Active CN117972025B (en) | 2024-04-01 | 2024-04-01 | Massive text retrieval matching method based on semantic analysis |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117972025B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9836533B1 (en) * | 2014-04-07 | 2017-12-05 | Plentyoffish Media Ulc | Apparatus, method and article to effect user interest-based matching in a network environment |
CN109255121A (en) * | 2018-07-27 | 2019-01-22 | 中山大学 | A kind of across language biomedicine class academic paper information recommendation method based on theme class |
CN111104794A (en) * | 2019-12-25 | 2020-05-05 | 同方知网(北京)技术有限公司 | Text similarity matching method based on subject words |
CN114579693A (en) * | 2021-12-02 | 2022-06-03 | 广州趣丸网络科技有限公司 | NLP text security audit multistage retrieval system |
CN116756347A (en) * | 2023-08-21 | 2023-09-15 | 中国标准化研究院 | Semantic information retrieval method based on big data |
CN117763876A (en) * | 2024-02-19 | 2024-03-26 | 浙江大学 | Industrial equipment parallel simulation analysis method and device |
Non-Patent Citations (3)
Title |
---|
Analysis of XML-based heterogeneous data information exchange technology; Ma Chengying et al.; Computer Engineering; 2024-02-29; Vol. 53, No. 2; full text *
Uyghur document similarity calculation and plagiarism detection method based on hierarchical matching; Yasen Aizezi; Aishan Wumaier; Alimujiang Aisha; Application Research of Computers; 2019-06-15 (06); full text *
Research and implementation of an information retrieval system for the automotive field; Wang Jingyun; China Master's Theses Full-text Database; 2024-01-15; full text *
Also Published As
Publication number | Publication date |
---|---|
CN117972025A (en) | 2024-05-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cao et al. | A joint model for word embedding and word morphology | |
US11210468B2 (en) | System and method for comparing plurality of documents | |
JP5338238B2 (en) | Automatic ontology generation using word similarity | |
JP5710581B2 (en) | Question answering apparatus, method, and program | |
CN112395395B (en) | Text keyword extraction method, device, equipment and storage medium | |
CN111460162B (en) | Text classification method and device, terminal equipment and computer readable storage medium | |
JP4534666B2 (en) | Text sentence search device and text sentence search program | |
CN112380866A (en) | Text topic label generation method, terminal device and storage medium | |
CN114997288A (en) | Design resource association method | |
CN110705285A (en) | Government affair text subject word bank construction method, device, server and readable storage medium | |
CN114255067A (en) | Data pricing method and device, electronic equipment and storage medium | |
Tahir et al. | FNG-IE: an improved graph-based method for keyword extraction from scholarly big-data | |
CN109298796A (en) | A kind of Word association method and device | |
CN106294689B (en) | A kind of method and apparatus for selecting to carry out dimensionality reduction based on text category feature | |
Mekki et al. | Tokenization of Tunisian Arabic: a comparison between three Machine Learning models | |
CN110941713B (en) | Self-optimizing financial information block classification method based on topic model | |
CN111046168A (en) | Method, apparatus, electronic device, and medium for generating patent summary information | |
CN117972025B (en) | Massive text retrieval matching method based on semantic analysis | |
CN113449063B (en) | Method and device for constructing document structure information retrieval library | |
CN115587163A (en) | Text classification method and device, electronic equipment and storage medium | |
CN112612867B (en) | News manuscript propagation analysis method, computer readable storage medium and electronic device | |
CN115269846A (en) | Text processing method and device, electronic equipment and storage medium | |
CN115203206A (en) | Data content searching method and device, computer equipment and readable storage medium | |
CN115455975A (en) | Method and device for extracting topic keywords based on multi-model fusion decision | |
Alorini et al. | Machine learning enabled sentiment index estimation using social media big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |