CN114153949B - Word segmentation retrieval method and system - Google Patents

Publication number
CN114153949B
CN114153949B CN202111512996.0A CN202111512996A CN114153949B CN 114153949 B CN114153949 B CN 114153949B CN 202111512996 A CN202111512996 A CN 202111512996A CN 114153949 B CN114153949 B CN 114153949B
Authority
CN
China
Prior art keywords
word
corpus
relevancy
retrieval
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111512996.0A
Other languages
Chinese (zh)
Other versions
CN114153949A (en
Inventor
付雪林
王涛
孙思遥
邓应来
王启超
吴邱思
安重阳
韩啸
张葳
曾明泉
唐海霞
赵鑫
刘成书
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xin Li Fang Technologies Inc
Original Assignee
Beijing Xin Li Fang Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xin Li Fang Technologies Inc
Priority to CN202111512996.0A
Publication of CN114153949A
Application granted
Publication of CN114153949B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Abstract

The application provides a word segmentation retrieval method and system. The method comprises the following steps: receiving a search word input by a user; segmenting the search word into single words (single characters); calculating, for each corpus document, the single word relevancy of each single word; superposing the single word relevancies to generate a relevancy score for the corpus document; and sorting the corpus documents by relevancy score to generate a first retrieval result. On a single-domain information retrieval platform, the search word is thus split character by character, the single word relevancy of each corpus document is calculated, and the corpus documents are sorted by the relevancy score obtained by superposing the single word relevancies. This allows accurate retrieval on a single-domain information retrieval platform characterized by many data structure types, a small user base, many user types, a wide industry span and highly specialized content, without manually curated semantic templates, which lowers the maintenance cost of the platform while still providing its retrieval function.

Description

Word segmentation retrieval method and system
Technical Field
The present application relates to the field of search technologies, and in particular, to a method and a system for performing a word segmentation search.
Background
With the continuous development of Internet technology, various platforms have been set up for instrument information, so that users can retrieve information about instruments through them, including consultation in the vertical field, manufacturers, instruments, communities, data, network lecture halls, instrument currencies, recruitment, consumables, reagents, industrial applications, special subjects, market research and exhibition columns.
In a traditional instrument information platform, grammar dependency relationships are generally configured for user search terms by building semantic templates, so that different orderings of the retrieved content can be generated.
An instrument information platform is characterized by many data structure types, a small user base, many user types, a wide industry span and highly specialized content. To hit accurately during retrieval, the semantic templates must be continuously maintained and updated at very high cost; in particular, as the number of users grows, more and more users search across fields, which further increases the maintenance cost of the platform. The profitability of an instrument information platform is limited by the market it serves and cannot cover this growing cost, so traditional instrument information platforms are insufficiently maintained and their retrieval hit rate declines.
Disclosure of Invention
In order to reduce the retrieval cost of an instrument information platform, the application aims to provide a word segmentation retrieval method and a word segmentation retrieval system.
The above purpose of the present application is achieved by the following technical solutions:
In a first aspect, the present application provides a word segmentation retrieval method applied to a single-domain information retrieval platform, where the method includes:
receiving a search word input by a user;
carrying out single word segmentation on the search word;
respectively calculating the single word relevancy of each corpus document;
superposing the single word relevancy to generate relevancy scores of the corpus documents;
and sorting the corpus documents according to the relevancy scores to generate a first retrieval result.
With this technical scheme, on a single-domain information retrieval platform, the search word is split into single words, the single word relevancy of each corpus document is calculated, and the corpus documents are sorted by the relevancy score obtained by superposing the single word relevancies. This allows accurate retrieval on a single-domain information retrieval platform characterized by many data structure types, a small user base, many user types, a wide industry span and highly specialized content, without manually curated semantic templates, which lowers the maintenance cost of the platform while still providing its retrieval function.
Further, the method further comprises:
and after the corpus documents are sorted according to the relevancy scores, acquiring a preset number of corpus documents according to a ranking sequence to generate a first retrieval result.
With this technical scheme, when there are many data structure types, that is, many column types, the preset number limits how many corpus documents are output at a time, which helps the corpus documents of multiple columns to be displayed together.
Further, the method for respectively calculating the single word relevancy of each corpus document comprises the following steps:
calculating the inverse document frequency idf(q_i) of the single word q_i,
Figure GDA0003905436470000021
calculating the word frequency tf(q_i, D) of the single word q_i in corpus document D, tf(q_i, D) = ((k+1)*tf) / (k*(1-b+b*L) + tf);
calculating the single word relevancy score(D, q_i) of the single word q_i in corpus document D,
Figure GDA0003905436470000022
wherein:
f(q_i, D) = tf(q_i, D) + Norm, Norm being the field length normalization value;
i is a natural number, and N is the total number of corpus documents D;
df_t is the number of corpus documents D in which the single word q_i appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
tf = df_t / N;
Figure GDA0003905436470000023
|D| is the length of corpus document D;
avgDl is the average length of the corpus documents D.
With this technical scheme, the calculation of both the tf value and the idf value of the traditional tf-idf analysis model is modified, and the result is applied to single word segmentation retrieval to meet the needs of single-field retrieval. The method combines the tf-idf analysis model with single word segmentation in a single-field environment, which both simplifies the retrieval approach and improves the accuracy of retrieval hits.
Further, the method further comprises: after the word relevancy is superposed to generate the relevancy score of the corpus document,
and calculating special weighting scores of the corpus documents according to a preset weighting rule, and sequencing the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result.
Further, the preset weighting rule includes a service weighting rule and a relevancy weighting rule.
Further, the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.
Further, the preset ordering rule includes:
respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;
according to preset priority rules and the times of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;
and sorting the columns according to the column scores to generate a second retrieval result.
With this technical scheme, because the number of columns in a single field is limited, sorting the columns with several coexisting models better meets user needs, and the maintenance cost of each model is far lower than it would be if the models were used to sort the large volume of corpus documents.
In a second aspect, the present application provides a word segmentation retrieval system applied to a single domain information retrieval platform, the system comprising:
the receiving module is used for receiving a search term input by a user;
the word segmentation module is used for carrying out word segmentation on the single words of the search word;
the single word calculation module is used for calculating the single word relevancy of each corpus document;
the relevancy calculation module is used for superposing the single word relevancy to generate a relevancy score of the corpus document;
and the output module is used for sequencing the corpus documents according to the relevancy scores to generate a first retrieval result.
Further, the system further comprises:
and the output module is used for obtaining a preset number of corpus documents according to the ranking sequence after the corpus documents are ranked according to the relevancy scores to generate the first retrieval result.
Further, the method for calculating the relevance of the single character by the single character calculation module comprises the following steps:
calculating the inverse document frequency idf(q_i) of the single word q_i,
Figure GDA0003905436470000041
calculating the word frequency tf(q_i, D) of the single word q_i in corpus document D, tf(q_i, D) = ((k+1)*tf) / (k*(1-b+b*L) + tf);
calculating the single word relevancy score(D, q_i) of the single word q_i in corpus document D,
Figure GDA0003905436470000042
wherein:
f(q_i, D) = tf(q_i, D) + Norm, Norm being the field length normalization value;
i is a natural number, and N is the total number of corpus documents D;
df_t is the number of corpus documents D in which the single word q_i appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
tf = df_t / N;
Figure GDA0003905436470000043
|D| is the length of corpus document D;
avgDl is the average length of the corpus documents D.
In summary, the present application includes at least one of the following beneficial technical effects:
1. the maintenance cost of the single-field information retrieval platform is reduced, and the manual maintenance cost is saved in the aspect of retrieval of the corpus documents, so that the maintenance cost of the platform is reduced;
2. the hit rate of platform retrieval is improved: both the new single word relevancy calculation method and the combination of column sorting with corpus document sorting raise the hit rate users obtain when searching on the platform.
Drawings
Fig. 1 is a schematic flow chart of a word segmentation retrieval method according to the present application.
Fig. 2 is a flowchart illustrating a method for generating a second search result according to the present application.
Fig. 3 is a system diagram of one example of the word segmentation retrieval system of the present application.
Fig. 4 is a system diagram of another example of the word segmentation retrieval system of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.
The embodiment of the application provides a word segmentation retrieval method which is applied to a single-field information retrieval platform. The single-domain information retrieval platform refers to a retrieval platform in a limited domain, such as an instrument information platform, a medicine information platform, and the like, and the following embodiments only use the instrument information platform as an example to introduce the scheme of the present application, but do not limit the domain types.
In order to improve the hit rate of user retrieval, the single-field information retrieval platform divides a plurality of columns according to the vertical field of instrument information, and specifically comprises consultation, manufacturers, instruments, communities, data, network lectures, instrument curriculum, recruitment, consumables, reagents, industry applications, special subjects, market research and exhibition columns, wherein a corpus document library is stored in the instrument information platform, and the corpus documents are stored in a partition mode according to column types, so that a user can retrieve the corpus documents contained in the related columns in each column. Meanwhile, the instrument information platform is also provided with a total-station retrieval mode, namely, a user retrieves all corpus documents in the instrument information platform through retrieval words.
Referring to FIG. 1, in one example, a corpus document is retrieved within a single column as follows.
Step S101: receiving a search word input by a user.
Specifically, the search term may be a sentence, a phrase, a single word, or a word composed of multiple single words, and the manner in which the instrument information platform receives the search term input by the user may be various, for example, the search term is input through a touch screen, the search term is input through voice, the search term is input through data transmission, or the search term is input through a keyboard, and accordingly, different search term input manners may be equipped with corresponding input devices, which is not limited uniquely herein.
Step S102: carrying out single word segmentation on the search word. Specifically, single word segmentation treats each character of the search word input by the user as one segmented word. For example, if the search word input by the user is "Qingdao Lubo" (four Chinese characters), it is divided into the four single words "Qing", "Dao", "Lu" and "Bo".
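Step S102 can be illustrated with a minimal Python sketch. The function name `segment_single_chars` is hypothetical, skipping whitespace is an assumption the text does not state, and the Chinese string is assumed to be the original form of the "Qingdao Lubo" example.

```python
def segment_single_chars(search_word: str) -> list[str]:
    """Treat every non-whitespace character of the search word as one segmented word."""
    return [ch for ch in search_word if not ch.isspace()]


# Assumed original characters of the "Qingdao Lubo" example.
single_words = segment_single_chars("青岛路博")
print(single_words)
```

Because no dictionary or semantic template is consulted, this step needs no domain-specific maintenance, which is the cost advantage the application argues for.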
Step S103: respectively calculating the single word relevancy of each corpus document.
The method specifically comprises the following steps:
calculating the inverse document frequency idf(q_i) of the single word q_i,
Figure GDA0003905436470000061
calculating the word frequency tf(q_i, D) of the single word q_i in corpus document D, tf(q_i, D) = ((k+1)*tf) / (k*(1-b+b*L) + tf);
calculating the single word relevancy score(D, q_i) of the single word q_i in corpus document D,
Figure GDA0003905436470000062
wherein:
f(q_i, D) = tf(q_i, D) + Norm, Norm being the field length normalization value;
i is a natural number, and N is the total number of corpus documents D;
df_t is the number of corpus documents D in which the single word q_i appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
tf = df_t / N;
Figure GDA0003905436470000063
|D| is the length of corpus document D;
avgDl is the average length of the corpus documents D.
The tf-idf analysis model is introduced into the calculation of the single word relevancy. In the application, the calculation of both the tf value and the idf value is modified, and the result is applied to single word segmentation retrieval to meet the needs of single-field retrieval. The method combines the tf-idf analysis model with single word segmentation in a single-field environment, which both simplifies the retrieval approach and improves the accuracy of retrieval hits.
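The word frequency term of step S103 can be sketched in Python as follows. Only the formula tf(q_i, D) = ((k+1)*tf) / (k*(1-b+b*L) + tf) is available as text here; the idf and score formulas exist only as figures and are therefore left out. The function names, the default values k = 1.2 and b = 0.75, and the reading of L as the document length divided by the average document length are all assumptions.

```python
def length_ratio(doc_length: float, avg_doc_length: float) -> float:
    """Assumed definition of L: document length relative to the corpus average."""
    if avg_doc_length <= 0:
        raise ValueError("avg_doc_length must be positive")
    return doc_length / avg_doc_length


def saturated_tf(tf: float, L: float, k: float = 1.2, b: float = 0.75) -> float:
    """Word frequency term ((k+1)*tf) / (k*(1-b+b*L) + tf) as given in the text."""
    return ((k + 1) * tf) / (k * (1 - b + b * L) + tf)


# With b = 0 length normalization is disabled, so L has no effect on the result.
short_doc = saturated_tf(0.2, length_ratio(50, 100), b=0.0)
long_doc = saturated_tf(0.2, length_ratio(400, 100), b=0.0)

# With b = 1 normalization is fully enabled, so the longer document scores lower.
short_doc_norm = saturated_tf(0.2, length_ratio(50, 100), b=1.0)
long_doc_norm = saturated_tf(0.2, length_ratio(400, 100), b=1.0)
```

The two pairs of calls show the role of b described above: it interpolates between ignoring document length and scaling fully by it.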
Step S104: superposing the single word relevancies to generate the relevancy score of the corpus document.
Specifically, for a given corpus document, once the single word relevancy of each single word has been calculated, the relevancy score of the search word with respect to that corpus document is obtained by adding up the single word relevancies of the single words that make up the search word. For example, if the search word input by the user is "Qingdao Lubo" and, for a given corpus document, the single word relevancies of "Qing", "Dao", "Lu" and "Bo" are 536.26274, 789.53536, 841.99603 and 486.35306 respectively, then the relevancy score of the corpus document is 536.26274 + 789.53536 + 841.99603 + 486.35306 = 2654.14719.
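The superposition of step S104, together with the sorting and the preset-number cut-off described for the first retrieval result, can be sketched as below. The function names, the document identifiers and the values for `doc_b` are made up for illustration; only the four relevancies of `doc_a` come from the example.

```python
def relevancy_score(single_word_relevancies):
    """Step S104: superpose (sum) the single word relevancies of one document."""
    return sum(single_word_relevancies)


def rank_documents(relevancies_by_doc, top_n=None):
    """Sort documents by relevancy score, highest first, optionally keeping top_n."""
    scored = [(doc_id, relevancy_score(values))
              for doc_id, values in relevancies_by_doc.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


relevancies_by_doc = {
    "doc_a": [536.26274, 789.53536, 841.99603, 486.35306],  # values from the example
    "doc_b": [120.5, 0.0, 310.25, 95.0],                    # made-up values
}
ranking = rank_documents(relevancies_by_doc, top_n=10)
```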
Step S105: sorting the corpus documents according to the relevancy scores to generate a first retrieval result.
Specifically, when a user searches within a single column, a relatively large number of corpus documents can be displayed at once in that column, so the first retrieval result is simply the sorted list of corpus documents. In another example, after the single word relevancies are superposed in step S104 to generate the relevancy score of each corpus document, a special weighting score is calculated for each corpus document according to a preset weighting rule, and the corpus documents are sorted by the sum of the relevancy score and the special weighting score to generate the first retrieval result.
The preset weighting rules perform a second round of scoring on all recalled corpus documents and comprise a service weighting rule and a relevancy weighting rule.
The service weighting rule adds points to recalled results according to weighting rules preset by the user. For example, the position at which the search word appears in the corpus document, the number of times it appears, and the classification level at which it appears each carry different bonus points. This rule is preset by the user and is not described further here.
The relevancy weighting rule adds points to recalled results according to how many single words of the search word are hit consecutively in the corpus document. A corpus document in which all single words are hit consecutively receives the highest bonus; if only some of the single words are hit consecutively, the bonus depends on how many: the more single words hit consecutively, the higher the bonus.
Taking the search word "Qingdao Lubo" as an example: if a corpus document consecutively hits all four single words "Qing", "Dao", "Lu" and "Bo", 10000 points are added to it; if only the three single words "Qing", "Dao" and "Lu" are hit consecutively, 50 points are added; if only the two single words "Qing" and "Dao" are hit consecutively, 30 points are added; and if only one single word is hit, the corpus document is discarded.
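The relevancy weighting rule can be sketched in Python under one interpretive assumption: a "continuous hit" is read as the longest contiguous run of search word characters that appears verbatim in the document text. The function names and the lookup table are hypothetical; the bonus values are the ones from the four-character example, and bonuses for other run lengths are not given in the text, so they default to 0 here.

```python
# Bonus points from the four-character example; other run lengths are unspecified.
BONUS_BY_RUN = {4: 10000, 3: 50, 2: 30}


def longest_consecutive_hit(search_word: str, document: str) -> int:
    """Length of the longest contiguous piece of the search word found in the document."""
    n = len(search_word)
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            if search_word[start:start + length] in document:
                return length
    return 0


def relevancy_bonus(search_word: str, document: str):
    """Return the bonus points, or None when the document should be discarded."""
    run = longest_consecutive_hit(search_word, document)
    if run <= 1:
        return None  # at most one single word hit: document is discarded
    return BONUS_BY_RUN.get(run, 0)  # 0 is a placeholder for unspecified run lengths


full_hit = relevancy_bonus("青岛路博", "青岛路博仪器")
three_hit = relevancy_bonus("青岛路博", "青岛路上")
two_hit = relevancy_bonus("青岛路博", "青岛啤酒")
one_hit = relevancy_bonus("青岛路博", "博物馆")
```

The large gap between 10000 points and 50 points means that an exact phrase match dominates the summed single word relevancies, so exact matches still rank first despite character-level segmentation.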
Further, when the user searches in the total-station retrieval mode, the columns are sorted according to a preset sorting rule to generate a second retrieval result, and the first retrieval result and the second retrieval result are combined into a final retrieval result.
Referring to fig. 2, the method for sorting the columns according to the preset sorting rule to generate the second search result includes:
step S201: respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;
step S202: according to a preset priority rule and the state of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;
step S203: and sorting the columns according to the column scores to generate a second retrieval result.
The search term related column model uses a word vector model to calculate the similarity between the search word and the historical search data of each column and generates a column ordering from it. This is a known retrieval model and is not described in detail.
The user preference column model generates a column ordering related to the search term from the historical behavior of the current user, such as search behavior, click behavior, comments and likes, to determine whether the user's preferred column is instruments, information, data and so on. Specifically, the model analyzes the current user's historical behavior and favors the columns the user has entered most often and stayed in longest: the number of entries into each column and the dwell time in each column are counted and jointly determine the score. The calculation rule is: column preference score = 50 * (number of visits to this column) / (number of visits to all columns) + 50 * (browsing time in this column) / (browsing time in all columns). The resulting score is used to sort the columns.
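The column preference calculation rule can be written out as a short Python sketch. The function name, the dictionary-based inputs, the use of seconds as the time unit and the handling of empty histories are assumptions; the two weights of 50 come from the rule above.

```python
def column_preference_scores(visits: dict[str, int],
                             seconds: dict[str, float]) -> dict[str, float]:
    """50 * share of visits + 50 * share of browsing time, per column."""
    total_visits = sum(visits.values())
    total_seconds = sum(seconds.values())
    scores = {}
    for column in visits:
        visit_share = visits[column] / total_visits if total_visits else 0.0
        time_share = seconds.get(column, 0.0) / total_seconds if total_seconds else 0.0
        scores[column] = 50 * visit_share + 50 * time_share
    return scores


# Hypothetical history of one user.
scores = column_preference_scores(
    visits={"instruments": 6, "data": 4},
    seconds={"instruments": 300.0, "data": 100.0},
)
ordered_columns = sorted(scores, key=scores.get, reverse=True)
```

Since both shares sum to 1 across columns, the scores of all columns add up to 100, which makes them easy to compare between users.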
The search term click preference column model is as follows: and generating column sequencing of the search term through the clicking behaviors of all users in all columns under the same search term of the platform. The click behavior may be a number of clicks or a click time interval.
It should be noted that when the columns are sorted by each of the user preference column model, the search term related column model, the search term click preference column model and the grammar dependency relationship model, a certain number of top-ranked columns is taken as the output result of each model. The state of a column in these models, referred to in the preset priority rule, indicates whether the column is present in or absent from a model's output result. When a column is present in a model's output result, the points assigned to that model are added to the column score. For example, a column receives 2 points for appearing in the output of the user preference column model, 4 points for the search term related column model, 5 points for the search term click preference column model, and 10 points for the grammar dependency relationship model.
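Steps S202 and S203 can be sketched as follows, using the example point values of 2, 4, 5 and 10. The model keys, the function name, the sample model outputs and the alphabetical tie-break are assumptions made for illustration.

```python
# Points per model, taken from the example above; the key names are hypothetical.
MODEL_POINTS = {
    "user_preference": 2,
    "search_term_related": 4,
    "search_term_click_preference": 5,
    "grammar_dependency": 10,
}


def score_columns(model_outputs: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Add each model's points to every column present in that model's output."""
    scores: dict[str, int] = {}
    for model, columns in model_outputs.items():
        for column in set(columns):  # a column counts once per model
            scores[column] = scores.get(column, 0) + MODEL_POINTS[model]
    # Highest score first; ties broken by column name (an assumption).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


second_result = score_columns({
    "user_preference": ["instruments", "data"],
    "search_term_related": ["instruments", "manufacturers"],
    "search_term_click_preference": ["manufacturers"],
    "grammar_dependency": ["instruments"],
})
```

Each model only has to report which columns it considers relevant, so the models can be maintained independently and the combination step stays a simple weighted vote.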
In another preferred example, referring to fig. 3, the present application further discloses a word segmentation retrieval system applied to a single domain information retrieval platform, the system including:
the receiving module is used for receiving a search term input by a user;
the word segmentation module is used for carrying out word segmentation on the single words of the search word;
the single word calculation module is used for calculating the single word relevancy of each corpus document;
the relevancy calculation module is used for superposing the single word relevancy to generate relevancy scores of the corpus documents;
and the output module is used for sorting the corpus documents according to the relevancy scores to generate a first retrieval result.
And the output module is used for obtaining a preset number of corpus documents according to the ranking sequence after the corpus documents are ranked according to the relevancy scores to generate the first retrieval result.
The method for calculating the single word relevancy by the single word calculation module comprises the following steps:
calculating the inverse document frequency idf(q_i) of the single word q_i,
Figure GDA0003905436470000091
calculating the word frequency tf(q_i, D) of the single word q_i in corpus document D, tf(q_i, D) = ((k+1)*tf) / (k*(1-b+b*L) + tf);
calculating the single word relevancy score(D, q_i) of the single word q_i in corpus document D,
Figure GDA0003905436470000092
wherein:
f(q_i, D) = tf(q_i, D) + Norm, Norm being the field length normalization value;
i is a natural number, and N is the total number of corpus documents D;
df_t is the number of corpus documents D in which the single word q_i appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
tf = df_t / N;
Figure GDA0003905436470000093
|D| is the length of corpus document D;
avgDl is the average length of the corpus documents D.
The word segmentation retrieval system further comprises a column sorting module, wherein the column sorting module is used for sorting the columns respectively through a user preference column model, a retrieval word related column model, a retrieval word click preference column model and a grammar dependence relation model, grading each column respectively according to a preset priority rule, and finally sorting the columns according to the grading result to generate a second retrieval result.
A preset number of columns is selected from the second retrieval result in descending order of column score. Within each selected column, corpus documents are selected from the first retrieval result in descending order of ranking, according to the corpus document ordering rule and the number of corpus documents set for each column.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
When a user needs to search, search words are input through an instrument information network terminal, a preset number of columns are selected through a column sorting module, a preset number of corpus documents are selected through a receiving module, a word segmentation module, a single word calculation module, a relevance calculation module and an output module, and finally the selected columns and the corpus documents are returned to an instrument information network.
Furthermore, the application does not restrict where the receiving module, the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are located. Referring to fig. 3, in one example, the receiving module is located in the instrument information network terminal, and the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are all located in the instrument information network platform server. Referring to fig. 4, in another example, the receiving module, the word segmentation module and the column sorting module are all located in the instrument information network terminal, so that the terminal's processor shares the data processing load of the platform server, and the single word calculation module, the relevancy calculation module and the output module are all located in the instrument information network platform server.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative. In addition, the shown or discussed couplings or direct couplings or data communication connections between each other may be through some interfaces.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is also to be understood that, in various embodiments of the present application, unless otherwise specified or conflicting in logic, terms and/or descriptions between different embodiments are consistent and may be mutually referenced, and technical features in different embodiments may be combined to form a new embodiment according to their inherent logical relationship.
The above embodiments are all preferred embodiments of the present application and do not limit its protection scope; therefore, all equivalent changes made according to the structure, shape, and principle of the present application shall be covered by the protection scope of the present application.

Claims (6)

1. A word segmentation retrieval method, characterized in that the method is applied to a single-domain information retrieval platform and comprises the following steps:
receiving a search word input by a user;
carrying out single character word segmentation on the search word;
respectively calculating the single word relevancy of each corpus document;
superposing the single word relevancy to generate relevancy scores of the corpus documents;
calculating special weighting scores of the corpus documents according to a preset weighting rule, and sequencing the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result;
the preset weighting rule comprises the following steps: a business weighting rule and a relevancy weighting rule;
the service weighting rule comprises: the positions of the search terms appearing in the corpus documents, the times of the search terms appearing and the positions of the search terms appearing in different classification levels;
the relevancy weighting rule comprises the following steps: scoring the recall result according to the number of continuously hit search words in the corpus documents;
the method for respectively calculating the single word relevancy of each corpus document comprises the following steps:
calculating the inverse document frequency idf(q_i) of the single word q_i:
Figure FDA0003868444850000011
calculating the word frequency tf(q_i, d) of the single word q_i in corpus document d: tf(q_i, d) = ((k + 1) * tf) / (k * (1 - b + b * L) + tf);
calculating the single word relevancy of the single word q_i in corpus document d:
Figure FDA0003868444850000012
wherein:
f(q_i, d) = tf(q_i, d) + Norm, where Norm is the field length normalization value;
i is a natural number, and N is the total number of corpus documents;
df_t is the number of corpus documents in which the single word q_i appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0, and full normalization is enabled when b is 1;
tf = df_t / N;
Figure FDA0003868444850000013
dl is the length of corpus document d;
avgdl is the average length of the corpus documents.
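The calculation steps of claim 1 can be illustrated with a minimal Python sketch. Only the saturation formula tf(q_i, d) = ((k + 1) * tf) / (k * (1 - b + b * L) + tf) is taken from the claim text. The idf formula, the definition of L, and the way idf and tf are combined are available only as formula images, so the sketch assumes the common BM25 forms: idf = ln(1 + (N - df_t + 0.5) / (df_t + 0.5)), L = dl / avgdl, and relevancy = idf * tf. It also passes the in-document character count as the raw tf, which is an assumption and differs from the definition tf = df_t / N given in the claim. All function names and the default values of k and b are hypothetical.

```python
import math


def segment_single_chars(query: str) -> list[str]:
    """Single character word segmentation: one token per non-space character."""
    return [ch for ch in query if not ch.isspace()]


def saturated_tf(tf: float, length_ratio: float, k: float = 1.2, b: float = 0.75) -> float:
    """tf(q_i, d) = ((k + 1) * tf) / (k * (1 - b + b * L) + tf), as written in the claim."""
    return ((k + 1) * tf) / (k * (1 - b + b * length_ratio) + tf)


def assumed_idf(n_docs: int, df: int) -> float:
    """ASSUMPTION: BM25-style idf; the claimed idf formula is only given as an image."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


def relevancy_score(query: str, doc: str, corpus: list[str],
                    k: float = 1.2, b: float = 0.75) -> float:
    """Superpose the single word relevancies of all query characters for one document."""
    avg_dl = sum(len(d) for d in corpus) / len(corpus)
    length_ratio = len(doc) / avg_dl  # ASSUMPTION: L = dl / avgdl
    score = 0.0
    for ch in segment_single_chars(query):
        df = sum(1 for d in corpus if ch in d)  # documents containing the character
        raw_tf = doc.count(ch)  # ASSUMPTION: in-document count, not df_t / N
        score += assumed_idf(len(corpus), df) * saturated_tf(raw_tf, length_ratio, k, b)
    return score
```

With b = 0 the length ratio drops out of the denominator, and with b = 1 longer documents are penalized in full, which matches the described role of b.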
2. The word segmentation retrieval method according to claim 1, wherein the method further comprises:
and after the corpus documents are sorted according to the sum of the relevancy score and the special weighting score, acquiring a preset number of corpus documents according to a ranking sequence to generate a first retrieval result.
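The ranking and truncation described in claim 2 can be sketched as follows. As an illustration of the relevancy weighting rule, the special weighting score here is based on the longest run of consecutive search-word characters that appears contiguously in a document. This is only one possible reading of "the number of continuously hit search words"; the business weighting rule is omitted, and the names `longest_consecutive_hit`, `first_retrieval_result`, and `hit_weight` are hypothetical.

```python
def longest_consecutive_hit(query: str, doc: str) -> int:
    """Length of the longest contiguous piece of the query that occurs in the document."""
    best = 0
    n = len(query)
    for start in range(n):
        # Only try pieces longer than the best one found so far.
        for end in range(start + best + 1, n + 1):
            if query[start:end] in doc:
                best = end - start
            else:
                break  # a longer piece from this start cannot match either
    return best


def first_retrieval_result(
    query: str,
    docs: dict[str, str],
    relevancy: dict[str, float],
    top_n: int,
    hit_weight: float = 1.0,
) -> list[str]:
    """Sort documents by relevancy score + special weighting score and keep the top_n."""
    totals: dict[str, float] = {}
    for doc_id, text in docs.items():
        # ASSUMPTION: special weighting = hit_weight * longest consecutive hit.
        special = hit_weight * longest_consecutive_hit(query, text)
        totals[doc_id] = relevancy.get(doc_id, 0.0) + special
    ranked = sorted(totals, key=lambda doc_id: (-totals[doc_id], doc_id))
    return ranked[:top_n]
```

Ties are broken by document id so that the output order is deterministic.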
3. The word segmentation retrieval method according to claim 1, wherein the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.
4. The word segmentation retrieval method according to claim 3, wherein the preset ordering rule comprises:
respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;
according to a preset priority rule and the times of columns appearing in a user preference column model, a retrieval word related column model, a retrieval word click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;
and sorting the columns according to the column scores to generate a second retrieval result.
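A minimal Python sketch of the column scoring in claim 4 follows. It assumes that the "preset priority rule" can be expressed as one numeric weight per model and that a column's score is the sum of the weights of the models in whose output it appears. Both the weighting scheme and the names (`score_columns`, `model_priority`) are hypothetical.

```python
def score_columns(
    model_columns: dict[str, list[str]],
    model_priority: dict[str, float],
) -> list[tuple[str, float]]:
    """Score each column by how many models return it, weighted by model priority.

    model_columns:  model name -> columns returned by that model.
    model_priority: model name -> weight (ASSUMPTION: defaults to 1.0 when missing).
    """
    scores: dict[str, float] = {}
    for model, columns in model_columns.items():
        weight = model_priority.get(model, 1.0)
        for column in set(columns):  # count each model at most once per column
            scores[column] = scores.get(column, 0.0) + weight
    # Descending column score; ties are broken by column name.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

For example, a column returned by three of the four models outranks a column returned by only one, unless the priority weights say otherwise.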
5. A word segmentation retrieval system, characterized in that the system is applied to a single-domain information retrieval platform and comprises:
the receiving module is used for receiving a search word input by a user;
the word segmentation module is used for carrying out single character word segmentation on the search word;
the single word calculation module is used for respectively calculating the single word relevancy of each corpus document;
the relevancy calculation module is used for superposing the single word relevancy to generate relevancy scores of the corpus documents;
the output module is used for calculating special weighting scores of the corpus documents according to a preset weighting rule, and sorting the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result;
the preset weighting rule comprises the following steps: a business weighting rule and a relevancy weighting rule;
the service weighting rule comprises: the positions of the search terms appearing in the corpus documents, the times of the search terms appearing and the positions of the search terms appearing in different classification levels;
the relevancy weighting rule comprises the following steps: scoring the recall result according to the number of continuously hit search words in the corpus documents;
the method for calculating the single word relevancy by the single word calculation module comprises the following steps:
calculating the inverse document frequency idf(q_i) of the single word q_i:
Figure FDA0003868444850000021
calculating the word frequency tf(q_i, d) of the single word q_i in corpus document d: tf(q_i, d) = ((k + 1) * tf) / (k * (1 - b + b * L) + tf);
calculating the single word relevancy of the single word q_i in corpus document d:
Figure FDA0003868444850000031
wherein:
f(q_i, d) = tf(q_i, d) + Norm, where Norm is the field length normalization value;
i is a natural number, and N is the total number of corpus documents;
df_t is the number of corpus documents in which the single word q_i appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0, and full normalization is enabled when b is 1;
tf = df_t / N;
Figure FDA0003868444850000032
dl is the length of corpus document d;
avgdl is the average length of the corpus documents.
6. The word segmentation retrieval system as claimed in claim 5, wherein the system further comprises:
and the output module is used for obtaining a preset number of corpus documents according to the ranking sequence after the corpus documents are sorted according to the sum of the relevancy score and the special weighting score, and generating the first retrieval result.
CN202111512996.0A 2021-12-11 2021-12-11 Word segmentation retrieval method and system Active CN114153949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111512996.0A CN114153949B (en) 2021-12-11 2021-12-11 Word segmentation retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111512996.0A CN114153949B (en) 2021-12-11 2021-12-11 Word segmentation retrieval method and system

Publications (2)

Publication Number Publication Date
CN114153949A CN114153949A (en) 2022-03-08
CN114153949B true CN114153949B (en) 2022-12-13

Family

ID=80450574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111512996.0A Active CN114153949B (en) 2021-12-11 2021-12-11 Word segmentation retrieval method and system

Country Status (1)

Country Link
CN (1) CN114153949B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0520359A (en) * 1991-07-11 1993-01-29 Nippon Telegr & Teleph Corp <Ntt> Information retrieval system
CN102567376A (en) * 2010-12-16 2012-07-11 中国移动通信集团浙江有限公司 Method and device for recommending personalized search results
CN111538830A (en) * 2020-04-28 2020-08-14 清华大学 French retrieval method, French retrieval device, computer equipment and storage medium
CN113112164A (en) * 2021-04-19 2021-07-13 特变电工股份有限公司新疆变压器厂 Transformer fault diagnosis method and device based on knowledge graph and electronic equipment
CN113486140A (en) * 2021-07-27 2021-10-08 平安国际智慧城市科技股份有限公司 Knowledge question-answer matching method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106095848A (en) * 2016-06-02 2016-11-09 北京奇虎科技有限公司 The method of text association, terminal unit and corresponding server unit
CN111931488B (en) * 2020-09-24 2024-04-05 北京百度网讯科技有限公司 Method, device, electronic equipment and medium for verifying accuracy of judgment result

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
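Research and Implementation of a Lucene-Based Search Ranking Algorithm; Liu Qiongru (刘琼茹); Wireless Internet Technology (《无线互联科技》); 2017-02-25 (No. 04); pp. 143-146 *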

Also Published As

Publication number Publication date
CN114153949A (en) 2022-03-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant