CN114153949B

CN114153949B - Word segmentation retrieval method and system

Info

Publication number: CN114153949B
Application number: CN202111512996.0A
Authority: CN
Inventors: 付雪林; 王涛; 孙思遥; 邓应来; 王启超; 吴邱思; 安重阳; 韩啸; 张葳; 曾明泉; 唐海霞; 赵鑫; 刘成书
Original assignee: Beijing Xin Li Fang Technologies Inc
Current assignee: Beijing Xin Li Fang Technologies Inc
Priority date: 2021-12-11
Filing date: 2021-12-11
Publication date: 2022-12-13
Anticipated expiration: 2041-12-11
Also published as: CN114153949A

Abstract

The application provides a word segmentation retrieval method and system. The method comprises the following steps: receiving a search word input by a user; performing single word segmentation on the search word; calculating the single word relevancy of each single word for each corpus document; superposing the single word relevancies to generate a relevancy score for each corpus document; and sorting the corpus documents according to the relevancy scores to generate a first retrieval result. In a single-domain information retrieval platform, the search word is split by single word segmentation, the single word relevancy is calculated for each corpus document, and the corpus documents are sorted by the relevancy scores obtained by superposing the single word relevancies. This retrieval process gives accurate retrieval on a single-domain information retrieval platform that has many data structure types, a small user base, many user types, a wide industry span and highly specialized content. It does not require semantic templates to be compiled manually, which reduces the maintenance cost of the platform while still providing its retrieval function.

Description

Word segmentation retrieval method and system

Technical Field

The present application relates to the field of search technologies, and in particular, to a method and a system for performing a word segmentation search.

Background

With the continuing development of internet technology, various platforms have been set up for instrument information, so that users can retrieve all kinds of information about instruments through them, including the vertical-field columns of consultation, manufacturers, instruments, communities, data, network lecture halls, instrument currencies, recruitment, consumables, reagents, industrial applications, special subjects, market research and exhibitions.

In a traditional instrument information platform, grammar dependency relationships are generally configured for user search terms by building semantic templates, so as to generate different orderings of the retrieved content.

The instrument information platform has many data structure types, a small user base, many user types, a wide industry span and highly specialized content. To achieve accurate hits during retrieval, the semantic templates must be continuously maintained and updated at very high cost. In particular, as the number of users grows, more and more users search across fields, which further increases the maintenance cost of the platform. The profitability of an instrument information platform is limited by the market it serves and cannot keep up with these rising costs, so traditional instrument information platforms are poorly maintained and their retrieval hit rate declines.

Disclosure of Invention

In order to reduce the retrieval cost of an instrument information platform, the application aims to provide a word segmentation retrieval method and a word segmentation retrieval system.

The above purpose of the present application is achieved by the following technical solutions:

in a first aspect, the present application provides a word segmentation search method applied to a single-domain information search platform, where the method includes:

receiving a search word input by a user;

carrying out single word segmentation on the search word;

respectively calculating the single word relevancy of each corpus document;

superposing the single word relevancy to generate relevancy scores of the corpus documents;

and sorting the corpus documents according to the relevancy scores to generate a first retrieval result.

By adopting the technical scheme, in the single-field information retrieval platform, the search word is split by single word segmentation, the single word relevancy is calculated for each corpus document, and the corpus documents are sorted by the relevancy scores obtained by superposing the single word relevancies. This retrieval process gives accurate retrieval on a single-domain information retrieval platform that has many data structure types, a small user base, many user types, a wide industry span and highly specialized content. It does not require semantic templates to be compiled manually, which reduces the maintenance cost of the platform while still providing its retrieval function.

Further, the method further comprises:

and after the corpus documents are sorted according to the relevancy scores, acquiring a preset number of corpus documents according to a ranking sequence to generate a first retrieval result.

By adopting the technical scheme, when there are multiple data structure types, namely multiple column types, limiting the result to the preset number reduces the number of corpus documents output at one time and helps display corpus documents from several columns at the same time.
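The sorting and truncation step can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `first_retrieval_result`, the document IDs and the score values are all hypothetical.

```python
def first_retrieval_result(relevancy_scores: dict[str, float],
                           preset_number: int) -> list[str]:
    """Return the IDs of the top `preset_number` documents by relevancy score."""
    # Sort document IDs from highest to lowest relevancy score.
    ranked = sorted(relevancy_scores, key=relevancy_scores.get, reverse=True)
    # Keep only the preset number of documents (hypothetical parameter name).
    return ranked[:preset_number]


# Hypothetical scores for four corpus documents.
scores = {"doc_a": 12.5, "doc_b": 48.0, "doc_c": 30.2, "doc_d": 7.1}
top_two = first_retrieval_result(scores, preset_number=2)
print(top_two)
```

If the preset number exceeds the number of recalled documents, the slice simply returns all of them in ranked order.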

Further, the method for respectively calculating the single word relevancy of each corpus document comprises the following steps:

calculating the inverse document frequency $\mathrm{idf}(q_i)$ of the single word $q_i$;

calculating the word frequency $\mathrm{tf}(q_i, D)$ of the single word $q_i$ in corpus document $D$, where $\mathrm{tf}(q_i, D) = \dfrac{(k+1)\cdot \mathrm{tf}}{k\cdot(1 - b + b\cdot L) + \mathrm{tf}}$;

calculating the single word relevancy $\mathrm{score}(D, q_i)$ of the single word $q_i$ in corpus document $D$;

wherein:

$f(q_i, D) = \mathrm{tf}(q_i, D) + \mathrm{Norm}$, Norm being the field length normalization value;

$i$ is a natural number, and $N$ is the total number of corpus documents $D$;

$\mathrm{df}_t$ is the number of corpus documents $D$ in which the single word $q_i$ appears;

$k$ is a constant;

$b$ is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when $b$ is 0, and full normalization is enabled when $b$ is 1;

$\mathrm{tf} = \mathrm{df}_t / N$;

$|D|$ is the length of corpus document $D$;

$\mathrm{avgDl}$ is the average length of the corpus documents $D$.

By adopting the technical scheme, on the basis of the traditional tf-idf analysis model, the calculation of the tf value and of the idf value is each improved and applied to single word segmentation retrieval so as to meet the requirement of single-field retrieval. The method is a technical combination of the tf-idf analysis model and single word segmentation in a single-field environment, which both simplifies the retrieval mode and improves the accuracy of retrieval hits.
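As an illustration of the word frequency formula tf(q_i, D) = ((k+1)*tf)/(k*(1-b+b*L)+tf) and of the role of b, here is a minimal sketch. It makes assumptions the text does not state: L is taken to be |D| divided by the average document length, and the defaults k = 1.2 and b = 0.75 are borrowed from common BM25 practice, not from the application. The idf and final score formulas are not reproduced in the text, so they are not sketched.

```python
def saturated_tf(tf: float, doc_len: float, avg_doc_len: float,
                 k: float = 1.2, b: float = 0.75) -> float:
    """Evaluate ((k + 1) * tf) / (k * (1 - b + b * L) + tf).

    Assumption: L = |D| / avgDl (document length relative to the average).
    k and b defaults are illustrative, not values given in the application.
    """
    length_ratio = doc_len / avg_doc_len  # assumed meaning of L
    return ((k + 1) * tf) / (k * (1 - b + b * length_ratio) + tf)


# With b = 0 the document length has no effect (normalization disabled).
no_norm_short = saturated_tf(2.0, doc_len=100, avg_doc_len=50, b=0.0)
no_norm_long = saturated_tf(2.0, doc_len=500, avg_doc_len=50, b=0.0)

# With b = 1 a longer-than-average document is penalized (full normalization).
full_norm_avg = saturated_tf(2.0, doc_len=50, avg_doc_len=50, b=1.0)
full_norm_long = saturated_tf(2.0, doc_len=100, avg_doc_len=50, b=1.0)
```

Under these assumptions the value grows with tf but is bounded above by k + 1, which is the usual reason for choosing this saturating form.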

Further, the method further comprises: after the word relevancy is superposed to generate the relevancy score of the corpus document,

and calculating special weighting scores of the corpus documents according to a preset weighting rule, and sequencing the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result.

Further, the preset weighting rule includes a business weighting rule and a relevancy weighting rule.

Further, the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.

Further, the preset ordering rule includes:

respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;

according to preset priority rules and the times of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;

and sorting the columns according to the column scores to generate a second retrieval result.

By adopting the technical scheme, the number of columns in a single field is limited, and sorting the columns with several coexisting models better meets user needs. At the same time, the maintenance cost of each model is far lower than it would be if the model were used to sort the large volume of corpus documents.

In a second aspect, the present application provides a word segmentation retrieval system applied to a single domain information retrieval platform, the system comprising:

the receiving module is used for receiving a search term input by a user;

the word segmentation module is used for performing single word segmentation on the search word;

the single word calculation module is used for calculating the single word relevancy of each corpus document;

the relevancy calculation module is used for superposing the single word relevancy to generate a relevancy score of the corpus document;

and the output module is used for sequencing the corpus documents according to the relevancy scores to generate a first retrieval result.

Further, the system further comprises:

and the output module is used for obtaining a preset number of corpus documents according to the ranking sequence after the corpus documents are ranked according to the relevancy scores to generate the first retrieval result.

Further, the method for calculating the relevance of the single character by the single character calculation module comprises the following steps:

calculating the inverse document frequency $\mathrm{idf}(q_i)$ of the single word $q_i$;

calculating the single word relevancy $\mathrm{score}(D, q_i)$ of the single word $q_i$ in corpus document $D$;

wherein:

$i$ is a natural number, and $N$ is the total number of corpus documents $D$;

$k$ is a constant;

$\mathrm{tf} = \mathrm{df}_t / N$;

$|D|$ is the length of corpus document $D$;

$\mathrm{avgDl}$ is the average length of the corpus documents $D$.

In summary, the present application includes at least one of the following beneficial technical effects:

1. the maintenance cost of the single-field information retrieval platform is reduced, and the manual maintenance cost is saved in the aspect of retrieval of the corpus documents, so that the maintenance cost of the platform is reduced;

2. the hit rate of platform retrieval is improved: both the new single word relevancy calculation method and the combination of column sorting with corpus document sorting improve the hit rate when users search on the platform.

Drawings

Fig. 1 is a schematic flow chart of a word segmentation retrieval method according to the present application.

Fig. 2 is a flowchart illustrating a method for generating a second search result according to the present application.

Fig. 3 is a system diagram of one example of the word segmentation retrieval system of the present application.

Fig. 4 is a system diagram of another example of the word segmentation retrieval system of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.

The embodiment of the application provides a word segmentation retrieval method which is applied to a single-field information retrieval platform. The single-domain information retrieval platform refers to a retrieval platform in a limited domain, such as an instrument information platform, a medicine information platform, and the like, and the following embodiments only use the instrument information platform as an example to introduce the scheme of the present application, but do not limit the domain types.

In order to improve the hit rate of user retrieval, the single-field information retrieval platform is divided into a plurality of columns according to the vertical field of instrument information, specifically consultation, manufacturers, instruments, communities, data, network lectures, instrument curriculum, recruitment, consumables, reagents, industry applications, special subjects, market research and exhibition columns. A corpus document library is stored in the instrument information platform, and the corpus documents are stored in partitions by column type, so that within each column a user can retrieve the corpus documents belonging to that column. The instrument information platform also provides a total-station retrieval mode, in which a user retrieves all corpus documents in the platform with a search word.

Referring to FIG. 1, in one example, a corpus document is retrieved within a single column as follows.

Step S101: receiving a search word input by a user.

Specifically, the search word may be a sentence, a phrase, a single word, or a word composed of multiple single words. The instrument information platform may receive the search word in various ways, for example through a touch screen, voice, data transmission, or a keyboard. Each input manner may be equipped with a corresponding input device, which is not uniquely limited herein.

Step S102: performing single word segmentation on the search word. Specifically, single word segmentation means that each single word (character) in the search word input by the user is treated as one segmented word. For example, if the search word input by the user is "Qingdao Lubo", the search word is divided into the four single words "Qing", "dao", "Lu" and "Bo".
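A minimal sketch of single word segmentation, assuming the search word is a Unicode string in which each Chinese character is one single word. The function name and the sample query "色谱仪" are hypothetical and chosen only for illustration; skipping whitespace is also an assumption.

```python
def segment_single_words(search_word: str) -> list[str]:
    """Treat every non-whitespace character of the search word as one token."""
    return [ch for ch in search_word if not ch.isspace()]


# Hypothetical three-character query; each character becomes one token.
tokens = segment_single_words("色谱仪")
print(tokens)
```

Because no dictionary or semantic template is consulted, the segmentation itself needs no maintenance, which is the cost advantage the application relies on.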

Step S103: respectively calculating the single word relevancy of each corpus document.

The method specifically comprises the following steps:

calculating the inverse document frequency $\mathrm{idf}(q_i)$ of the single word $q_i$;

calculating the single word relevancy $\mathrm{score}(D, q_i)$ of the single word $q_i$ in corpus document $D$;

wherein:

$f(q_i, D) = \mathrm{tf}(q_i, D) + \mathrm{Norm}$, Norm being the field length normalization value;

$i$ is a natural number, and $N$ is the total number of corpus documents $D$;

$k$ is a constant;

$\mathrm{tf} = \mathrm{df}_t / N$;

$|D|$ is the length of corpus document $D$;

$\mathrm{avgDl}$ is the average length of the corpus documents $D$.

The tf-idf analysis model is introduced into the calculation of the single word relevancy. In the application, the calculation of the tf value and of the idf value is each improved and applied to single word segmentation retrieval so as to meet the requirement of single-field retrieval. The method is a technical combination of the tf-idf analysis model and single word segmentation in a single-field environment, which both simplifies the retrieval mode and improves the accuracy of retrieval hits.

Step S104: superposing the single word relevancies to generate the relevancy score of the corpus document.

Specifically, for a given corpus document, once the single word relevancy of each single word has been calculated, the relevancy score of the search word for that corpus document is obtained by adding up the single word relevancies of the single words that make up the search word. For example, with the search word "Qingdao Lubo", if the single word relevancies of "Qing", "dao", "Lu" and "Bo" for a corpus document are 536.26274, 789.53536, 841.99603 and 486.35306 respectively, the relevancy score of that corpus document is 536.26274 + 789.53536 + 841.99603 + 486.35306 = 2654.14719.
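The superposition in step S104 is a plain sum. The short sketch below reproduces the arithmetic of the example using the four single word relevancy values quoted in the text; the variable names are illustrative.

```python
# Single word relevancy values for one corpus document, as given in the example.
single_word_relevancy = [536.26274, 789.53536, 841.99603, 486.35306]

# Step S104: the relevancy score is the sum of the single word relevancies.
relevancy_score = sum(single_word_relevancy)
print(round(relevancy_score, 5))  # 2654.14719
```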

Step S105: sorting the corpus documents according to the relevancy scores to generate a first retrieval result.

Specifically, when a user searches within a single column, a relatively large number of corpus documents can be displayed at once in that column, so the first retrieval result is simply the sorted list of corpus documents. In another example, after the single word relevancies are superposed in step S104 to generate the relevancy scores of the corpus documents, a special weighting score is calculated for each corpus document according to a preset weighting rule, and the corpus documents are sorted by the sum of the relevancy score and the special weighting score to generate the first retrieval result.

The preset weighting rules are used to score all recalled corpus documents a second time, and comprise a business weighting rule and a relevancy weighting rule.

The business weighting rule adds points to a recalled result according to weighting rules preset by the user. For example, the position at which the search word appears in the corpus document, the number of times it appears, and its appearance at different classification levels each earn different bonus points. These rules are preset by the user and are not described further here.

The relevancy weighting rule adds points to a recalled result according to how many single words of the search word are hit consecutively in the corpus document. A corpus document in which all single words are hit consecutively receives the highest bonus. A corpus document in which only some single words are hit consecutively receives a bonus that grows with the number of consecutively hit single words.

For example, with the search word "Qingdao Lubo": if a corpus document consecutively hits all four single words "Qing", "dao", "Lu" and "Bo", 10000 points are added to it; if only the three single words "Qing", "dao" and "Lu" are hit consecutively, 50 points are added; if only the two single words "Qing" and "dao" are hit consecutively, 30 points are added; and if only one single word is hit, the corpus document is discarded.
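The relevancy weighting rule can be sketched as follows. The bonus values 10000, 50 and 30 come from the four-character example in the text; everything else is an assumption made for illustration: "consecutive hit" is interpreted as the longest contiguous run of search word characters that appears verbatim in the document, the function names are hypothetical, and ASCII strings stand in for the Chinese characters.

```python
from typing import Optional

# Bonus points for a four-character search word, taken from the example.
# Tables for other query lengths are not given in the source.
EXAMPLE_BONUS = {4: 10000, 3: 50, 2: 30}


def longest_consecutive_hit(query: str, document: str) -> int:
    """Length of the longest contiguous run of query characters in the document."""
    best = 0
    n = len(query)
    for start in range(n):
        # Only test runs longer than the best found so far.
        for end in range(start + best + 1, n + 1):
            if query[start:end] in document:
                best = end - start
            else:
                break  # a longer run from this start cannot match either
    return best


def relevancy_bonus(query: str, document: str) -> Optional[int]:
    """Return the bonus points, or None when the document should be discarded."""
    return EXAMPLE_BONUS.get(longest_consecutive_hit(query, document))


# The special weighting score is added to the relevancy score before sorting.
relevancy = 2654.14719
bonus = relevancy_bonus("abcd", "xxabcdxx")
total = relevancy + bonus if bonus is not None else None
```

The large gap between 10000 and 50 means that, under this rule, a document containing the whole search word verbatim will almost always outrank one with only a partial run, whatever their summed single word relevancies.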

Further, when the user searches in the total-station retrieval mode, the columns are sorted according to a preset sorting rule to generate a second retrieval result, and the first retrieval result and the second retrieval result are combined into a final retrieval result.

Referring to fig. 2, the method for sorting the columns according to the preset sorting rule to generate the second search result includes:

step S201: respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;

step S202: according to a preset priority rule and the state of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;

step S203: and sorting the columns according to the column scores to generate a second retrieval result.

The search term related column model is as follows: a word vector model is used to calculate the similarity between the search word and the historical search data of each column, generating a column ordering. This is a known retrieval model and is not described in detail.
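The text does not say which similarity measure is used. As one possible reading, the sketch below ranks columns by cosine similarity between a search word vector and one vector per column. The vectors, column names and function names are all hypothetical, and how the vectors are derived from historical search data is outside the sketch.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_columns_by_similarity(query_vec: list[float],
                               column_vecs: dict[str, list[float]]) -> list[str]:
    """Order columns from most to least similar to the search word vector."""
    return sorted(column_vecs,
                  key=lambda c: cosine_similarity(query_vec, column_vecs[c]),
                  reverse=True)


# Hypothetical two-dimensional vectors, for illustration only.
column_vectors = {
    "instruments": [0.9, 0.1],
    "recruitment": [0.1, 0.9],
    "data": [0.6, 0.6],
}
related_ranking = rank_columns_by_similarity([1.0, 0.0], column_vectors)
```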

The user preference column model is as follows: a column ordering related to the search word is generated from the historical behavior of the current user, such as searching, clicking, commenting and liking, to determine whether the user's preferred column is instruments, information, data and so on. Specifically, the user preference column model analyzes the historical behavior of the current user to find the columns that the user enters most often and stays in longest: the number of times each column is entered and the dwell time in each column are counted and used together. Calculation rule: column preference score = 50 * (number of visits to this column) / (number of visits to all columns) + 50 * (browsing duration of this column) / (browsing duration of all columns). The final score is used to sort the columns.
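The calculation rule for the user preference column model can be written out directly. In this sketch the function name, the column names and the visit and duration figures are hypothetical; returning 0 for an empty history is an assumption, since the text does not cover that case.

```python
def column_preference_scores(visits: dict[str, int],
                             durations: dict[str, float]) -> dict[str, float]:
    """Score = 50 * visit share + 50 * browsing-time share, per column."""
    total_visits = sum(visits.values())
    total_duration = sum(durations.values())
    scores = {}
    for column in visits:
        # Guard against an empty history (assumption: such shares count as 0).
        visit_share = visits[column] / total_visits if total_visits else 0.0
        time_share = (durations.get(column, 0.0) / total_duration
                      if total_duration else 0.0)
        scores[column] = 50 * visit_share + 50 * time_share
    return scores


# Hypothetical history of one user: visit counts and seconds spent per column.
visits = {"instruments": 6, "information": 3, "data": 1}
durations = {"instruments": 500.0, "information": 400.0, "data": 100.0}
preference = column_preference_scores(visits, durations)
preferred_order = sorted(preference, key=preference.get, reverse=True)
```

Because both shares sum to 1 across columns, the scores of all columns always add up to 100, so they can be read as percentages of the user's attention.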

The search term click preference column model is as follows: the column ordering for a search word is generated from the click behavior of all users in all columns under the same search word on the platform. The click behavior may be the number of clicks or the click time interval.

It should be noted that when the columns are sorted by the user preference column model, the search term related column model, the search term click preference column model and the grammar dependency relationship model respectively, each model outputs a certain number of top-ranked columns as its result. The state of a column in each of these models, as used by the preset priority rule, indicates whether the column is present or absent in that model's output. When a column is present in a model's output, a weighted bonus is added to its column score. For example, a column gains 2 points if it appears in the output of the user preference column model, 4 points if it appears in the output of the search term related column model, 5 points if it appears in the output of the search term click preference column model, and 10 points if it appears in the output of the grammar dependency relationship model.
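The column scoring in steps S202 and S203 can be sketched as a weighted vote over the four model outputs. The point values 2, 4, 5 and 10 are the ones given in the text; the model keys, column names, sample outputs and the alphabetical tie-break are assumptions for illustration.

```python
# Points per model, as given in the example in the text.
MODEL_POINTS = {
    "user_preference": 2,
    "search_term_related": 4,
    "search_term_click_preference": 5,
    "grammar_dependency": 10,
}


def score_columns(model_outputs: dict[str, list[str]]) -> dict[str, int]:
    """Add a model's points to every column present in that model's output."""
    scores: dict[str, int] = {}
    for model, columns in model_outputs.items():
        for column in set(columns):  # presence counts once per model
            scores[column] = scores.get(column, 0) + MODEL_POINTS[model]
    return scores


def second_retrieval_result(model_outputs: dict[str, list[str]]) -> list[str]:
    """Sort columns by descending score; ties broken alphabetically (assumption)."""
    scores = score_columns(model_outputs)
    return sorted(scores, key=lambda c: (-scores[c], c))


# Hypothetical top columns returned by each model.
outputs = {
    "user_preference": ["instruments", "data"],
    "search_term_related": ["instruments", "manufacturers"],
    "search_term_click_preference": ["manufacturers", "data"],
    "grammar_dependency": ["instruments"],
}
column_scores = score_columns(outputs)
column_ranking = second_retrieval_result(outputs)
```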

In another preferred example, referring to fig. 3, the present application further discloses a word segmentation retrieval system applied to a single domain information retrieval platform, the system including:

the receiving module is used for receiving a search term input by a user;

the relevancy calculation module is used for superposing the single word relevancy to generate relevancy scores of the corpus documents;

and the output module is used for sorting the corpus documents according to the relevancy scores to generate a first retrieval result.

The method for calculating the single word relevancy by the single word calculation module comprises the following steps:

calculating the inverse document frequency $\mathrm{idf}(q_i)$ of the single word $q_i$;

wherein:

$i$ is a natural number, and $N$ is the total number of corpus documents $D$;

$k$ is a constant;

$\mathrm{tf} = \mathrm{df}_t / N$;

$|D|$ is the length of corpus document $D$;

$\mathrm{avgDl}$ is the average length of the corpus documents $D$.

The word segmentation retrieval system further comprises a column sorting module, which is used for sorting the columns through the user preference column model, the search term related column model, the search term click preference column model and the grammar dependency relationship model respectively, scoring each column according to the preset priority rule, and finally sorting the columns by column score to generate the second retrieval result.

A preset number of columns is selected from the second retrieval result in descending order of column score, and within each selected column a preset number of corpus documents is selected in descending order of their ranking in the first retrieval result.
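One way to read this selection step is sketched below: take the top columns from the second retrieval result and, inside each, the top documents in the order given by the first retrieval result. The function name, the `doc_column` mapping and all sample data are hypothetical, and the exact merging rule of the application may differ.

```python
def final_retrieval_result(ranked_columns: list[str],
                           ranked_docs: list[str],
                           doc_column: dict[str, str],
                           max_columns: int,
                           max_docs_per_column: int) -> dict[str, list[str]]:
    """Group the best-ranked documents under the best-ranked columns."""
    result: dict[str, list[str]] = {}
    for column in ranked_columns[:max_columns]:
        # Keep the first-result order of the documents that belong to this column.
        docs = [d for d in ranked_docs if doc_column.get(d) == column]
        result[column] = docs[:max_docs_per_column]
    return result


# Hypothetical inputs: second result (columns), first result (documents),
# and the column each document is stored in.
ranked_columns = ["instruments", "manufacturers", "data"]
ranked_docs = ["d1", "d2", "d3", "d4", "d5"]
doc_column = {"d1": "instruments", "d2": "data", "d3": "instruments",
              "d4": "manufacturers", "d5": "instruments"}

final = final_retrieval_result(ranked_columns, ranked_docs, doc_column,
                               max_columns=2, max_docs_per_column=2)
```

Python dictionaries preserve insertion order, so iterating over `final` yields the columns in their ranked order.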

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

When a user needs to search, the search word is input through an instrument information network terminal. A preset number of columns is selected by the column sorting module, a preset number of corpus documents is selected by the receiving module, the word segmentation module, the single word calculation module, the relevancy calculation module and the output module, and finally the selected columns and corpus documents are returned to the instrument information network.

Furthermore, the application does not uniquely limit where the receiving module, the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are located. Referring to fig. 3, in one example, the receiving module is disposed in the instrument information network terminal, and the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are all disposed in the instrument information network platform server. Referring to fig. 4, in another example, the receiving module, the word segmentation module and the column sorting module are all disposed in the instrument information network terminal, so that the processor of the terminal shares the data processing load of the instrument information network platform server, and the single word calculation module, the relevancy calculation module and the output module are all disposed in the instrument information network platform server.

In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative. In addition, the shown or discussed couplings or direct couplings or data communication connections between each other may be through some interfaces.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is also to be understood that, in various embodiments of the present application, unless otherwise specified or conflicting in logic, terms and/or descriptions between different embodiments are consistent and may be mutually referenced, and technical features in different embodiments may be combined to form a new embodiment according to their inherent logical relationship.

The embodiments of the present invention are all preferred embodiments of the present application, and the protection scope of the present application is not limited thereby, so: all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.

Claims

1. A word segmentation retrieval method is characterized in that: the method is applied to a single-domain information retrieval platform, and comprises the following steps:

receiving a search word input by a user;

carrying out single character word segmentation on the search word;

respectively calculating the single word relevancy of each corpus document;

calculating special weighting scores of the corpus documents according to a preset weighting rule, and sequencing the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result;

the preset weighting rule comprises the following steps: a business weighting rule and a relevancy weighting rule;

the business weighting rule comprises: the positions of the search terms appearing in the corpus documents, the times of the search terms appearing and the positions of the search terms appearing in different classification levels;

the relevancy weighting rule comprises the following steps: scoring the recall result according to the number of continuously hit search words in the corpus documents; the method for respectively calculating the single word relevancy of each corpus document comprises the following steps:

calculating the inverse document frequency $\mathrm{idf}(q_i)$ of the single word $q_i$;

calculating the single word relevancy of the single word $q_i$ in corpus document $d$;

wherein:

$f(q_i, d) = \mathrm{tf}(q_i, d) + \mathrm{Norm}$, where Norm is the field length normalization value;

$i$ is a natural number, and $N$ is the total number of corpus documents;

$\mathrm{df}_t$ is the number of corpus documents in which the single word $q_i$ appears;

$k$ is a constant;

$b$ is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when $b$ is 0, and full normalization is enabled when $b$ is 1;

$\mathrm{tf} = \mathrm{df}_t / N$;

$|d|$ is the length of corpus document $d$;

$\mathrm{avgdl}$ is the average length of the corpus documents.

2. The word segmentation retrieval method according to claim 1, wherein the method further comprises:

and after the corpus documents are sorted according to the sum of the relevancy score and the special weighting score, acquiring a preset number of corpus documents according to a ranking sequence to generate a first retrieval result.

3. The word segmentation retrieval method according to claim 1, wherein the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.

4. The word segmentation retrieval method according to claim 3, wherein the preset ordering rule comprises:

according to a preset priority rule and the times of columns appearing in a user preference column model, a retrieval word related column model, a retrieval word click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;

5. A word segmentation retrieval system, comprising: applied to a single domain information retrieval platform, the system comprises:

the receiving module is used for receiving a search word input by a user;

the single character calculation module is used for calculating the single character relevancy of each corpus document;

the output module is used for calculating special weighting scores of the corpus documents according to a preset weighting rule, and sorting the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result;

the preset weighting rule comprises the following steps: business weighting rules and relevancy weighting rules;

the relevancy weighting rule comprises the following steps: scoring the recall result according to the number of continuously hit search words in the corpus documents;

the method for calculating the single character relevancy by the single character calculation module comprises the following steps:

calculating the inverse document frequency of the single word $q_i$;

calculating the single word relevancy of the single word $q_i$ in corpus document $d$;

$f(q_i, d) = \mathrm{tf}(q_i, d) + \mathrm{Norm}$, Norm being the field length normalization value;

$i$ is a natural number, and $N$ is the total number of corpus documents;

$\mathrm{df}_t$ is the number of corpus documents in which the single word $q_i$ appears;

$k$ is a constant;

$\mathrm{tf} = \mathrm{df}_t / N$;

$|d|$ is the length of corpus document $d$;

$\mathrm{avgdl}$ is the average length of the corpus documents.

6. The word segmentation retrieval system as claimed in claim 5, wherein the system further comprises:

and the output module is used for obtaining a preset number of corpus documents according to the ranking sequence after the corpus documents are sorted according to the sum of the relevancy score and the special weighting score, and generating the first retrieval result.