CN114153949A

CN114153949A - Word segmentation retrieval method and system

Info

Publication number: CN114153949A
Application number: CN202111512996.0A
Authority: CN
Inventors: 付雪林; 王涛; 孙思遥; 邓应来; 王启超; 吴邱思; 安重阳; 韩啸; 张葳; 曾明泉; 唐海霞; 赵鑫; 刘成书
Original assignee: Beijing Xin Li Fang Technologies Inc
Current assignee: Beijing Xin Li Fang Technologies Inc
Priority date: 2021-12-11
Filing date: 2021-12-11
Publication date: 2022-03-08
Anticipated expiration: 2041-12-11
Also published as: CN114153949B

Abstract

The application provides a word segmentation retrieval method and system. The method comprises the following steps: receiving a search word input by a user; carrying out single word segmentation on the search word; respectively calculating the single word relevancy of each corpus document; superposing the single word relevancy to generate relevancy scores of the corpus documents; and sorting the corpus documents according to the relevancy scores to generate a first retrieval result. In a single-domain information retrieval platform, the search word is split by single word segmentation, the single word relevancy of each corpus document is calculated, and the corpus documents are sorted by the relevancy scores obtained by superposing the single word relevancy. The retrieval process can retrieve accurately on a single-domain information retrieval platform that has many data structure types, a small user base, many user types, a wide industry span and strong specialization. It requires no manually curated semantic templates, which reduces the maintenance cost of the platform while still providing its retrieval function.

Description

Word segmentation retrieval method and system

Technical Field

The present application relates to the field of search technologies, and in particular, to a method and a system for performing a word segmentation search.

Background

With the continuous development of internet technology, various platforms for instrument information have been set up so that users can retrieve various information about instruments through them, including vertical-field columns for consultation, manufacturers, instruments, communities, data, network lecture halls, instrument currencies, recruitment, consumables, reagents, industrial applications, special subjects, market research and exhibitions.

In a traditional instrument information platform, grammar dependency relationship configuration is generally performed on user search terms in a semantic template building mode to generate different search content sequences.

The instrument information platform is characterized by many data structure types, a small user base, many user types, a wide industry span and strong specialization. To achieve accurate hits in the retrieval process, the semantic templates must be continuously maintained and updated at extremely high cost. In particular, as the number of users grows, more and more users search across fields, which further increases the maintenance cost of the platform. The profitability of the instrument information platform is limited by the market it serves and cannot cover this ever-increasing cost, so traditional instrument information platforms are poorly maintained and their retrieval hit rate declines.

Disclosure of Invention

In order to reduce the retrieval cost of an instrument information platform, the application aims to provide a word segmentation retrieval method and a word segmentation retrieval system.

The above purpose of the present application is achieved by the following technical solutions:

in a first aspect, the present application provides a word segmentation retrieval method applied to a single-domain information retrieval platform, the method including:

receiving a search word input by a user;

carrying out single word segmentation on the search word;

respectively calculating the single word relevancy of each corpus document;

superposing the single word relevancy to generate relevancy scores of the corpus documents;

and sorting the corpus documents according to the relevancy scores to generate a first retrieval result.

By adopting the technical scheme, in a single-domain information retrieval platform, the search word is split by single word segmentation, the single word relevancy of each corpus document is calculated, and the corpus documents are sorted by the relevancy scores obtained by superposing the single word relevancy. The retrieval process can retrieve accurately on a single-domain information retrieval platform that has many data structure types, a small user base, many user types, a wide industry span and strong specialization. It requires no manually curated semantic templates, which reduces the maintenance cost of the platform while still providing its retrieval function.

Further, the method further comprises:

and after the corpus documents are sorted according to the relevancy scores, acquiring a preset number of corpus documents according to a ranking sequence to generate the first retrieval result.

By adopting the technical scheme, when there are many data structure types, that is, many column types, the preset number limits how many corpus documents are output at a time, which helps the corpus documents of several columns to be displayed simultaneously.

Further, the method for respectively calculating the relevance of the single word of each corpus document comprises the following steps:

calculating the inverse document frequency of the i-th single character of the search word;

calculating the word frequency of that single character in the corpus document D;

calculating the single word relevancy of that single character in the corpus document D;

wherein (the formulas and their symbols are not reproduced in this text; only the following definitions survive):

one quantity equals a further term plus Norm, Norm being a field length normalization value;

i is a natural number, and N is the total number of corpus documents D;

a further quantity is the number of corpus documents D in which the single character appears;

k is a constant;

b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;

a further quantity is a ratio with denominator N;

|D| is the length of the corpus document D, and the remaining symbol is the average length of the corpus documents D.

By adopting the technical scheme, the calculation of the tf value and of the idf value of the traditional tf-idf analysis model is improved and applied to single word segmentation retrieval, so as to meet the requirement of single-field retrieval. The method combines a tf-idf analysis model with single word segmentation in a single-field environment, which both simplifies the retrieval mode and improves the accuracy of retrieval hits.

Further, the method further comprises: after the word relevancy is superposed to generate the relevancy score of the corpus document,

and calculating special weighting scores of the corpus documents according to a preset weighting rule, and sequencing the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result.

Further, the preset weighting rule includes a service weighting rule and a relevancy weighting rule.

Further, the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.

Further, the preset ordering rule includes:

respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;

according to preset priority rules and the times of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;

and sorting the columns according to the column scores to generate a second retrieval result.

By adopting the technical scheme, because the number of columns in a single field is limited, sorting the columns with several coexisting models better meets user needs, and the maintenance cost of each model is far lower than it would be if the models were used to sort the much larger set of corpus documents.

In a second aspect, the present application provides a word segmentation retrieval system applied to a single domain information retrieval platform, the system comprising:

the receiving module is used for receiving a search term input by a user;

the word segmentation module is used for carrying out single word segmentation on the search word;

the single word calculation module is used for calculating the single word relevancy of each corpus document;

the relevancy calculation module is used for superposing the single word relevancy to generate relevancy scores of the corpus documents;

and the output module is used for sequencing the corpus documents according to the relevancy scores to generate a first retrieval result.

Further, the system further comprises:

and the output module is used for obtaining a preset number of corpus documents according to the ranking sequence after the corpus documents are ranked according to the relevancy scores to generate the first retrieval result.

Further, the method for calculating the relevance of the single character by the single character calculation module comprises the following steps:

calculating the inverse document frequency of the i-th single character of the search word;

calculating the word frequency of that single character in the corpus document D;

calculating the single word relevancy of that single character in the corpus document D;

wherein (the formulas and their symbols are not reproduced in this text; only the following definitions survive):

one quantity equals a further term plus Norm, Norm being a field length normalization value;

i is a natural number, and N is the total number of corpus documents D;

a further quantity is the number of corpus documents D in which the single character appears;

k is a constant;

a further quantity is a ratio with denominator N;

|D| is the length of the corpus document D, and the remaining symbol is the average length of the corpus documents D.

In summary, the present application includes at least one of the following beneficial technical effects:

1. the maintenance cost of the single-field information retrieval platform is reduced, and the manual maintenance cost is saved in the aspect of retrieval of the corpus documents, so that the maintenance cost of the platform is reduced;

2. the hit rate of platform retrieval is improved: both the new single word relevancy calculation method and the combination of column sorting with corpus document sorting improve the hit rate for users searching on the platform.

Drawings

Fig. 1 is a schematic flow chart of a word segmentation retrieval method according to the present application.

Fig. 2 is a flowchart illustrating a method for generating a second search result according to the present application.

Fig. 3 is a system diagram of one example of the word segmentation retrieval system of the present application.

Fig. 4 is a system diagram of another example of the word segmentation retrieval system of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.

The embodiment of the application provides a word segmentation retrieval method which is applied to a single-field information retrieval platform. The single-domain information retrieval platform refers to a retrieval platform in a certain limited domain, such as an instrument information platform, a medicine information platform, and the like, and the following embodiments only take the instrument information platform as an example to introduce the scheme of the present application, but do not limit the domain type.

In order to improve the hit rate of user retrieval, the single-field information retrieval platform is divided into a plurality of columns according to the vertical fields of instrument information, specifically consultation, manufacturers, instruments, communities, data, network lectures, instrument curriculum, recruitment, consumables, reagents, industry applications, special subjects, market research and exhibition columns. A corpus document library is stored in the instrument information platform, and the corpus documents are stored in partitions according to column type, so that within each column a user can retrieve the corpus documents belonging to that column. The instrument information platform also provides a total-station (site-wide) retrieval mode, in which a user retrieves all corpus documents in the platform with a search word.

Referring to FIG. 1, in one example, a corpus document is retrieved in a single column as follows.

Step S101: receiving a search word input by a user.

Specifically, the search word may be a sentence, a phrase, a single character, or a word composed of several single characters. The instrument information platform may receive the search word in various ways, for example through a touch screen, by voice, by data transmission, or through a keyboard, and each input manner may be equipped with a corresponding input device; this is not uniquely limited herein.

Step S102: carrying out single word segmentation on the search word. Specifically, single word segmentation means that each character in the search word input by the user is treated as one segment. For example, if the search word input by the user is "Qingdao Lubo", it is divided into four single characters, "Qing", "Dao", "Lu" and "Bo".
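Step S102 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the application: the function name `segment_single_chars` is hypothetical, whitespace is dropped as an assumption, and the input string is the Chinese spelling assumed to lie behind "Qingdao Lubo".

```python
def segment_single_chars(search_word: str) -> list[str]:
    """Split a search word into single characters, dropping whitespace."""
    return [ch for ch in search_word if not ch.isspace()]


# Hypothetical input: the characters assumed to correspond to "Qingdao Lubo".
tokens = segment_single_chars("青岛路博")
print(tokens)  # ['青', '岛', '路', '博']
```

Because no dictionary or semantic template is consulted, this step needs no maintenance when new terms enter the field.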

Step S103: respectively calculating the single word relevancy of each corpus document.

The method specifically comprises the following steps:

calculating the inverse document frequency of the i-th single character of the search word, i being a natural number;

calculating the word frequency of that single character in the corpus document D;

calculating the single word relevancy of that single character in the corpus document D;

wherein (the formulas and their symbols are not reproduced in this text; only the following definitions survive):

one quantity equals a further term plus Norm, Norm being a field length normalization value;

i is a natural number, and N is the total number of corpus documents D;

a further quantity is the number of corpus documents D in which the single character appears;

k is a constant;

a further quantity is a ratio with denominator N;

|D| is the length of the corpus document D, and the remaining symbol is the average length of the corpus documents D.

The tf-idf analysis model is introduced into the calculation of the single word relevancy. In the application, the calculation of the tf value and of the idf value is improved and applied to single word segmentation retrieval, so as to meet the requirement of single-field retrieval. The method combines a tf-idf analysis model with single word segmentation in a single-field environment, which both simplifies the retrieval mode and improves the accuracy of retrieval hits.
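The formulas of step S103 are not legible in this text, but the surviving definitions (a document total N, a per-character document count, constants k and b, the document length |D| and an average length) resemble the Okapi BM25 family. The sketch below therefore uses standard BM25 as a stand-in; this is an assumption and not necessarily the formula of the application. The function names, the defaults k = 1.2 and b = 0.75, and the toy corpus are all hypothetical.

```python
import math


def idf(n_docs: int, doc_freq: int) -> float:
    """BM25-style inverse document frequency (assumed form)."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def char_relevancy(ch, doc, corpus, k=1.2, b=0.75):
    """Relevancy of one character `ch` for `doc` under standard BM25."""
    n_docs = len(corpus)
    doc_freq = sum(1 for d in corpus if ch in d)   # documents containing ch
    avg_len = sum(len(d) for d in corpus) / n_docs
    freq = doc.count(ch)                           # occurrences of ch in doc
    # Field length normalization: b = 0 disables it, b = 1 applies it fully.
    norm = 1.0 - b + b * len(doc) / avg_len
    return idf(n_docs, doc_freq) * freq * (k + 1.0) / (freq + k * norm)


corpus = ["青岛路博仪器", "青岛啤酒", "路博环保"]  # hypothetical toy corpus
for ch in "青岛路博":
    print(ch, round(char_relevancy(ch, corpus[0], corpus), 4))
```

A character that does not occur in a document contributes zero, so only the hit characters affect the score obtained by superposition in step S104.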

Step S104: superposing the single word relevancy to generate the relevancy score of the corpus document.

Specifically, for a corpus document, after the single word relevancy of each single character has been calculated, the relevancy score of the search word relative to that corpus document is obtained by adding the single word relevancy of all single characters that form the search word. For example, if the search word input by the user is "Qingdao Lubo" and the single word relevancy of "Qing", "Dao", "Lu" and "Bo" for a corpus document is 536.26274, 789.53536, 841.99603 and 486.35306 respectively, the relevancy score of the corpus document is 536.26274 + 789.53536 + 841.99603 + 486.35306 = 2654.14719.

Step S105: sorting the corpus documents according to the relevancy scores to generate a first retrieval result.

Specifically, when a user searches within a single column, a relatively large number of corpus documents can be displayed at once, so the first retrieval result is simply the sorted list of corpus documents. When a user uses the total-station (site-wide) retrieval mode, corpus documents from several columns must be displayed simultaneously and fewer documents can be shown per column, so a preset number of corpus documents is taken in ranking order to generate the first retrieval result.
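Steps S104 and S105 can be sketched together as follows. The function name `rank_documents`, the document identifiers and the scores of `doc_b` and `doc_c` are hypothetical; only the four values of `doc_a` come from the example in step S104.

```python
def rank_documents(char_scores, top_n=None):
    """Sum per-character relevancy per document and sort descending.

    char_scores maps a document id to its list of single word relevancy values.
    top_n, if given, keeps only that many documents (the preset number).
    """
    totals = {doc_id: sum(scores) for doc_id, scores in char_scores.items()}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


char_scores = {
    "doc_a": [536.26274, 789.53536, 841.99603, 486.35306],  # from the example
    "doc_b": [120.0, 0.0, 310.5, 0.0],                      # hypothetical
    "doc_c": [0.0, 0.0, 0.0, 95.25],                        # hypothetical
}
first_result = rank_documents(char_scores, top_n=2)
print([doc_id for doc_id, _ in first_result])  # ['doc_a', 'doc_b']
```

Leaving `top_n` unset corresponds to retrieval within a single column, while a small `top_n` corresponds to the total-station mode.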

In another example, in step S104, after the single word relevance is superimposed to generate the relevance score of the corpus document, the special weighting score of the corpus document is calculated according to the preset weighting rule, and the corpus document is sorted according to the sum of the relevance score and the special weighting score to generate the first search result.

The preset weighting rules are used for a second round of scoring of all recalled corpus documents and comprise a business weighting rule and a relevancy weighting rule.

The business weighting rule adds points to the recalled results according to weighting rules preset by the user. For example, the position at which the search word appears in the corpus document, the number of times it appears, and the classification level in which it appears each carry a different score. These rules are preset by the user and are not described further here.

The relevancy weighting rule adds points to the recalled results according to how many characters of the search word are hit consecutively in the corpus document. A corpus document in which all characters are hit consecutively receives the highest score. If only some characters are hit consecutively, the document receives a bonus that depends on the length of the consecutive run: the longer the run, the higher the bonus.

For example, if the search word is "Qingdao Lubo" and a corpus document consecutively hits all four characters "Qing", "Dao", "Lu" and "Bo", 10000 points are added to the corpus document; if only the three characters "Qing", "Dao" and "Lu" are hit consecutively, 50 points are added; if only the two characters "Qing" and "Dao" are hit consecutively, 30 points are added; and if only one character is hit, the corpus document is discarded.
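The relevancy weighting rule can be sketched as follows. The bonus values are taken from the four-character example; the source does not say how runs are detected, so the sketch assumes the longest substring of the search word that occurs in the document, and it covers four-character search words only. The function names and test strings are hypothetical.

```python
def longest_consecutive_hit(query: str, document: str) -> int:
    """Length of the longest contiguous run of query characters found in document."""
    for length in range(len(query), 0, -1):
        for start in range(len(query) - length + 1):
            if query[start:start + length] in document:
                return length
    return 0


# Bonus table from the four-character example; other lengths are not specified.
BONUS_BY_RUN = {4: 10000, 3: 50, 2: 30}


def relevancy_bonus(query: str, document: str):
    """Return the bonus score, or None when the document should be discarded."""
    return BONUS_BY_RUN.get(longest_consecutive_hit(query, document))


print(relevancy_bonus("青岛路博", "青岛路博伟业"))  # 10000
print(relevancy_bonus("青岛路博", "青岛路灯"))      # 50
print(relevancy_bonus("青岛路博", "青岛啤酒"))      # 30
print(relevancy_bonus("青岛路博", "博物馆"))        # None
```

The large gap between 10000 and 50 means that a full consecutive match outranks any document that only matches in part, whatever its relevancy score.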

Further, when the user searches in the total-station retrieval mode, the columns are sorted according to a preset sorting rule to generate a second retrieval result, and the first retrieval result and the second retrieval result are combined into a final retrieval result.

Referring to fig. 2, the method for sorting the columns according to the preset sorting rule to generate the second search result includes:

step S201: respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;

step S202: according to a preset priority rule and the state of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;

step S203: and sorting the columns according to the column scores to generate a second retrieval result.

The search term related column model is as follows: a word vector model is used to calculate the similarity between the search word and the historical search data of each column, and the columns are sorted accordingly. This is a known search model and is not described in detail.

The user preference column model is as follows: the columns preferred by the current user, for example instruments, information or data, are derived from the user's historical behavior, such as searches, clicks, comments and likes. Specifically, the model analyzes the current user's historical behavior on the premise that the preferred column is the one entered most often and browsed longest, so the number of times each column is entered and the time spent in each column jointly determine the score. The calculation rule is: column preference score = 50 × (number of visits to this column)/(number of visits to all columns) + 50 × (browsing time in this column)/(browsing time in all columns). The resulting scores are used to sort the columns.
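The column preference score can be computed directly from the rule stated above. In this sketch the function name, the column names and the visit and browsing figures are hypothetical, and returning 0 when a total is zero is an assumption the source does not address.

```python
def column_preference_scores(visit_counts, browse_seconds):
    """Score each column as 50 * visit share + 50 * browsing-time share."""
    total_visits = sum(visit_counts.values())
    total_seconds = sum(browse_seconds.values())
    scores = {}
    for column in visit_counts:
        visit_share = visit_counts[column] / total_visits if total_visits else 0.0
        time_share = (browse_seconds.get(column, 0.0) / total_seconds
                      if total_seconds else 0.0)
        scores[column] = 50 * visit_share + 50 * time_share
    return scores


visits = {"instruments": 6, "information": 3, "data": 1}       # hypothetical
seconds = {"instruments": 500, "information": 400, "data": 100}
scores = column_preference_scores(visits, seconds)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['instruments', 'information', 'data']
```

Since both shares sum to one across columns, the scores of all columns add up to 100, which makes them easy to compare between users.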

The search term click preference column model is as follows: the column order for a search word is generated from the click behavior of all users of the platform, across all columns, under that same search word. The click behavior may be the number of clicks or the interval between clicks.

It should be noted that when the columns are sorted by the user preference column model, the search term related column model, the search term click preference column model and the grammar dependency relationship model, a certain number of top-ranked columns is taken as the output result of each model. The state of a column in a model indicates whether that column is present in the model's output result. When a column is present in a model's output result, the column score is increased by the weight that the preset priority rule assigns to that model. For example, a column is given 2 points if it appears in the output result of the user preference column model, 4 points if it appears in the output result of the search term related column model, 5 points if it appears in the output result of the search term click preference column model, and 10 points if it appears in the output result of the grammar dependency relationship model.
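The column scoring of step S202 can be sketched as follows. The point values are those given in the text; the model keys, the function name and the example model outputs are hypothetical.

```python
# Points a column earns for appearing in each model's output (values from the text).
MODEL_POINTS = {
    "user_preference": 2,
    "search_term_related": 4,
    "click_preference": 5,
    "grammar_dependency": 10,
}


def score_columns(model_outputs):
    """Add each model's points to every column that appears in its output."""
    scores = {}
    for model, columns in model_outputs.items():
        for column in columns:
            scores[column] = scores.get(column, 0) + MODEL_POINTS[model]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


outputs = {  # hypothetical top columns returned by each model
    "user_preference": ["instruments", "information"],
    "search_term_related": ["instruments", "manufacturers"],
    "click_preference": ["manufacturers"],
    "grammar_dependency": ["instruments"],
}
print(score_columns(outputs))
# [('instruments', 16), ('manufacturers', 9), ('information', 2)]
```

The sorted list corresponds to the second retrieval result of step S203.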

Referring to fig. 3, in another preferred example, the present application further discloses a word segmentation retrieval system applied to a single-domain information retrieval platform, the system comprising:

the receiving module is used for receiving a search term input by a user;

The method for calculating the single character relevancy by the single character calculation module comprises the following steps:

calculating the inverse document frequency of the i-th single character of the search word;

calculating the word frequency of that single character in the corpus document D;

calculating the single word relevancy of that single character in the corpus document D;

wherein (the formulas and their symbols are not reproduced in this text; only the following definitions survive):

one quantity equals a further term plus Norm, Norm being a field length normalization value;

i is a natural number, and N is the total number of corpus documents D;

a further quantity is the number of corpus documents D in which the single character appears;

k is a constant;

a further quantity is a ratio with denominator N;

|D| is the length of the corpus document D, and the remaining symbol is the average length of the corpus documents D.

The word segmentation retrieval system further comprises a column sorting module, which sorts the columns through the user preference column model, the search term related column model, the search term click preference column model and the grammar dependency relationship model respectively, scores each column according to the preset priority rule, and finally sorts the columns by score to generate the second retrieval result.

A preset number of columns is selected from the second retrieval result in descending order of column score, and within each selected column a preset number of corpus documents is selected in the order in which they are ranked in the first retrieval result, from high to low.
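Combining the two results can be sketched as follows. The source does not specify the data structures, so the representation of the first retrieval result as ranked (document, column) pairs, the function name and the sample data are all assumptions.

```python
def build_final_result(column_ranking, doc_ranking, max_columns, docs_per_column):
    """Keep the top columns and, inside each, the top documents in first-result order.

    column_ranking: column names, best first (the second retrieval result).
    doc_ranking: (doc_id, column) pairs, best first (the first retrieval result).
    """
    final = {}
    for column in column_ranking[:max_columns]:
        docs = [doc_id for doc_id, col in doc_ranking if col == column]
        final[column] = docs[:docs_per_column]
    return final


columns = ["instruments", "manufacturers", "information"]      # hypothetical
docs = [("d1", "manufacturers"), ("d2", "instruments"),
        ("d3", "instruments"), ("d4", "information"), ("d5", "instruments")]
print(build_final_result(columns, docs, max_columns=2, docs_per_column=2))
# {'instruments': ['d2', 'd3'], 'manufacturers': ['d1']}
```

Column order comes from the column scores, while document order inside a column comes from the relevancy ranking, so the two rankings stay independent.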

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.

When a user needs to search, search words are input through an instrument information network terminal, a preset number of columns are selected through a column sorting module, a preset number of corpus documents are selected through a receiving module, a word segmentation module, a single word calculation module, a relevance calculation module and an output module, and finally the selected columns and the corpus documents are returned to an instrument information network.

Furthermore, the application does not uniquely limit where the receiving module, the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are placed. Referring to fig. 3, in one example, the receiving module is disposed at the instrument information network terminal, and the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are disposed in a server of the instrument information network platform. Referring to fig. 4, in another example, the receiving module, the word segmentation module and the column sorting module are all disposed in the instrument information network terminal, so that the processor of the terminal shares the data processing load of the instrument information network platform server, and the single word calculation module, the relevancy calculation module and the output module are all disposed in the instrument information network platform server.

In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative. In addition, the shown or discussed couplings or direct couplings or data communication connections between each other may be through some interfaces.

Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is also to be understood that, unless otherwise stated or logically conflicting, the terminology and/or descriptions of the various embodiments herein are consistent and may be mutually referenced, and that the technical features of the various embodiments may be combined to form new embodiments based on their inherent logical relationships.

The embodiments of the present invention are preferred embodiments of the present application, and the scope of protection of the present application is not limited by the embodiments, so: all equivalent changes made according to the structure, shape and principle of the present application shall be covered by the protection scope of the present application.

Claims

1. A word segmentation retrieval method is characterized in that: the method is applied to a single-domain information retrieval platform, and comprises the following steps:

receiving a search word input by a user;

carrying out single word segmentation on the search word;

respectively calculating the single word relevancy of each corpus document;

2. The word segmentation retrieval method according to claim 1, wherein the method further comprises:

3. The word segmentation search method according to claim 1, wherein the method for calculating the relevance of each word of each corpus document comprises:

calculating the inverse document frequency of the i-th single character of the search word;

calculating the word frequency of that single character in the corpus document D;

calculating the single word relevancy of that single character in the corpus document D;

wherein (the formulas and their symbols are not reproduced in this text; only the following definitions survive):

one quantity equals a further term plus Norm, Norm being a field length normalization value;

i is a natural number, and N is the total number of corpus documents D;

a further quantity is the number of corpus documents D in which the single character appears;

k is a constant;

a further quantity is a ratio with denominator N;

|D| is the length of the corpus document D, and the remaining symbol is the average length of the corpus documents D.

4. The word segmentation retrieval method according to claim 1, wherein the method further comprises: after the word relevancy is superposed to generate the relevancy score of the corpus document,

5. The word segmentation retrieval method according to claim 4, wherein the preset weighting rules comprise business weighting rules and relevancy weighting rules.

6. The word segmentation retrieval method according to claim 1, wherein the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.

7. The word segmentation retrieval method according to claim 6, wherein the preset ordering rule comprises:

8. A word segmentation retrieval system, characterized in that: the system is applied to a single-domain information retrieval platform, and the system comprises:

the receiving module is used for receiving a search term input by a user;

9. The word segmentation retrieval system of claim 8, wherein the system further comprises:

10. The word segmentation retrieval system of claim 8, wherein the method for calculating the word relevancy by the word calculation module comprises:

calculating single word