CN114153949A - Word segmentation retrieval method and system - Google Patents

Publication number
CN114153949A
Authority
CN
China
Prior art keywords
corpus
word
relevancy
retrieval
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111512996.0A
Other languages
Chinese (zh)
Other versions
CN114153949B (en
Inventor
付雪林
王涛
孙思遥
邓应来
王启超
吴邱思
安重阳
韩啸
张葳
曾明泉
唐海霞
赵鑫
刘成书
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xin Li Fang Technologies Inc
Original Assignee
Beijing Xin Li Fang Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xin Li Fang Technologies Inc
Priority to CN202111512996.0A
Publication of CN114153949A
Application granted
Publication of CN114153949B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a word segmentation retrieval method and system. The method comprises the following steps: receiving a search word input by a user; segmenting the search word into single words; calculating, for each corpus document, the single word relevancy of each single word; summing the single word relevancies to generate a relevancy score for each corpus document; and sorting the corpus documents by relevancy score to generate a first retrieval result. In a single-domain information retrieval platform, the search word is split by single word segmentation, the single word relevancy is calculated for each corpus document, and the corpus documents are sorted by the relevancy scores obtained by summing the single word relevancies. This retrieval process gives accurate results on a single-domain information retrieval platform that has many data structure types, a small user base, many user types, a wide industry span and highly specialized content. It requires no manually curated semantic templates, which reduces the maintenance cost of the platform while still providing its retrieval function.

Description

Word segmentation retrieval method and system
Technical Field
The present application relates to the field of search technologies, and in particular, to a method and a system for performing a word segmentation search.
Background
With the continuous development of internet technology, various platforms have been set up for instrument information, so that users can retrieve all kinds of instrument-related information through them, including vertical-field columns for consultation, manufacturers, instruments, communities, data, network lecture halls, instrument currencies, recruitment, consumables, reagents, industrial applications, special subjects, market research and exhibitions.
In a traditional instrument information platform, semantic templates are built to configure grammar dependency relationships for user search terms, which produces different orderings of the search content.
An instrument information platform has many data structure types, a small user base, many user types, a wide industry span and highly specialized content. To achieve accurate hits during retrieval, the semantic templates must be continuously maintained and updated at very high cost. As the number of users grows, more and more users search across fields, which further increases the maintenance cost of the platform. The profitability of the platform is limited by the market it serves and cannot cover this rising cost, so traditional instrument information platforms are poorly maintained and their retrieval hit rate falls.
Disclosure of Invention
In order to reduce the retrieval cost of an instrument information platform, the application aims to provide a word segmentation retrieval method and a word segmentation retrieval system.
The above application purpose of the present application is achieved by the following technical solutions:
in a first aspect, the present application provides a word segmentation retrieval method applied to a single-domain information retrieval platform, the method including:
receiving a search word input by a user;
carrying out single word segmentation on the search word;
respectively calculating the single word relevancy of each corpus document;
superposing the single word relevancy to generate relevancy scores of the corpus documents;
and sorting the corpus documents according to the relevancy scores to generate a first retrieval result.
By adopting the technical scheme, in the single-domain information retrieval platform, the search word is split by single word segmentation, the single word relevancy is calculated for each corpus document, and the corpus documents are sorted by the relevancy scores obtained by summing the single word relevancies. This retrieval process gives accurate results on a single-domain information retrieval platform that has many data structure types, a small user base, many user types, a wide industry span and highly specialized content. It requires no manually curated semantic templates, which reduces the maintenance cost of the platform while still providing its retrieval function.
Further, the method further comprises:
and after the corpus documents are sorted according to the relevancy scores, acquiring a preset number of corpus documents according to a ranking sequence to generate the first retrieval result.
By adopting the technical scheme, when there are many data structure types, that is, many column types, limiting the output to a preset number reduces the number of corpus documents output at one time and helps display corpus documents from several columns simultaneously.
Further, the method for respectively calculating the relevance of the single word of each corpus document comprises the following steps:
calculating the inverse document frequency of each single word [formula image not reproduced];
calculating the word frequency of the single word in corpus document D [formula image not reproduced];
calculating the single word relevancy of the single word in corpus document D [formula image not reproduced];
wherein:
[term not reproduced] = [term not reproduced] + Norm, Norm being a field length normalization value;
i is a natural number indexing the single words, and N is the total number of corpus documents D;
[symbol not reproduced] is the number of corpus documents D in which the single word appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
[term not reproduced] = (number of corpus documents D in which the single word appears) / N;
[formula image not reproduced];
|D| is the length of corpus document D;
[symbol not reproduced] is the average length of the corpus documents D.
By adopting the technical scheme, on the basis of the traditional tf-idf analysis model, the calculations of the tf value and the idf value are each improved and applied to single word segmentation retrieval to meet the needs of single-domain retrieval. The method combines the tf-idf analysis model with single word segmentation in a single-domain environment, which both simplifies retrieval and improves the accuracy of retrieval hits.
Further, the method further comprises: after the word relevancy is superposed to generate the relevancy score of the corpus document,
and calculating special weighting scores of the corpus documents according to a preset weighting rule, and sequencing the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result.
Further, the preset weighting rule includes a service weighting rule and a relevancy weighting rule.
Further, the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.
Further, the preset ordering rule includes:
respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;
according to preset priority rules and the times of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;
and sorting the columns according to the column scores to generate a second retrieval result.
By adopting the technical scheme, the number of columns in a single domain is limited, and sorting the columns with several coexisting models better meets user needs. At the same time, the maintenance cost of each model is far lower than it would be if the model were used to sort the large volume of corpus documents.
In a second aspect, the present application provides a word segmentation retrieval system applied to a single domain information retrieval platform, the system comprising:
the receiving module is used for receiving a search term input by a user;
the word segmentation module is used for carrying out word segmentation on the single words of the search word;
the single word calculation module is used for calculating the single word relevancy of each corpus document;
the relevancy calculation module is used for superposing the single word relevancy to generate relevancy scores of the corpus documents;
and the output module is used for sequencing the corpus documents according to the relevancy scores to generate a first retrieval result.
Further, the system further comprises:
and the output module is used for obtaining a preset number of corpus documents according to the ranking sequence after the corpus documents are ranked according to the relevancy scores to generate the first retrieval result.
Further, the method for calculating the relevance of the single character by the single character calculation module comprises the following steps:
calculating the inverse document frequency of each single word [formula image not reproduced];
calculating the word frequency of the single word in corpus document D [formula image not reproduced];
calculating the single word relevancy of the single word in corpus document D [formula image not reproduced];
wherein:
[term not reproduced] = [term not reproduced] + Norm, Norm being a field length normalization value;
i is a natural number indexing the single words, and N is the total number of corpus documents D;
[symbol not reproduced] is the number of corpus documents D in which the single word appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
[term not reproduced] = (number of corpus documents D in which the single word appears) / N;
[formula image not reproduced];
|D| is the length of corpus document D;
[symbol not reproduced] is the average length of the corpus documents D.
In summary, the present application includes at least one of the following beneficial technical effects:
1. the maintenance cost of the single-field information retrieval platform is reduced, and the manual maintenance cost is saved in the aspect of retrieval of the corpus documents, so that the maintenance cost of the platform is reduced;
2. the hit rate of platform retrieval is improved: both the new single word relevancy calculation method and the combination of column sorting with corpus document sorting raise the hit rate users obtain when searching the platform.
Drawings
Fig. 1 is a schematic flow chart of a word segmentation retrieval method according to the present application.
Fig. 2 is a flowchart illustrating a method for generating a second search result according to the present application.
Fig. 3 is a system diagram of one example of the word segmentation retrieval system of the present application.
Fig. 4 is a system diagram of another example of the word segmentation retrieval system of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiments of the present application will be described in further detail with reference to the drawings attached hereto.
The embodiment of the application provides a word segmentation retrieval method which is applied to a single-field information retrieval platform. The single-domain information retrieval platform refers to a retrieval platform in a certain limited domain, such as an instrument information platform, a medicine information platform, and the like, and the following embodiments only take the instrument information platform as an example to introduce the scheme of the present application, but do not limit the domain type.
In order to improve the hit rate of user retrieval, the single-domain information retrieval platform is divided into a plurality of columns according to the vertical fields of instrument information, specifically consultation, manufacturers, instruments, communities, data, network lectures, instrument curriculum, recruitment, consumables, reagents, industry applications, special subjects, market research and exhibition columns. A corpus document library is stored in the instrument information platform, and the corpus documents are stored in partitions by column type, so that within each column a user can retrieve the corpus documents belonging to that column. The instrument information platform also provides a total-station retrieval mode, in which a user retrieves all corpus documents in the platform with a search word.
Referring to FIG. 1, in one example, a corpus document is retrieved in a single column as follows.
Step S101: receiving a search word input by a user.
Specifically, the search term may be a sentence, a phrase, a single word, or a word composed of a plurality of single words, and the manner in which the instrument information platform receives the search term input by the user may be various, for example, the search term is input through a touch screen, the search term is input through voice, the search term is input through data transmission, or the search term is input through a keyboard, and accordingly, different search term input manners may be equipped with corresponding input devices, which is not limited uniquely herein.
Step S102: performing single word segmentation on the search word. Specifically, single word segmentation treats each character of the search word input by the user as one segment. For example, if the search word input by the user is "Qingdao Lubo" (青岛路博), the search word is divided into the four single words "青" (Qing), "岛" (dao, "island"), "路" (Lu, "road") and "博" (bo).
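The single word segmentation of step S102 can be sketched as below. This is a minimal illustration, not the application's implementation: the function name `segment_single_chars` is hypothetical, the example search word "Qingdao Lubo" is assumed to correspond to the Chinese string "青岛路博", and dropping whitespace is an added assumption.

```python
def segment_single_chars(query: str) -> list[str]:
    """Split a search word into single characters.

    Whitespace is dropped; that choice is an assumption of this sketch.
    """
    return [ch for ch in query if not ch.isspace()]


# Assumed Chinese form of the example search word "Qingdao Lubo".
print(segment_single_chars("青岛路博"))
```

Because every character becomes its own segment, no dictionary or semantic template is needed at this step.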
Step S103: calculating the single word relevancy of each corpus document.
The method specifically comprises the following steps:
calculating the inverse document frequency of each single word [formula image not reproduced];
calculating the word frequency of the single word in corpus document D [formula image not reproduced];
calculating the single word relevancy of the single word in corpus document D [formula image not reproduced];
wherein:
[term not reproduced] = [term not reproduced] + Norm, Norm being a field length normalization value;
i is a natural number indexing the single words, and N is the total number of corpus documents D;
[symbol not reproduced] is the number of corpus documents D in which the single word appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
[term not reproduced] = (number of corpus documents D in which the single word appears) / N;
[formula image not reproduced];
|D| is the length of corpus document D;
[symbol not reproduced] is the average length of the corpus documents D.
The tf-idf analysis model is introduced in the calculation of the single word relevancy. In the application, the calculations of the tf value and the idf value are each improved and applied to single word segmentation retrieval to meet the needs of single-domain retrieval. The method combines the tf-idf analysis model with single word segmentation in a single-domain environment, which both simplifies retrieval and improves the accuracy of retrieval hits.
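The formula images for the single word relevancy are missing from this text, so the sketch below does not reproduce the application's modified calculation. It swaps in the standard Okapi BM25 formula as a stand-in, because BM25 is a tf-idf variant that uses the same ingredients listed above: the total number of documents N, the number of documents containing a single word, a constant k, a length normalization parameter b, the document length |D| and the average document length. The function name `bm25_char_scores`, the default values k=1.2 and b=0.75, and the choice to measure document length in characters are all assumptions.

```python
import math
from collections import Counter


def bm25_char_scores(query_chars, docs, k=1.2, b=0.75):
    """Return one {char: score} dict per document (standard BM25, not the patent's formula)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs  # average document length
    unique_chars = set(query_chars)
    # Number of documents in which each single character appears.
    doc_freq = {q: sum(1 for d in docs if q in d) for q in unique_chars}

    results = []
    for d in docs:
        counts = Counter(d)
        # b = 0 disables length normalization, b = 1 applies it fully.
        norm = 1 - b + b * len(d) / avg_len
        per_char = {}
        for q in unique_chars:
            n_q = doc_freq[q]
            idf = math.log(1 + (n_docs - n_q + 0.5) / (n_q + 0.5))
            tf = counts[q]
            per_char[q] = idf * tf * (k + 1) / (tf + k * norm)
        results.append(per_char)
    return results


docs = ["青岛路博仪器", "青岛啤酒", "上海仪器"]  # hypothetical corpus
for doc, scores in zip(docs, bm25_char_scores(list("青岛路博"), docs)):
    print(doc, scores)
```

With b set to 0 the score no longer depends on document length, which matches the described behavior of the b parameter.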
Step S104: summing the single word relevancies to generate the relevancy score of the corpus document.
Specifically, for a given corpus document, once the single word relevancy has been calculated for each single word, the relevancy score of the search word with respect to that corpus document is obtained by adding up the relevancies of the single words that make up the search word. For example, if the search word input by the user is "Qingdao Lubo" and the single word relevancies of its four single words for a corpus document are 536.26274, 789.53536, 841.99603 and 486.35306 respectively, the relevancy score of the corpus document is 536.26274 + 789.53536 + 841.99603 + 486.35306 = 2654.14719.
Step S105: sorting the corpus documents according to the relevancy scores to generate a first retrieval result.
Specifically, when a user searches within a single column, a relatively large number of corpus documents can be displayed at once, so the first retrieval result is the full sorted list of corpus documents. When a user uses the total-station retrieval mode, corpus documents from several columns must be displayed simultaneously and fewer corpus documents can be shown per column, so a preset number of corpus documents is taken in ranking order to generate the first retrieval result.
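Steps S104 and S105 can be sketched together as a sum followed by a sort, with an optional cut-off for the total-station mode. The function name `rank_documents`, the input layout and the second document are hypothetical; the four relevancy values for document "a" are the ones from the example in the text.

```python
def rank_documents(per_word_relevancy, top_k=None):
    """Sum single word relevancies per document and sort from high to low.

    per_word_relevancy: {doc_id: [relevancy of each single word]}
    top_k: preset number of documents to keep, or None to keep all.
    """
    totals = {doc_id: sum(values) for doc_id, values in per_word_relevancy.items()}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


example = {
    "a": [536.26274, 789.53536, 841.99603, 486.35306],  # values from the text
    "b": [120.5, 0.0, 310.25, 0.0],                     # hypothetical document
}
print(rank_documents(example))           # single-column search: full sorted list
print(rank_documents(example, top_k=1))  # total-station mode: preset number only
```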
In another example, in step S104, after the single word relevance is superimposed to generate the relevance score of the corpus document, the special weighting score of the corpus document is calculated according to the preset weighting rule, and the corpus document is sorted according to the sum of the relevance score and the special weighting score to generate the first search result.
The preset weighting rules apply a second round of scoring to all recalled corpus documents and comprise service weighting rules and relevancy weighting rules.
The service weighting rule adds points to the recalled results according to weighting rules preset by the user. For example, the position at which the search word appears in the corpus document, the number of times it appears, and the classification level at which it appears each carry a different score. These rules are preset by the user and are not described further here.
The relevancy weighting rule adds points to the recalled results according to the number of single words of the search word that are hit consecutively in the corpus document. A corpus document in which all single words are hit consecutively receives the highest score. If only some of the single words are hit consecutively, points are added according to how many are, and the more single words hit consecutively, the higher the score.
For example, if the search word is "Qingdao Lubo" and a corpus document consecutively hits all four single words, 10000 points are added to the corpus document; if only the first three single words are hit consecutively, 50 points are added; if only the first two single words are hit consecutively, 30 points are added; and if only one single word is hit, the corpus document is discarded.
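The relevancy weighting rule can be sketched as below. The text does not define how a consecutive hit is detected, so this sketch assumes it means the longest contiguous run of the search word's characters that appears verbatim in the document. The function names and the `EXAMPLE_BONUS` table are hypothetical; the point values come from the four-character example in the text, and the search word "Qingdao Lubo" is assumed to be the Chinese string "青岛路博".

```python
def longest_consecutive_hit(query: str, document: str) -> int:
    """Length of the longest contiguous run of query characters found in the document."""
    for length in range(len(query), 0, -1):
        for start in range(len(query) - length + 1):
            if query[start:start + length] in document:
                return length
    return 0


# Points for a four-character search word, taken from the example in the text.
EXAMPLE_BONUS = {4: 10000, 3: 50, 2: 30}


def relevancy_bonus(query, document, bonus_table=EXAMPLE_BONUS):
    """Return the bonus points, or None when the document should be discarded."""
    run = longest_consecutive_hit(query, document)
    return bonus_table.get(run)  # a run of 0 or 1 has no entry, so None


print(relevancy_bonus("青岛路博", "青岛路博环保"))  # all four hit consecutively
print(relevancy_bonus("青岛路博", "青岛啤酒"))      # only two hit consecutively
print(relevancy_bonus("青岛路博", "青春"))          # single hit: discarded
```

Returning None for a discarded document keeps the decision to drop it with the caller, which can then filter before the final sort.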
Further, when the user searches in the total-station retrieval mode, the columns are sorted according to a preset sorting rule to generate a second retrieval result, and the first retrieval result and the second retrieval result are combined into a final retrieval result.
Referring to fig. 2, the method for sorting the columns according to the preset sorting rule to generate the second search result includes:
step S201: respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;
step S202: according to a preset priority rule and the state of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;
step S203: and sorting the columns according to the column scores to generate a second retrieval result.
The search term related column model is as follows: a word vector model is used to calculate the similarity between the search word and the historical search data of each column, and a column ranking is generated from it. This is a known search model and is not described in detail.
The user preference column model is as follows: the columns the current user prefers, such as instruments, information or data, are derived from the user's historical behavior, for example searching, clicking, commenting and liking. Specifically, the model analyzes the current user's history to find the columns entered most often and viewed longest: the number of times the user entered each column and the time spent in each column together determine the preference. The calculation rule is: column preference score = 50 × (number of visits to this column) / (number of visits to all columns) + 50 × (browsing time in this column) / (browsing time in all columns). The resulting scores are used to sort the columns.
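The calculation rule of the user preference column model can be sketched as follows. The function name `column_preference_scores`, the input dictionaries and the sample numbers are hypothetical; only the 50/50 weighting of visit share and browsing-time share comes from the text.

```python
def column_preference_scores(visits, browse_seconds):
    """Score each column as 50 * visit share + 50 * browsing-time share.

    visits: {column: number of times the user entered the column}
    browse_seconds: {column: total time the user spent in the column}
    """
    total_visits = sum(visits.values())
    total_time = sum(browse_seconds.values())
    scores = {}
    for column in visits:
        visit_share = visits[column] / total_visits if total_visits else 0.0
        time_share = browse_seconds.get(column, 0) / total_time if total_time else 0.0
        scores[column] = 50 * visit_share + 50 * time_share
    return scores


# Hypothetical history for one user.
visits = {"instruments": 6, "information": 3, "data": 1}
browse_seconds = {"instruments": 500, "information": 400, "data": 100}
scores = column_preference_scores(visits, browse_seconds)
print(sorted(scores.items(), key=lambda item: item[1], reverse=True))
```

Since both shares sum to 1 across columns, the scores of all columns add up to 100 whenever the user has any history.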
The search term click preference column model is as follows: a column ranking for the search term is generated from the click behavior of all users in all columns for the same search term on the platform. The click behavior may be measured by the number of clicks or by the click time interval.
It should be noted that when the columns are sorted by each of the four models (the user preference column model, the search term related column model, the search term click preference column model and the grammar dependency relationship model), a certain number of top-ranked columns is taken as the output of each model. The state of a column in a model indicates whether that column is present in the output of the model. When a column is present in the output of a model, its column score is increased by a weight set by the preset priority rule. In this weighted addition, a column gains 2 points if it appears in the output of the user preference column model, 4 points if it appears in the output of the search term related column model, 5 points if it appears in the output of the search term click preference column model, and 10 points if it appears in the output of the grammar dependency relationship model.
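The weighted addition of column scores can be sketched as below. The weights 2, 4, 5 and 10 are the ones stated in the text; the function name `score_columns`, the dictionary keys and the sample model outputs are hypothetical.

```python
# Points a column gains for appearing in each model's output (values from the text).
MODEL_WEIGHTS = {
    "user_preference": 2,
    "search_term_related": 4,
    "click_preference": 5,
    "grammar_dependency": 10,
}


def score_columns(model_outputs, weights=MODEL_WEIGHTS):
    """Add each model's weight to every column present in that model's output.

    model_outputs: {model name: list of columns the model returned}
    Returns (column, score) pairs sorted from high to low.
    """
    scores = {}
    for model, columns in model_outputs.items():
        for column in set(columns):  # presence counts once per model
            scores[column] = scores.get(column, 0) + weights[model]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical outputs of the four models for one search.
outputs = {
    "user_preference": ["instruments", "data"],
    "search_term_related": ["instruments", "manufacturers"],
    "click_preference": ["manufacturers"],
    "grammar_dependency": ["instruments"],
}
print(score_columns(outputs))
```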
Referring to fig. 3, in another preferred example, the present application further discloses a word segmentation retrieval system applied to a single-domain information retrieval platform, the system comprising:
the receiving module is used for receiving a search term input by a user;
the word segmentation module is used for carrying out word segmentation on the single words of the search word;
the single word calculation module is used for calculating the single word relevancy of each corpus document;
the relevancy calculation module is used for superposing the single word relevancy to generate relevancy scores of the corpus documents;
and the output module is used for sequencing the corpus documents according to the relevancy scores to generate a first retrieval result.
The output module is further used for obtaining a preset number of corpus documents in ranking order, after the corpus documents are ranked according to the relevancy scores, to generate the first retrieval result.
The method for calculating the single character relevancy by the single character calculation module comprises the following steps:
calculating the inverse document frequency of each single word [formula image not reproduced];
calculating the word frequency of the single word in corpus document D [formula image not reproduced];
calculating the single word relevancy of the single word in corpus document D [formula image not reproduced];
wherein:
[term not reproduced] = [term not reproduced] + Norm, Norm being a field length normalization value;
i is a natural number indexing the single words, and N is the total number of corpus documents D;
[symbol not reproduced] is the number of corpus documents D in which the single word appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0 and fully enabled when b is 1;
[term not reproduced] = (number of corpus documents D in which the single word appears) / N;
[formula image not reproduced];
|D| is the length of corpus document D;
[symbol not reproduced] is the average length of the corpus documents D.
The word segmentation retrieval system further comprises a column sorting module, wherein the column sorting module is used for sorting the columns respectively through a user preference column model, a retrieval word related column model, a retrieval word click preference column model and a grammar dependence relation model, grading each column respectively according to a preset priority rule, and finally sorting the columns according to the grading result to generate a second retrieval result.
A preset number of columns is selected from the second retrieval result in descending order of column score, and within each selected column a preset number of corpus documents is selected in descending order of their ranking in the first retrieval result.
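Combining the two retrieval results can be sketched as follows. The function name `build_final_result`, the input layout and the sample data are hypothetical; the sketch only assumes that the second retrieval result is a list of columns ordered by column score and that the first retrieval result is a list of corpus documents ordered by relevancy, each tagged with its column.

```python
def build_final_result(ranked_columns, ranked_docs, max_columns, docs_per_column):
    """Pick the top columns, then the top documents inside each picked column.

    ranked_columns: column names, highest column score first (second retrieval result)
    ranked_docs: (doc_id, column) pairs, highest relevancy first (first retrieval result)
    """
    final = []
    for column in ranked_columns[:max_columns]:
        docs = [doc_id for doc_id, col in ranked_docs if col == column]
        final.append((column, docs[:docs_per_column]))
    return final


# Hypothetical inputs.
ranked_columns = ["instruments", "manufacturers", "data"]
ranked_docs = [
    ("d1", "instruments"), ("d2", "manufacturers"), ("d3", "instruments"),
    ("d4", "data"), ("d5", "instruments"), ("d6", "manufacturers"),
]
print(build_final_result(ranked_columns, ranked_docs, max_columns=2, docs_per_column=2))
```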
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system described above may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
When a user needs to search, the user inputs a search word through an instrument information network terminal. A preset number of columns is selected by the column sorting module, and a preset number of corpus documents is selected by the receiving module, the word segmentation module, the single word calculation module, the relevancy calculation module and the output module. Finally, the selected columns and corpus documents are returned to the instrument information network.
Furthermore, the present application does not uniquely limit where the receiving module, the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are located. Referring to fig. 3, in one example, the receiving module is disposed in the instrument information network terminal, and the word segmentation module, the single word calculation module, the relevancy calculation module, the output module and the column sorting module are disposed in a server of the instrument information network platform. Referring to fig. 4, in another example, the receiving module, the word segmentation module and the column sorting module are all disposed in the instrument information network terminal, so that the processor of the terminal shares the data processing load of the instrument information network platform server, and the single word calculation module, the relevancy calculation module and the output module are all disposed in the instrument information network platform server.
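The document-selection flow (receive the search word, split it into single words, compute the single word relevancy for each corpus document, superpose the values, sort, and keep a preset number of documents) can be sketched in Python as below. This is a minimal illustration under stated assumptions: the scoring function is passed in as a parameter, and `count_relevancy` is a deliberately simple stand-in (a raw occurrence count), not the relevancy formula of this application. All function names are hypothetical.

```python
from typing import Callable


def segment_single_words(query: str) -> list[str]:
    """Split the search word into single characters, skipping whitespace."""
    return [ch for ch in query if not ch.isspace()]


def retrieve(query: str, corpus: dict[str, str],
             relevancy: Callable[[str, str], float],
             top_n: int = 10) -> list[tuple[str, float]]:
    """Sum the per-word relevancy for each document, then rank and truncate."""
    words = segment_single_words(query)
    scored = []
    for doc_id, text in corpus.items():
        score = sum(relevancy(w, text) for w in words)   # superpose the values
        scored.append((doc_id, score))
    # Highest relevancy score first; ties broken by document id.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_n]


def count_relevancy(word: str, text: str) -> float:
    """Stand-in scorer (assumption): number of occurrences of the word."""
    return float(text.count(word))


if __name__ == "__main__":
    documents = {"d1": "aab", "d2": "ab", "d3": "c"}
    print(retrieve("ab", documents, count_relevancy, top_n=2))
```

Passing the scorer as an argument keeps the pipeline independent of the particular relevancy formula, so a BM25-style function could be substituted without changing `retrieve`.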
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the system embodiments described above are merely illustrative. In addition, the shown or discussed couplings or direct couplings or data communication connections between each other may be through some interfaces.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is also to be understood that, unless otherwise specified or logically conflicting, the terminology and/or descriptions of the various embodiments herein are consistent and may be referenced by one another, and that the technical features of the various embodiments may be combined to form new embodiments based on their inherent logical relationships.
The embodiments described above are preferred embodiments of the present application and do not limit its scope of protection. Therefore, all equivalent changes made according to the structure, shape and principle of the present application shall fall within the scope of protection of the present application.

Claims (10)

1. A word segmentation retrieval method is characterized in that: the method is applied to a single-domain information retrieval platform, and comprises the following steps:
receiving a search word input by a user;
carrying out single word segmentation on the search word;
respectively calculating the single word relevancy of each corpus document;
superposing the single word relevancy to generate relevancy scores of the corpus documents;
and sorting the corpus documents according to the relevancy scores to generate a first retrieval result.
2. The word segmentation retrieval method according to claim 1, wherein the method further comprises:
and after the corpus documents are sorted according to the relevancy scores, acquiring a preset number of corpus documents according to a ranking sequence to generate the first retrieval result.
3. The word segmentation retrieval method according to claim 1, wherein calculating the single word relevancy of each corpus document comprises:
calculating the inverse document frequency [formula image 002] of the single word [formula image 001];
calculating the word frequency [formula image 003] of the single word [formula image 001] in corpus document D;
calculating the single word relevancy [formula image 004] of the single word [formula image 001] in corpus document D;
wherein
[formula image 005] = [formula image 006] + Norm, Norm being a field length normalization value;
i is a natural number, and N is the total number of corpus documents D;
[formula image 007] is the number of corpus documents D in which the single word [formula image 001] appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0, and full normalization is enabled when b is 1;
[formula image 008] = [formula image 007] / N;
[formula image 009]
|D| is the length of corpus document D;
[formula image 010] is the average length of the corpus documents D.
4. The word segmentation retrieval method according to claim 1, wherein the method further comprises: after the word relevancy is superposed to generate the relevancy score of the corpus document,
and calculating special weighting scores of the corpus documents according to a preset weighting rule, and sequencing the corpus documents according to the sum of the relevancy scores and the special weighting scores to generate a first retrieval result.
5. The word segmentation retrieval method according to claim 4, wherein the preset weighting rules comprise business weighting rules and relevancy weighting rules.
6. The word segmentation retrieval method according to claim 1, wherein the method further comprises: and according to the content of the corpus documents, dividing the corpus documents into a plurality of columns, sequencing the columns according to a preset sequencing rule to generate a second retrieval result, and combining the first retrieval result and the second retrieval result into a final retrieval result.
7. The word segmentation retrieval method according to claim 6, wherein the preset ordering rule comprises:
respectively sequencing the columns through a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model;
according to preset priority rules and the times of columns appearing in a user preference column model, a search term related column model, a search term click preference column model and a grammar dependency relationship model, column scoring is carried out on the columns;
and sorting the columns according to the column scores to generate a second retrieval result.
8. A word segmentation retrieval system, characterized in that the system is applied to a single-domain information retrieval platform, the system comprising:
the receiving module is used for receiving a search term input by a user;
the word segmentation module is used for carrying out single word segmentation on the search word;
the single word calculation module is used for calculating the single word relevancy of each corpus document;
the relevancy calculation module is used for superposing the single word relevancy to generate relevancy scores of the corpus documents;
and the output module is used for sequencing the corpus documents according to the relevancy scores to generate a first retrieval result.
9. The word segmentation retrieval system of claim 8, wherein the output module is further used for obtaining, after the corpus documents are sorted according to the relevancy scores, a preset number of corpus documents according to the ranking sequence to generate the first retrieval result.
10. The word segmentation retrieval system of claim 8, wherein the method by which the single word calculation module calculates the single word relevancy comprises:
calculating the inverse document frequency [formula image 002] of the single word [formula image 001];
calculating the word frequency [formula image 003] of the single word [formula image 001] in corpus document D;
calculating the single word relevancy [formula image 004] of the single word [formula image 001] in corpus document D;
wherein
[formula image 005] = [formula image 006] + Norm, Norm being a field length normalization value;
i is a natural number, and N is the total number of corpus documents D;
[formula image 007] is the number of corpus documents D in which the single word [formula image 001] appears;
k is a constant;
b is a preset parameter that controls the effect of the field length normalization value: normalization is disabled when b is 0, and full normalization is enabled when b is 1;
[formula image 008] = [formula image 007] / N;
[formula image 009]
|D| is the length of corpus document D;
[formula image 010] is the average length of the corpus documents D.
CN202111512996.0A 2021-12-11 2021-12-11 Word segmentation retrieval method and system Active CN114153949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111512996.0A CN114153949B (en) 2021-12-11 2021-12-11 Word segmentation retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111512996.0A CN114153949B (en) 2021-12-11 2021-12-11 Word segmentation retrieval method and system

Publications (2)

Publication Number Publication Date
CN114153949A true CN114153949A (en) 2022-03-08
CN114153949B CN114153949B (en) 2022-12-13

Family

ID=80450574

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111512996.0A Active CN114153949B (en) 2021-12-11 2021-12-11 Word segmentation retrieval method and system

Country Status (1)

Country Link
CN (1) CN114153949B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0520359A (en) * 1991-07-11 1993-01-29 Nippon Telegr & Teleph Corp <Ntt> Information retrieval system
CN102567376A (en) * 2010-12-16 2012-07-11 中国移动通信集团浙江有限公司 Method and device for recommending personalized search results
CN106095848A (en) * 2016-06-02 2016-11-09 北京奇虎科技有限公司 The method of text association, terminal unit and corresponding server unit
CN111538830A (en) * 2020-04-28 2020-08-14 清华大学 French retrieval method, French retrieval device, computer equipment and storage medium
US20210209482A1 (en) * 2020-09-24 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for verifying accuracy of judgment result, electronic device and medium
CN113112164A (en) * 2021-04-19 2021-07-13 特变电工股份有限公司新疆变压器厂 Transformer fault diagnosis method and device based on knowledge graph and electronic equipment
CN113486140A (en) * 2021-07-27 2021-10-08 平安国际智慧城市科技股份有限公司 Knowledge question-answer matching method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘琼茹: "基于Lucene的搜索排序算法研究与实现", 《无线互联科技》 *

Also Published As

Publication number Publication date
CN114153949B (en) 2022-12-13

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant