CN112307183A - Search data identification method and device, electronic equipment and computer storage medium - Google Patents


Info

Publication number
CN112307183A
Authority
CN
China
Prior art keywords
search
candidate
preset
characteristic information
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011191952.8A
Other languages
Chinese (zh)
Other versions
CN112307183B (en)
Inventor
孙健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jindi Credit Service Co ltd
Original Assignee
Beijing Jindi Credit Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jindi Credit Service Co ltd
Priority to CN202011191952.8A
Publication of CN112307183A
Application granted
Publication of CN112307183B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9532 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3322 Query formulation using system suggestions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The disclosure relates to a search data identification method, a search data identification device, an electronic device and a storage medium. The method comprises the following steps: in response to an input data search request, analyzing whether a search word in the data search request comprises first characteristic information meeting a preset characteristic condition; performing word segmentation processing on the first characteristic information according to a preset word segmentation strategy, and obtaining multiple groups of rewritten candidate words corresponding to the first characteristic information; sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set; and acquiring first recall result data according to the candidate search set. The method and the device can improve the accuracy of search results, so that the returned results are closer to what the user expects.

Description

Search data identification method and device, electronic equipment and computer storage medium
Technical Field
The present disclosure relates to the field of data processing, and in particular, to a search data identification method, apparatus, electronic device, and computer storage medium.
Background
In information search scenarios, when a user searches for a term that was only heard in passing or is otherwise unfamiliar, the user is very likely to fall back on pinyin or abbreviated pinyin; for example, a user who is unsure of the exact characters of 'sky-eye search' may search for 'tianyan search' instead. Likewise, when the user types in a hurry, or the pinyin input method does not offer the correct candidate word, the user tends to enter uncertain pinyin, abbreviated pinyin, or incomplete pinyin fragments directly, such as 'wuhan Ji quiet wuz' (for 'wuhan Ji quiet materials') or 'china postal express logistics share limited g' (for 'china postal express logistics share limited'). For incomplete search expressions of this kind that contain pinyin, the true search intention is difficult to identify, and most searches either return no results or return results that are inaccurate and deviate from what the user expects.
Accordingly, there is a need for one or more methods to address the above-mentioned problems.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
An object of the present disclosure is to provide a search data identification method, apparatus, electronic device, and computer-readable storage medium, which overcome one or more of the problems due to the limitations and disadvantages of the related art, at least to some extent.
According to an aspect of the present disclosure, there is provided a search data identification method including:
responding to an input data search request, and analyzing whether a search word in the data search request comprises first characteristic information meeting a preset characteristic condition;
when first characteristic information meeting preset characteristic conditions is included, performing word segmentation processing on the first characteristic information according to a preset word segmentation strategy, and obtaining multiple groups of rewriting candidate words corresponding to the first characteristic information;
sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
and acquiring first recall result data according to the candidate search set.
In an exemplary embodiment of the present disclosure, the analyzing whether the search term in the data search request includes first feature information that satisfies a preset feature condition includes:
detecting whether a search word in the data search request comprises a pinyin syllable segment;
and if the search word comprises a pinyin syllable segment, determining that the search word in the data search request comprises first characteristic information meeting a preset characteristic condition, wherein the pinyin syllable segment is the first characteristic information.
In an exemplary embodiment of the present disclosure, the sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set includes:
calculating the scores of all groups of rewritten candidate words according to a preset sorting algorithm to obtain a score result;
and sorting the rewriting candidate words according to the grading result to obtain a sorted candidate search set.
In an exemplary embodiment of the present disclosure, sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set includes any one or more of the following:
determining the number of independent syllables of each rewritten candidate word, and sorting the groups of rewritten candidate words according to the number of independent syllables to obtain a sorted candidate search set;
or,
determining the syllable prefix matching degree of each rewritten candidate word, and sorting the groups of rewritten candidate words according to the syllable prefix matching degree to obtain a sorted candidate search set.
In an exemplary embodiment of the present disclosure, after obtaining a plurality of groups of rewrite candidate words corresponding to the first feature information, the method further includes:
acquiring fuzzy search result data according to each group of candidate rewriting words;
determining the word segmentation frequency of each group of candidate rewriting words in fuzzy search result data;
sorting the multiple groups of rewriting candidate words according to a preset sorting algorithm to obtain a sorted candidate search set, including:
and sequencing all groups of rewritten candidate words according to the word segmentation frequency to obtain a sequenced candidate search set.
In an exemplary embodiment of the present disclosure, the sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set includes:
performing perplexity calculation on the multiple groups of rewritten candidate words after word segmentation processing to obtain perplexity scores;
and sorting, in ascending order of perplexity score, a preset number of rewritten candidate words with the lowest perplexity scores to obtain a sorted candidate search set.
In an exemplary embodiment of the present disclosure, performing word segmentation processing on the first feature information according to a preset word segmentation policy includes:
segmenting the first characteristic information according to an initial and final comparison table; or
performing word segmentation processing on the first characteristic information using a forward maximum matching word segmentation algorithm based on a pinyin byte dictionary.
In an exemplary embodiment of the present disclosure, the method further comprises:
acquiring second characteristic information included in a search word in the data search request;
extracting word granularity and phrase granularity of the second characteristic information;
acquiring second recall result data corresponding to second characteristic information according to the word granularity and the phrase granularity of the second characteristic information;
and taking the first recall result data and the second recall result data as response information of the data search request.
In one aspect of the present disclosure, there is provided a search data identification apparatus including:
the characteristic analysis module is used for responding to an input data search request and analyzing whether a search word in the data search request comprises first characteristic information meeting a preset characteristic condition;
the word segmentation processing module is used for performing word segmentation processing on first characteristic information according to a preset word segmentation strategy when the first characteristic information meets a preset characteristic condition, and obtaining a plurality of groups of rewriting candidate words corresponding to the first characteristic information;
the candidate word sorting module is used for sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
and the result recalling module is used for acquiring first recall result data according to the candidate search set.
In one aspect of the present disclosure, there is provided an electronic device including:
a processor; and
a memory having computer readable instructions stored thereon which, when executed by the processor, implement a method according to any of the above.
In an aspect of the disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, realizes the method according to any one of the above.
A search data identification method in an exemplary embodiment of the present disclosure includes: in response to an input data search request, analyzing whether a search word in the data search request comprises first characteristic information meeting a preset characteristic condition; performing word segmentation processing on the first characteristic information according to a preset word segmentation strategy, and obtaining multiple groups of rewritten candidate words corresponding to the first characteristic information; sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set; and acquiring first recall result data according to the candidate search set. The method and the device can improve the accuracy of search results, so that the returned results are closer to what the user expects.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features and advantages of the present disclosure will become more apparent by describing in detail exemplary embodiments thereof with reference to the attached drawings.
FIG. 1 shows a flow diagram of a search data identification method according to an example embodiment of the present disclosure;
FIG. 2 shows a schematic block diagram of a search data identification apparatus according to an example embodiment of the present disclosure;
FIG. 3 schematically illustrates a block diagram of an electronic device according to an exemplary embodiment of the present disclosure; and
FIG. 4 schematically illustrates a schematic diagram of a computer-readable storage medium according to an exemplary embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The same reference numerals denote the same or similar parts in the drawings, and thus, a repetitive description thereof will be omitted.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, materials, devices, steps, and so forth. In other instances, well-known structures, methods, devices, implementations, materials, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. That is, these functional entities may be implemented in the form of software, or in one or more software-hardened modules, or in different networks and/or processor devices and/or microcontroller devices.
In the present exemplary embodiment, there is first provided a search data identification method; referring to fig. 1, the search data recognition method may include the steps of:
step S110, responding to an input data search request, and analyzing whether a search word in the data search request comprises first characteristic information meeting a preset characteristic condition;
step S120, when first characteristic information meeting preset characteristic conditions is included, performing word segmentation processing on the first characteristic information according to a preset word segmentation strategy, and obtaining multiple groups of rewriting candidate words corresponding to the first characteristic information;
step S130, sorting the multiple groups of rewriting candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
step S140, obtaining first recall result data according to the candidate search set.
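Steps S110 to S140 can be read as a small pipeline. The following Python sketch only illustrates the control flow; the function name `identify_search_data` and its four callable parameters are hypothetical stand-ins, and the fallback used when no first characteristic information is found (recalling with the unmodified query) is an assumption that the disclosure does not specify.

```python
from typing import Callable, List


def identify_search_data(
    query: str,
    find_feature: Callable[[str], str],
    segment_and_rewrite: Callable[[str, str], List[str]],
    rank: Callable[[List[str]], List[str]],
    recall: Callable[[List[str]], List[str]],
) -> List[str]:
    """Wire steps S110-S140 together; every callable is a hypothetical stand-in."""
    feature = find_feature(query)                      # S110: look for a pinyin segment
    if not feature:
        # Assumption: with nothing to rewrite, recall with the query as typed.
        return recall([query])
    candidates = segment_and_rewrite(query, feature)   # S120: segment and rewrite
    candidate_set = rank(candidates)                   # S130: sort the candidates
    return recall(candidate_set)                       # S140: first recall result data


# Toy usage with placeholder logic for each step.
results = identify_search_data(
    "network technology youxgs",
    find_feature=lambda q: "youxgs" if "youxgs" in q else "",
    segment_and_rewrite=lambda q, f: [
        q.replace(f, "priority"),
        q.replace(f, "limited company"),
    ],
    rank=lambda cands: sorted(cands, key=len, reverse=True),
    recall=lambda cands: [f"result for {c}" for c in cands],
)
```

Passing the steps in as callables keeps the sketch neutral about which segmentation strategy or sorting algorithm is used.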
Next, the search data identification method in the present exemplary embodiment will be further explained.
In step S110, it may be analyzed whether a search word in the data search request includes first feature information satisfying a preset feature condition in response to the input data search request.
In existing search scenarios, input problems or uncertainty about the intended content often lead users to enter a combination of Chinese characters and pinyin letters, or of Chinese characters and abbreviated pinyin. Existing search algorithms generally build a huge pinyin prefix tree to identify the pinyin letters. For certain specific searches, however, for example where the search content is typically a proper noun such as a company name, this approach not only occupies a large amount of memory but also fails to identify the proper noun quickly and accurately, which degrades the user experience.
When a data search request input by a user is received, the request is responded to and information such as the search word carried in the request is obtained. The first characteristic information may be a pinyin syllable segment of the search word, and each search word in a data search request may contain one or more continuous or discontinuous pinyin syllable segments. Analyzing whether a search word in the data search request includes first characteristic information satisfying a preset characteristic condition includes: detecting whether the search word includes a pinyin syllable segment; and if so, determining that the search word includes first characteristic information satisfying the preset characteristic condition, the pinyin syllable segment being the first characteristic information. In other words, the method checks whether the search word combines Chinese characters with pinyin syllable segments, or with abbreviated pinyin syllables (such as the initial of a pinyin syllable). If it does, the data search request is determined to include first characteristic information satisfying the preset characteristic condition, and the first characteristic information is the pinyin syllable segment, or the abbreviated pinyin syllable segment, in the search word. For example, if the search word is "shendi hyan", which includes the pinyin syllable segment "hyan", then "hyan" is determined to be the first characteristic information.
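The detection described above can be sketched as follows. This is a minimal illustration that treats every run of ASCII letters in the query as a candidate pinyin syllable segment; the regular expression, the function names, and the sample queries are assumptions made for illustration and are not taken from the disclosure.

```python
import re
from typing import List

# Assumption: a pinyin segment shows up as a run of ASCII letters inside the query.
LETTER_RUN = re.compile(r"[A-Za-z]+")


def find_pinyin_segments(query: str) -> List[str]:
    """Return every letter run in the query, lower-cased, in order of appearance."""
    return [match.group(0).lower() for match in LETTER_RUN.finditer(query)]


def has_first_feature(query: str) -> bool:
    """True when the query contains at least one candidate pinyin segment."""
    return bool(find_pinyin_segments(query))
```

With the hypothetical query "北京科技youxgs" this returns `["youxgs"]`, and with "北京kj有限gs" it returns the two discontinuous segments `["kj", "gs"]`.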
Because English words and pinyin syllable segments both consist of letters, misjudgment can easily occur when deciding whether a letter string is a pinyin syllable segment. Therefore, when determining the first characteristic information, it may further be checked whether the candidate segment is an English word. If it is, it is determined not to satisfy the preset characteristic condition and is ignored. For example, in "universal key WIFI company office diz", "WIFI" is identified as a non-pinyin segment, while "diz" in "company office diz" is the first characteristic information (pinyin syllable segment) satisfying the preset characteristic condition.
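One simple way to realize this check is a lookup against a list of known English words. The sketch below assumes such a list is available; the set `KNOWN_ENGLISH`, the function name, and the sample query are hypothetical, and a production system would likely need a much larger vocabulary or a statistical classifier.

```python
import re
from typing import List

# Hypothetical word list; a real deployment would load a proper English vocabulary.
KNOWN_ENGLISH = {"wifi", "app", "online"}


def pinyin_segments_excluding_english(query: str) -> List[str]:
    """Keep letter runs that are not known English words."""
    segments = re.findall(r"[A-Za-z]+", query)
    return [s.lower() for s in segments if s.lower() not in KNOWN_ENGLISH]
```

For the hypothetical query "万能钥匙WIFI公司diz", "WIFI" is dropped and only `["diz"]` remains as first characteristic information.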
In step S120, word segmentation processing may be performed on the first feature information according to a preset word segmentation policy, and multiple groups of rewriting candidate words corresponding to the first feature information are obtained.
The preset word segmentation strategy may be any one or more of a tree-based segmentation method, an initial-based segmentation method, an initial and final comparison table segmentation method, a forward or backward maximum matching segmentation method, and the like. Using the preset strategy, the first characteristic information containing pinyin syllable segments can be segmented and turned into multiple groups of rewritten candidate words.
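Of the strategies listed, forward maximum matching is the easiest to show in a few lines. The sketch below assumes a small syllable dictionary (`SYLLABLES` is a hypothetical, heavily truncated stand-in for a full pinyin syllable table) and falls back to a single letter when no dictionary entry matches, so that abbreviated syllables such as "g" or "s" survive as their own segments.

```python
from typing import List, Set

# Hypothetical, truncated syllable dictionary used only for illustration.
SYLLABLES: Set[str] = {"you", "xian", "xiang", "xi", "an", "gong", "si", "ke", "ji"}


def forward_max_match(text: str, dictionary: Set[str] = SYLLABLES) -> List[str]:
    """Scan left to right, always taking the longest dictionary entry that fits."""
    max_len = max(len(entry) for entry in dictionary)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single letter is always accepted so that the scan cannot stall.
            if piece in dictionary or size == 1:
                tokens.append(piece)
                i += size
                break
    return tokens


print(forward_max_match("youxiangs"))   # ['you', 'xiang', 's']
print(forward_max_match("kejyouxgs"))   # ['ke', 'j', 'you', 'x', 'g', 's']
```

Forward maximum matching returns exactly one segmentation, so on its own it cannot produce the multiple groups of segments that the method ranks later; it would have to be combined with another strategy or run against several dictionaries.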
When first characteristic information satisfying the preset characteristic condition is included, performing word segmentation processing on the first characteristic information according to a preset word segmentation strategy includes: segmenting the first characteristic information according to an initial and final comparison table; or segmenting it using a forward maximum matching algorithm based on a pinyin byte dictionary. For example, for the search word "network technology youxiangs", after the first characteristic information "youxiangs" is determined, it is segmented by either method, yielding multiple groups of segments such as "you-xian-g-s", "you-xi-an-g-s", and "you-xiang-s". Each group of segments may then be rewritten to obtain multiple groups of rewritten candidate words, for example "company Limited", "Games Ann company", "have similar", and so on. In embodiments of the present invention, different rewriting rules for candidate words may be preset for different application scenarios. For example, when the method is applied to enterprise information query, enterprise information rules may be summarized from enterprise information, and the rewritten candidate words may be derived according to those rules: since most enterprise names end with the suffix "limited company", the rewritten candidate word "limited company" can be obtained for the segmented pinyin syllable segment from the enterprise suffixes in the rules.
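The multiple groups of segments in the example can be reproduced by enumerating every way to split the fragment into full syllables and single-letter abbreviations. The sketch below is one possible reading of table-based segmentation, not the patented procedure: `SYLLABLES` and `INITIALS` are hypothetical, truncated tables, and the step that maps each segmentation to Chinese rewritten candidate words is omitted.

```python
from typing import List

# Hypothetical, truncated tables used only for illustration.
SYLLABLES = {"you", "xian", "xiang", "xi", "an"}
INITIALS = set("bpmfdtnlgkhjqxrzcsyw")  # single letters accepted as abbreviated syllables


def all_segmentations(text: str) -> List[List[str]]:
    """Enumerate every split of text into known syllables or single-letter initials."""
    if not text:
        return [[]]
    results: List[List[str]] = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in SYLLABLES or (end == 1 and piece in INITIALS):
            for rest in all_segmentations(text[end:]):
                results.append([piece] + rest)
    return results


groups = ["-".join(seg) for seg in all_segmentations("youxiangs")]
```

Here `groups` contains "you-xi-an-g-s", "you-xian-g-s", and "you-xiang-s", matching the three segmentations in the example. The recursion is exponential in the worst case, so a longer fragment would call for memoization.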
In step S130, the multiple groups of rewriting candidate words may be ranked according to a preset ranking algorithm, so as to obtain a ranked candidate search set.
Specifically, the step of sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set may include multiple ways, for example:
1. calculating the scores of all groups of rewritten candidate words according to a preset sorting algorithm to obtain a score result; and sorting the rewriting candidate words according to the grading result to obtain a sorted candidate search set.
For example, perplexity may be calculated for the multiple groups of rewritten candidate words after word segmentation processing, giving a language model perplexity score (ppl) for each. The lower the ppl, the higher the overall score; for example, of the rewritten candidates "kendyl" (ppl 0.91) and "kendyr chicken" (ppl 11.2) for "kendyj", the former is better. If the perplexity score of a rewritten candidate word is higher than a preset discard threshold, that candidate may be discarded. The remaining rewritten candidate words are then sorted in ascending order of perplexity score to obtain the sorted candidate search set. Specifically, the perplexity of a candidate can be calculated with an n-gram statistical language model: the lower the perplexity, the more fluent the phrase and the better the rewrite. The language model needs to be trained on corpora collected for the business scenario, and can then receive rewritten candidate words online in real time and return their perplexity scores.
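As a rough illustration of perplexity-based ranking, the sketch below trains a bigram model with add-one smoothing on a toy corpus, scores each candidate, discards candidates above a threshold, and sorts the rest in ascending order. The choice of bigrams, the smoothing scheme, the function names, and the parameters `discard_above` and `top_n` are all assumptions; the disclosure only states that an n-gram statistical language model is used.

```python
import math
from collections import Counter
from typing import List, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigram(corpus: List[List[str]]) -> Model:
    """Count context unigrams and bigrams over sentences padded with <s> and </s>."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        vocab.update(tokens)
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams, len(vocab)


def perplexity(tokens: List[str], unigrams: Counter, bigrams: Counter, vocab_size: int) -> float:
    """Perplexity of one candidate under the add-one smoothed bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(padded[:-1], padded[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / (len(padded) - 1))


def rank_by_perplexity(candidates: List[List[str]], model: Model,
                       discard_above: float, top_n: int) -> List[List[str]]:
    """Drop candidates above the threshold, then keep the top_n lowest-perplexity ones."""
    scored = [(perplexity(c, *model), c) for c in candidates]
    kept = sorted((sc for sc in scored if sc[0] <= discard_above), key=lambda sc: sc[0])
    return [c for _, c in kept[:top_n]]


model = train_bigram([
    ["network", "technology", "limited", "company"],
    ["trade", "limited", "company"],
])
ranked = rank_by_perplexity(
    [["priority", "company"], ["limited", "company"]],
    model, discard_above=100.0, top_n=5,
)
```

In this toy setup `["limited", "company"]` ranks first because that bigram occurs in the training corpus. A perplexity computed this way is always at least 1, so absolute values from this sketch are not comparable to the scores quoted in the example above.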
2. Judging the number of independent syllables of each rewritten candidate word; and sequencing each group of rewritten candidate words according to the number of the independent syllables to obtain a sequenced candidate search set.
When sorting by the number of independent syllables, the rewritten candidate words may be sorted in descending order of their number of independent syllables to obtain the candidate search set. For example, if the data search request is "beijing jinembankment kejyouxgs", the first characteristic information is determined to be "kejyouxgs". Word segmentation of the first characteristic information yields rewritten candidate words including "beijing jinembankment technology limited" and "beijing jinembankment technology priority". The candidate "beijing jinembankment technology limited" has the largest number of independent syllables and is therefore placed before the other candidates, such as "beijing jinembankment technology priority". The remaining candidates are ordered in the same way to generate the sorted candidate search set.
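The disclosure does not define "number of independent syllables" precisely. The sketch below assumes one plausible reading: each rewritten candidate records which segments of the typed fragment it accounts for, and the count of those segments is the sort key. The `Candidate` type, its fields, and the sample segmentations are hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str             # the rewritten candidate word
    syllables: List[str]  # assumed: segments of the typed fragment this rewrite covers


def rank_by_syllable_count(candidates: List[Candidate]) -> List[Candidate]:
    """Sort in descending order of covered syllables; ties keep their input order."""
    return sorted(candidates, key=lambda c: len(c.syllables), reverse=True)


ranked = rank_by_syllable_count([
    Candidate("technology priority", ["ke", "j", "you", "x"]),
    Candidate("technology limited company", ["ke", "j", "you", "x", "g", "s"]),
])
```

Under this reading, a candidate that explains all six segments of "kejyouxgs" outranks one that explains only four of them.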
3. Determining syllable prefix matching degree of each rewriting candidate word; and sequencing all the groups of rewritten candidate words according to the syllable prefix matching degree to obtain a sequenced candidate search set.
When sorting by syllable prefix matching degree, the rewritten candidate words are sorted in descending order of their syllable prefix matching degree to obtain the candidate search set. For example, if the data search request is "goodness dia", the first characteristic information is determined to be "dia". Word segmentation of the first characteristic information yields rewritten candidate words including "goodness power", "goodness ground", and the like. The candidate "goodness power" has the highest syllable prefix matching degree and is therefore placed before the other candidates, such as "goodness ground". The remaining candidates are ordered by syllable prefix matching degree in the same way to generate the sorted candidate search set.
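The disclosure does not give a formula for the syllable prefix matching degree. The sketch below assumes a simple one: the length of the common prefix between the typed fragment and the candidate's full pinyin, divided by the length of the fragment. The function names and the pinyin readings attached to the two candidates ("dian" and "di") are assumptions made for illustration.

```python
from typing import List, Tuple


def prefix_match_degree(fragment: str, candidate_pinyin: str) -> float:
    """Share of the typed fragment that is a prefix of the candidate's pinyin."""
    matched = 0
    for typed, actual in zip(fragment, candidate_pinyin):
        if typed != actual:
            break
        matched += 1
    return matched / len(fragment) if fragment else 0.0


def rank_by_prefix_match(fragment: str,
                         candidates: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Sort (text, pinyin) pairs in descending order of prefix matching degree."""
    return sorted(candidates,
                  key=lambda c: prefix_match_degree(fragment, c[1]),
                  reverse=True)


ranked = rank_by_prefix_match("dia", [
    ("goodness ground", "di"),
    ("goodness power", "dian"),
])
```

With the fragment "dia", the assumed reading "dian" matches all three typed letters while "di" matches only two, so "goodness power" is ranked first.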
4. Acquiring fuzzy search result data according to each group of candidate rewriting words; determining the word segmentation frequency of each group of candidate rewriting words in fuzzy search result data; and sequencing all groups of rewritten candidate words according to the word segmentation frequency to obtain a sequenced candidate search set. For example:
Fuzzy search result data is obtained for each group of rewritten candidate words, and the groups are sorted in descending order of their word segmentation frequency in the fuzzy search result data. For example, if the data search request is "shendi hyan", the first characteristic information is determined to be "hyan". A fuzzy search on the first characteristic information returns multiple results, including "fannese holy sea company", "jenima holy sea company", "navier holy sea company", "flower holy sea ya company", and so on. Among all fuzzy search result data, "sheng sea" has the highest word segmentation frequency, so the rewritten candidate word "sheng di sea" is placed ahead of the other candidates. The remaining groups of rewritten candidate words are ordered by their word segmentation frequency in the fuzzy search result data in the same way to generate the sorted candidate search set.
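Frequency-based ranking can be sketched by counting how often each candidate appears in the fuzzy search results. The version below uses plain substring counts as a simplification; the disclosure speaks of word segmentation frequency, which would normally be counted on segmented tokens. The function name and the second candidate "holy ground" are hypothetical.

```python
from typing import List


def rank_by_result_frequency(candidates: List[str], fuzzy_results: List[str]) -> List[str]:
    """Sort candidates by how many times they occur across the fuzzy search results."""
    frequency = {c: sum(result.count(c) for result in fuzzy_results) for c in candidates}
    return sorted(candidates, key=lambda c: frequency[c], reverse=True)


fuzzy_results = [
    "fannese holy sea company",
    "jenima holy sea company",
    "navier holy sea company",
    "flower holy sea ya company",
]
ranked = rank_by_result_frequency(["holy ground", "holy sea"], fuzzy_results)
```

Because "holy sea" occurs in every result and "holy ground" in none, "holy sea" is ranked first.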
In step S140, first recall result data may be obtained according to the candidate search set.
In the embodiment of the present example, the corresponding search result data may be recalled according to each rewritten candidate word in the ranked candidate search set, and the ranking condition in the candidate search set may be used as a recall order of the search result data; and the search result data obtained by taking the sorting condition in the candidate search set as a recall sequence is the first recall result data, and the first recall result data is used as the response information of the data search request.
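Recalling in the order of the ranked candidate search set can be sketched with an in-memory lookup. The dictionary `index`, which maps a candidate to its matching records, is a hypothetical stand-in for the real search backend, and the de-duplication of records is an assumption not stated in the disclosure.

```python
from typing import Dict, List


def recall_in_candidate_order(candidate_set: List[str],
                              index: Dict[str, List[str]]) -> List[str]:
    """Collect results candidate by candidate, keeping the candidates' ranking order."""
    seen = set()
    results: List[str] = []
    for candidate in candidate_set:
        for record in index.get(candidate, []):
            if record not in seen:   # assumption: each record is returned once
                seen.add(record)
                results.append(record)
    return results


index = {
    "limited company": ["record A", "record B"],
    "priority": ["record B", "record C"],
}
first_recall = recall_in_candidate_order(["limited company", "priority"], index)
```

Results for the top-ranked candidate come first, so `first_recall` lists "record A" and "record B" before "record C".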
In an embodiment of the present example, the method further comprises: acquiring second characteristic information included in a search word in the data search request; extracting word granularity and phrase granularity of the second characteristic information; acquiring second recall result data corresponding to second characteristic information according to the word granularity and the phrase granularity of the second characteristic information; and taking the first recall result data and the second recall result data as response information of the data search request.
In this example embodiment, in addition to using the pinyin syllable segment in the data search request as the first characteristic information, the Chinese character text portion of the data search request may be used as second characteristic information, and the word granularity and phrase granularity of that text portion are extracted. Granularity here is a measure of the amount of information a piece of text contains: the more information, the larger the granularity, and the less information, the smaller the granularity. Search result data is then looked up in a database according to the word granularity and phrase granularity of the Chinese character text portion, giving second recall result data corresponding to the second characteristic information. The second recall result data and the first recall result data together serve as the response information of the data search request.
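The handling of second characteristic information can be sketched as follows. The sketch assumes that the Chinese text portion is the run of CJK characters in the query, that word granularity can be approximated by single characters, and that phrase granularity can be approximated by adjacent character pairs; a real system would more likely use a trained word segmenter. All function names and the sample query are hypothetical.

```python
import re
from typing import List, Tuple

# Assumption: the Chinese text portion consists of characters in the basic CJK block.
HAN_RUN = re.compile(r"[\u4e00-\u9fff]+")


def extract_second_feature(query: str) -> str:
    """Concatenate the Chinese character runs of the query."""
    return "".join(HAN_RUN.findall(query))


def word_and_phrase_terms(text: str) -> Tuple[List[str], List[str]]:
    """Approximate word granularity by characters and phrase granularity by pairs."""
    words = list(text)
    phrases = [text[i:i + 2] for i in range(len(text) - 1)]
    return words, phrases


def merge_recalls(first: List[str], second: List[str]) -> List[str]:
    """Combine both recall lists, keeping order and dropping repeated records."""
    return list(dict.fromkeys(first + second))
```

For the hypothetical query "北京科技youxgs", the second characteristic information is "北京科技", and the records recalled for its terms would be merged with the first recall result data to form the response.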
It should be noted that although the various steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that these steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into a single step, and/or a single step may be split into multiple steps.
Further, in the present exemplary embodiment, a search data identification apparatus is also provided. Referring to fig. 2, the search data identification apparatus 200 may include: a feature analysis module 210, a word segmentation processing module 220, a candidate word sorting module 230, and a result recall module 240. Wherein:
the feature analysis module 210 is configured to, in response to an input data search request, analyze whether a search word in the data search request includes first characteristic information that satisfies a preset characteristic condition;
the word segmentation processing module 220 is configured to perform word segmentation processing on the first characteristic information according to a preset word segmentation strategy, and obtain multiple groups of rewritten candidate words corresponding to the first characteristic information;
the candidate word sorting module 230 is configured to sort the multiple groups of rewritten candidate words according to a preset sorting algorithm, so as to obtain a sorted candidate search set;
and the result recall module 240 is configured to obtain first recall result data according to the candidate search set.
The specific details of each module of the search data identification apparatus have already been described in the corresponding search data identification method, and are therefore not repeated here.
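To illustrate how the four modules might chain together, the following sketch maps each module to one function. It is only an illustration under assumptions: the tiny syllable table, the exhaustive enumeration of segmentations, the "fewer syllables first" ordering, and the dictionary index are all hypothetical stand-ins for the preset characteristic condition, the preset word segmentation strategy, the preset sorting algorithm, and the database:

```python
import re
from typing import Dict, List, Optional

SYLLABLES = {"xi", "an", "xian"}   # hypothetical tiny syllable table
LATIN_RUN = re.compile(r"[a-z]+")  # stand-in test for a pinyin segment


def feature_analysis(query: str) -> Optional[str]:
    """Module 210: return the Latin-letter run treated as pinyin, if any."""
    match = LATIN_RUN.search(query.lower())
    return match.group(0) if match else None


def word_segmentation(pinyin: str) -> List[List[str]]:
    """Module 220: enumerate every split of the run into known syllables."""
    if not pinyin:
        return [[]]
    groups: List[List[str]] = []
    for end in range(1, len(pinyin) + 1):
        head = pinyin[:end]
        if head in SYLLABLES:
            groups += [[head] + rest for rest in word_segmentation(pinyin[end:])]
    return groups


def candidate_sorting(groups: List[List[str]]) -> List[List[str]]:
    """Module 230: assumed ordering, groups with fewer syllables first."""
    return sorted(groups, key=len)


def result_recall(ranked: List[List[str]], index: Dict[str, List[str]]) -> List[str]:
    """Module 240: recall records in the order of the candidate search set."""
    results: List[str] = []
    for group in ranked:
        for record in index.get(" ".join(group), []):
            if record not in results:
                results.append(record)
    return results


def identify(query: str, index: Dict[str, List[str]]) -> List[str]:
    pinyin = feature_analysis(query)
    if pinyin is None:  # no first characteristic information found
        return []
    return result_recall(candidate_sorting(word_segmentation(pinyin)), index)


toy_index = {"xian": ["record A"], "xi an": ["record B"]}
first_recall = identify("xian", toy_index)
```

Because the modules only exchange plain lists, two of them could be merged or one could be split further without changing the overall flow, which is consistent with the statement that the module division is not mandatory.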
It should be noted that although several modules or units of the search data identification apparatus 200 are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In addition, in an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 300 according to such an embodiment of the invention is described below with reference to fig. 3. The electronic device 300 shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 3, the electronic device 300 is embodied in the form of a general purpose computing device. The components of the electronic device 300 may include, but are not limited to: at least one processing unit 310, at least one storage unit 320, a bus 330 connecting different system components (including the storage unit 320 and the processing unit 310), and a display unit 340.
The storage unit 320 stores program code that is executable by the processing unit 310 to cause the processing unit 310 to perform the steps according to various exemplary embodiments of the present invention as described in the "exemplary method" section above of the present specification. For example, the processing unit 310 may perform steps S110 to S140 as shown in fig. 1.
The storage unit 320 may include storage media in the form of volatile storage units, such as a random access storage unit (RAM) 3201 and/or a cache storage unit 3202, and may further include a read only storage unit (ROM) 3203.
The storage unit 320 may also include a program/utility 3204 having a set (at least one) of program modules 3205, such program modules 3205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 330 may be one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 300 may also communicate with one or more external devices 370 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 300, and/or with any device (e.g., a router, a modem, etc.) that enables the electronic device 300 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 350. Also, the electronic device 300 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet) via the network adapter 360. As shown, the network adapter 360 communicates with the other modules of the electronic device 300 via the bus 330. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 300, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB disk, a removable hard disk, etc.) or on a network, and which includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, aspects of the invention may also be implemented in the form of a program product comprising program code means for causing a terminal device to carry out the steps according to various exemplary embodiments of the invention described in the above-mentioned "exemplary methods" section of the present description, when said program product is run on the terminal device.
Referring to fig. 4, a program product 400 for implementing the above method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more storage media. The storage medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any storage medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes involved in methods according to exemplary embodiments of the invention, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (11)

1. A search data identification method, the method comprising:
responding to an input data search request, and analyzing whether a search word in the data search request comprises first characteristic information meeting a preset characteristic condition;
if the first characteristic information meeting the preset characteristic condition is included, performing word segmentation processing on the first characteristic information according to a preset word segmentation strategy, and obtaining multiple groups of rewriting candidate words corresponding to the first characteristic information;
sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
and acquiring first recall result data according to the candidate search set.
2. The method of claim 1, wherein the analyzing whether the search terms in the data search request include first feature information that satisfies a preset feature condition comprises:
detecting whether a search word in the data search request comprises a pinyin syllable segment;
and if the search word comprises a pinyin syllable segment, determining that the search word in the data search request comprises first characteristic information meeting a preset characteristic condition, wherein the pinyin syllable segment is the first characteristic information.
3. The method of claim 1, wherein ranking the plurality of groups of rewritten candidate words according to a preset ranking algorithm to obtain a ranked candidate search set comprises:
calculating the scores of all groups of rewritten candidate words according to a preset sorting algorithm to obtain a score result;
and sorting the rewriting candidate words according to the grading result to obtain a sorted candidate search set.
4. The method of claim 2, wherein sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set comprises any one or more of:
determining the number of independent syllables of each rewritten candidate word; and sorting the groups of rewritten candidate words according to the number of independent syllables to obtain a sorted candidate search set;
or
determining the syllable prefix matching degree of each rewritten candidate word; and sorting the groups of rewritten candidate words according to the syllable prefix matching degree to obtain a sorted candidate search set.
5. The method of claim 2, wherein after obtaining a plurality of groups of rewritten candidate words corresponding to the first characteristic information, the method further comprises:
acquiring fuzzy search result data according to each group of candidate rewriting words;
determining the word segmentation frequency of each group of candidate rewriting words in fuzzy search result data;
sorting the multiple groups of rewriting candidate words according to a preset sorting algorithm to obtain a sorted candidate search set, including:
and sequencing all groups of rewritten candidate words according to the word segmentation frequency to obtain a sequenced candidate search set.
6. The method of any one of claims 1-3, wherein sorting the plurality of groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set comprises:
performing confusion calculation on the multiple groups of rewritten candidate words after word segmentation processing to obtain confusion scores;
and sorting a preset number of the rewritten candidate words having the lowest confusion scores in ascending order of confusion score to obtain a sorted candidate search set.
7. The method of claim 1, wherein performing word segmentation processing on the first feature information according to a preset word segmentation strategy comprises:
segmenting the first characteristic information according to an initial consonant and final comparison table; or
performing word segmentation processing on the first characteristic information by a forward maximum matching word segmentation algorithm based on a pinyin byte dictionary.
8. The method of claim 1, wherein the method further comprises:
acquiring second characteristic information included in a search word in the data search request;
and acquiring corresponding search result data by combining a plurality of groups of rewritten candidate words in the sorted candidate search set according to the second characteristic information, wherein the search result data is the first recall result data.
9. An apparatus for identifying search data, the apparatus comprising:
the characteristic analysis module is used for responding to an input data search request and analyzing whether a search word in the data search request comprises first characteristic information meeting a preset characteristic condition;
the word segmentation processing module is used for performing word segmentation processing on first characteristic information according to a preset word segmentation strategy when the first characteristic information meets a preset characteristic condition, and obtaining a plurality of groups of rewriting candidate words corresponding to the first characteristic information;
the candidate word sorting module is used for sorting the multiple groups of rewritten candidate words according to a preset sorting algorithm to obtain a sorted candidate search set;
and the result recalling module is used for acquiring first recall result data according to the candidate search set.
10. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs which,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer storage medium on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out the method of any one of claims 1-8.
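For illustration only, the forward maximum matching alternative recited in claim 7 can be sketched as follows. The syllable dictionary `PINYIN_SYLLABLES` is a hypothetical six-entry sample rather than the dictionary used by the applicant, and emitting an unmatched letter as a single-character token is an assumed fallback that the claims do not specify:

```python
from typing import List, Set

# Hypothetical sample dictionary of pinyin syllables.
PINYIN_SYLLABLES: Set[str] = {"bei", "jing", "ke", "ji", "gong", "si"}


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """Scan left to right, always taking the longest dictionary entry."""
    max_len = max(len(entry) for entry in dictionary)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary or size == 1:
                # Assumed fallback: an unmatched letter becomes its own token.
                tokens.append(piece)
                i += size
                break
    return tokens


segments = forward_max_match("beijingkeji", PINYIN_SYLLABLES)
```

Forward maximum matching yields a single segmentation per input. Obtaining the multiple groups of rewritten candidate words of claim 1 would require an additional step, for example also trying shorter matches, which this sketch does not attempt.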

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011191952.8A CN112307183B (en) 2020-10-30 2020-10-30 Search data identification method, apparatus, electronic device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011191952.8A CN112307183B (en) 2020-10-30 2020-10-30 Search data identification method, apparatus, electronic device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112307183A true CN112307183A (en) 2021-02-02
CN112307183B CN112307183B (en) 2024-04-19

Family

ID=74333104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011191952.8A Active CN112307183B (en) 2020-10-30 2020-10-30 Search data identification method, apparatus, electronic device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112307183B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120297294A1 (en) * 2011-05-17 2012-11-22 Microsoft Corporation Network search for writing assistance
US20150370859A1 (en) * 2014-06-23 2015-12-24 Google Inc. Contextual search on multimedia content
CN107609098A (en) * 2017-09-11 2018-01-19 北京金堤科技有限公司 Searching method and device
CN108108497A (en) * 2018-01-29 2018-06-01 上海名轩软件科技有限公司 Keyword recommendation method and equipment
CN108170293A (en) * 2017-12-29 2018-06-15 北京奇虎科技有限公司 Input the personalized recommendation method and device of association
US20190057159A1 (en) * 2017-08-15 2019-02-21 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus, server, and storage medium for recalling for search
CN109828981A (en) * 2017-11-22 2019-05-31 阿里巴巴集团控股有限公司 A kind of data processing method and calculate equipment
CN110619076A (en) * 2018-12-25 2019-12-27 北京时光荏苒科技有限公司 Search term recommendation method and device, computer and storage medium
WO2020062680A1 (en) * 2018-09-30 2020-04-02 平安科技(深圳)有限公司 Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium
CN111324700A (en) * 2020-02-21 2020-06-23 北京声智科技有限公司 Resource recall method and device, electronic equipment and computer-readable storage medium
CN111369996A (en) * 2020-02-24 2020-07-03 网经科技(苏州)有限公司 Method for correcting text error in speech recognition in specific field
CN111428494A (en) * 2020-03-11 2020-07-17 中国平安人寿保险股份有限公司 Intelligent error correction method, device and equipment for proper nouns and storage medium
CN111488426A (en) * 2020-04-17 2020-08-04 支付宝(杭州)信息技术有限公司 Query intention determining method and device and processing equipment
CN111737977A (en) * 2020-06-24 2020-10-02 平安科技(深圳)有限公司 Data dictionary generation method, data query method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
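宫法明; 朱朋海: "基于自适应隐马尔可夫模型的石油领域文档分词" (Document word segmentation for the petroleum domain based on an adaptive hidden Markov model), 计算机科学 (Computer Science), no. 1, 15 June 2018 (2018-06-15), pages 110 - 113 *
白双成: "蒙古文原始语料统计建模研究" (Research on statistical modeling of raw Mongolian corpora), 中文信息学报 (Journal of Chinese Information Processing), no. 01, 15 January 2017 (2017-01-15), pages 123 - 130 *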

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113569010A (en) * 2021-07-23 2021-10-29 北京百度网讯科技有限公司 Method, device, equipment and storage medium for filtering search results
CN113569010B (en) * 2021-07-23 2023-12-12 北京百度网讯科技有限公司 Method, device, equipment and storage medium for filtering search result

Also Published As

Publication number Publication date
CN112307183B (en) 2024-04-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant