CN108416055B - Method and device for establishing pinyin database, electronic equipment and storage medium - Google Patents

Publication number
CN108416055B
Authority
CN
China
Prior art keywords
probability
pinyin
pronunciation
polyphone
database
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810229847.5A
Other languages
Chinese (zh)
Other versions
CN108416055A (en)
Inventor
张好
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Application filed by Beijing Sankuai Online Technology Co Ltd
Priority to CN201810229847.5A
Publication of CN108416055A
Application granted
Publication of CN108416055B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The disclosure provides a method and a device for establishing a pinyin database, a pinyin-based retrieval method and device, electronic equipment and a computer-readable storage medium, and relates to the technical field of the internet. The method comprises the following steps: constructing a pinyin database, wherein the pinyin database comprises each pronunciation of the polyphones among the Chinese characters and the initial probability of that pronunciation; establishing an object identifier-pinyin index according to the pinyin database; obtaining the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data; obtaining the current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability; and updating the object identifier-pinyin index according to the current probability of each pronunciation of the corresponding polyphone. In pinyin searches that involve polyphones, the method and the device can display the search results that users commonly want and remove redundant search results caused by uncommon pronunciations.

Description

Method and device for establishing pinyin database, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of internet technologies, and in particular, to a method and an apparatus for establishing a pinyin database, a method and an apparatus for pinyin-based retrieval, an electronic device, and a computer-readable storage medium.
Background
With the rapid increase in the amount of information in the internet, search services are becoming more and more important. For example, people search destinations in a map or travel software when going out, search restaurants in comment or take-away software when having meals, search user names in social software when making friends through a network, and the like. Compared with Chinese character search, pinyin search has the advantages of convenient input, large fuzzy search range and the like, and is widely popular with users. At present, many search engines and search services in application programs support Chinese character and pinyin search at the same time.
Polyphonic characters (polyphones) are frequently encountered during pinyin search. Most of the prior art handles them by semantic recognition; for example, "重庆" (Chongqing) is recognized as "chongqing", so a user cannot find it by searching "zhongqing". However, this approach cannot handle Chinese text that carries no semantics, such as personal names and business names. For such text, some prior art does not support polyphones at all: for example, the polyphone "种" can be read as "zhong" (third tone), "zhong" (fourth tone) or "chong", yet a user who searches for "chongyang" cannot find the name "种洋" and therefore does not obtain the desired result. Other prior art supports every pronunciation of a polyphone: for example, "腊" can be read as "la" or "xi", so a search for "zhangximei" may match the name "张腊梅". Most users do not know that "腊" can also be read as "xi" and regard this as an erroneous result, which harms the search experience.
Therefore, the technical scheme of the existing pinyin search has the problems of incomplete search results or redundant search results.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present disclosure, and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The present disclosure is directed to a method and apparatus for establishing a pinyin database, a method and apparatus for pinyin-based retrieval, an electronic device, and a computer storage medium, which overcome the problems of redundancy of search results or incomplete search results of pinyin search due to limitations and defects of the related art, at least to some extent.
Additional features and advantages of the disclosure will be set forth in the detailed description which follows, or in part will be obvious from the description, or may be learned by practice of the disclosure.
According to an aspect of the present disclosure, there is provided a method for establishing a pinyin database, including: constructing a pinyin database, wherein the pinyin database comprises each pronunciation of the polyphones among the Chinese characters and the initial probability of that pronunciation; establishing an object identifier-pinyin index according to the pinyin database; obtaining the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data; obtaining the current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability; and updating the object identifier-pinyin index according to the current probability of each pronunciation of the corresponding polyphone.
In an exemplary embodiment of the present disclosure, the constructing a pinyin database, where the pinyin database includes each pronunciation of the polyphones among the Chinese characters and its initial probability, includes: dividing the object identifier into a first part and a second part; setting a first initial probability for a first pronunciation of a polyphone in the first part; setting a second initial probability for a second pronunciation of the polyphone in the first part; wherein the first initial probability is greater than the second initial probability, and the sum of the first initial probability and the second initial probability is a preset constant.
In an exemplary embodiment of the present disclosure, the constructing a pinyin database further includes: setting the same third initial probability for each pronunciation of a polyphone in the second part; wherein the sum of the third initial probabilities of all pronunciations is the preset constant.
In an exemplary embodiment of the disclosure, the obtaining the statistical probability of each pronunciation of the corresponding polyphone according to the historical search click behavior data includes: counting search requests whose user-entered keywords include pinyin; recording the clicked object identifier corresponding to each search request; and obtaining the statistical probability of each pronunciation of the corresponding polyphone according to the search requests and the corresponding clicked object identifiers.
In an exemplary embodiment of the present disclosure, the updating the object identifier-pinyin index according to the current probability of each pronunciation of the corresponding polyphone includes: removing the object identifier-pinyin index corresponding to any pronunciation of the corresponding polyphone whose current probability is lower than a probability threshold.
According to an aspect of the present disclosure, there is provided a pinyin-based retrieval method, including: receiving an input search request, wherein the search request comprises pinyin of a target object identifier; obtaining a search result according to an object identifier-pinyin index in a pre-established pinyin database and the search request, wherein the pinyin database comprises probability information of each pronunciation of polyphone in the object identifier; sorting the search results according to the probability information; wherein the probability information is obtained according to the initial probability of each pronunciation of the polyphone and the historical search click behavior data.
According to an aspect of the present disclosure, there is provided an apparatus for establishing a pinyin database, including: a database construction module, configured to construct a pinyin database, wherein the pinyin database comprises each pronunciation of the polyphones among the Chinese characters and the initial probability of that pronunciation; an index establishing module, configured to establish an object identifier-pinyin index according to the pinyin database; a probability statistics module, configured to obtain the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data; a probability obtaining module, configured to obtain the current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability; and an index updating module, configured to update the object identifier-pinyin index according to the current probability of each pronunciation of the corresponding polyphone.
According to an aspect of the present disclosure, there is provided a pinyin-based retrieval apparatus, including: the request receiving module is used for receiving an input search request, wherein the search request comprises pinyin of a target object identifier; a search result acquisition module, configured to acquire a search result according to an object identifier-pinyin index and the search request in a pinyin database established in advance, where the pinyin database includes probability information of each pronunciation of a polyphone in the object identifier; the result sorting module is used for sorting the search results according to the probability information; wherein the probability information is obtained according to the initial probability of each pronunciation of the polyphone and the historical search click behavior data.
According to an aspect of the present disclosure, there is provided an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the program implementing any of the method steps described above when executed by the processor.
According to an aspect of the present disclosure, there is provided a computer readable medium, having stored thereon a computer program, which when executed by a processor, performs the method steps of any of the above.
Exemplary embodiments of the present disclosure have the following advantageous effects:
In the method for establishing the pinyin database, an initial probability is set for each pronunciation of each polyphone in the pinyin database, the statistical probability of each pronunciation is computed from historical search click behavior data, the current probability is determined from the initial probability and the statistical probability, and the object identifier-pinyin index is updated accordingly, so that a pinyin database whose index state is dynamically updated is established. On one hand, the object identifier-pinyin index for identifiers containing polyphones is built from the current probability of each pronunciation and therefore reflects actual use: it can contain all pronunciations that users commonly use while removing redundant, uncommon pronunciations, so that users can quickly find the target result during pinyin search and the search experience is improved. On the other hand, the pinyin database can collect and count users' search click behavior data and update the object identifier-pinyin index according to the calculated current probability, so that periodic updating and maintenance of the pinyin database can be automated and the efficiency of establishing the pinyin database is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without the exercise of inventive faculty.
FIG. 1 illustrates a flow diagram of a method for building a pinyin database;
FIG. 2 illustrates a system architecture diagram of a web application;
FIG. 3 illustrates a system architecture diagram for a stand-alone application;
FIG. 4 illustrates a flow chart of a method of obtaining statistical probabilities of the pronunciation of polyphones;
FIG. 5 illustrates a flow diagram of a method for pinyin-based retrieval;
FIG. 6 is a block diagram of an apparatus for building a Pinyin database;
FIG. 7 is a block diagram of an apparatus for pinyin-based retrieval;
FIG. 8 illustrates an electronic device for implementing the above-described method;
FIG. 9 illustrates a computer-readable storage medium for implementing the above-described method.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
It is noted that in this disclosure, the terms "a," "an," "the," and "said" are used to indicate the presence of one or more elements/components/etc.; the term "comprising" is used in an open-ended inclusive sense and means that there may be additional elements/components/etc. other than the listed elements/components/etc.; the terms "first" and "second", etc. are used merely as labels, and are not limiting on the number of objects.
An exemplary embodiment of the present disclosure first provides a method for establishing a pinyin database. The pinyin database can be used in an application program for searching, by pinyin, Chinese text that carries no semantics, such as personal names, enterprise names, restaurant names and residential community names. As shown in fig. 1, the method may include the following steps:
Step S110, constructing a pinyin database, wherein the pinyin database comprises each pronunciation of the polyphones among the Chinese characters and the initial probability of that pronunciation.
The pinyin database is an initial database and can be constructed from a dictionary. It contains the mapping between each polyphone and all of its pronunciations, such as the mapping from "腊" to "la" and "xi", and from "种" to "zhong" (third tone), "zhong" (fourth tone) and "chong". The pinyin database can also be constructed with an existing pinyin tool that supports polyphones, such as the open-source Java library pinyin4j. The initial probability is the probability that a polyphone maps to a given pronunciation and may be assigned according to a preset rule or experience. For example, all pronunciations may receive the same initial probability, such as 50% each for "腊" mapping to "la" and to "xi"; or the initial probability of a commonly used pronunciation may be raised, such as 80% for "腊" mapping to "la" and 20% for "腊" mapping to "xi". The sum of the initial probabilities of all pronunciations of a polyphone may be 1, or may be less than 1 so as to reserve a small probability for mispronunciations that are likely to occur.
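As a non-limiting illustration, the initial database of step S110 might be held as a nested mapping from each character to its pronunciations and initial probabilities. The function name `build_pinyin_database`, the `overrides` parameter and the tone-digit spellings `zhong3`/`zhong4` are assumptions made for this sketch and do not appear in the disclosure.

```python
def build_pinyin_database(readings, overrides=None, total=1.0):
    """Return {character: {pronunciation: initial probability}}.

    readings:  {character: [pronunciation, ...]}, e.g. taken from a dictionary.
    overrides: optional {character: {pronunciation: probability}} assigned
               by rule or experience (hypothetical parameter).
    total:     sum of the initial probabilities of one character; a value
               below 1 reserves probability for likely mispronunciations.
    """
    overrides = overrides or {}
    database = {}
    for char, prons in readings.items():
        if char in overrides:
            probs = dict(overrides[char])
        else:
            # Default rule: every pronunciation gets an equal share.
            probs = {p: total / len(prons) for p in prons}
        if sum(probs.values()) > 1.0 + 1e-9:
            raise ValueError(f"initial probabilities of {char!r} exceed 1")
        database[char] = probs
    return database


readings = {"腊": ["la", "xi"], "种": ["zhong3", "zhong4", "chong"]}
uniform = build_pinyin_database(readings)
weighted = build_pinyin_database(readings, overrides={"腊": {"la": 0.8, "xi": 0.2}})
```

Here `uniform` gives "la" and "xi" 50% each, while `weighted` reproduces the 80%/20% assignment mentioned above.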
Step S120, establishing an object identifier-pinyin index according to the pinyin database.
The object identification refers to Chinese characters or Chinese character combinations which can be searched by pinyin in an application program, such as contact names in an address book, user names in social applications and the like. The object identifier is stored in the program in the form of Chinese characters, and the program establishes an object identifier-pinyin index for the object identifier according to the mapping in the pinyin database. The index may be recorded in the background of the program, or may be presented in a specific form, such as automatically arranging the contact names according to their initials in the address book, displaying pinyin beside the Chinese characters of the user name in the social application, and so on. For the object identification containing polyphone, a plurality of object identification-pinyin indexes can be established according to different pronunciations of the polyphone, and the index probability can be set for the corresponding object identification-pinyin indexes according to the initial probability of each pronunciation. The index probabilities may be used to set different priorities for each index, for example, to make the index corresponding to the pronunciation with the highest initial probability be ranked at the front position, or to display only the index corresponding to the pronunciation with the highest initial probability and hide the indexes corresponding to other pronunciations in the program background, or may be used to compare the search results with other indexes with the same pinyin and different object identifiers and make the index with the higher index probability be ranked at the front position.
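One possible way to establish several object identifier-pinyin indexes for an identifier containing polyphones is sketched below. The function `build_index`, the sample identifier and the choice to multiply per-character probabilities into an index probability are assumptions of this sketch; the disclosure only states that the index probability is set according to the initial probabilities.

```python
from itertools import product


def build_index(identifiers, database):
    """Map each pinyin string to [(identifier, index probability), ...]."""
    index = {}
    for name in identifiers:
        # Each character contributes its (pronunciation, probability) options.
        options = [list(database[ch].items()) for ch in name]
        for combo in product(*options):
            pinyin = "".join(reading for reading, _ in combo)
            prob = 1.0
            for _, p in combo:
                prob *= p  # assumption: multiply per-character probabilities
            index.setdefault(pinyin, []).append((name, prob))
    # Under the same pinyin key, a higher index probability ranks first.
    for entries in index.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return index


# Illustrative data: both characters are polyphones.
database = {
    "曾": {"zeng": 0.9, "ceng": 0.1},
    "乐": {"le": 0.5, "yue": 0.5},
}
index = build_index(["曾乐"], database)
```

The sample identifier yields four index entries ("zengle", "zengyue", "cengle", "cengyue"), each carrying its own index probability that a program could use for ranking or for hiding low-priority entries.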
Step S130, according to historical search click behavior data, obtaining the statistical probability of each pronunciation of the corresponding polyphone.
The search click behavior refers to a search request received by the program, in which all or part of the keywords are entered as pinyin, together with the click by which the user selects a specific result after the program has returned the search results. In this embodiment, only the historical search click behavior data that involves polyphones may be counted. Taking the polyphone "乐" as an example, the number of times users find "乐" through "yue" and through "le" can be counted separately; the ratio of the "yue" count to the total count is the statistical probability of "yue", and likewise for "le". The statistical probabilities of all pronunciations of a polyphone may sum to 1.
Step S140, determining the current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability.
In the embodiment of the disclosure, the initial probability is an initial value assigned according to a preset rule or experience, and the statistical probability is a statistical result of the pronunciations that users actually use. The statistical probability generally reflects current actual use, but it may suffer from a short statistical period, an incomplete statistical range and similar problems. The current probability may therefore be calculated from both the initial probability and the statistical probability, for example as their arithmetic mean or as a weighted average. The calculation of the current probability is described in detail in the following embodiments.
Step S150, updating the object identification-pinyin index according to the current probability of each pronunciation of the corresponding polyphone.
In the embodiment of the disclosure, an object identifier containing a polyphone has a plurality of object identifier-pinyin indexes, and the index of a pronunciation with a high initial probability generally has a high priority. After the current probability of each pronunciation is determined, the indexes of the corresponding object identifier may be updated: for example, the priorities of the indexes corresponding to the pronunciations are rearranged according to the current probability, the index corresponding to a pronunciation whose current probability is too low is removed, the index corresponding to a pronunciation whose current probability is very high is highlighted, and so on.
In the exemplary embodiment, an initial probability is set for each pronunciation of the polyphone in the pinyin database, the statistical probability of each pronunciation in the historical search click behavior data is counted, the current probability is determined according to the initial probability and the statistical probability, and the object identifier-pinyin index is updated to establish the pinyin database with dynamically updated index states. On one hand, the object identification-pinyin index containing polyphone is made based on the current probability of each pronunciation, reflects the actual use condition of the user, not only can contain all pronunciations commonly used by the user, but also can remove redundant uncommon pronunciations, so that the user can conveniently and quickly find out a target result during pinyin search; meanwhile, the state of the object identification-pinyin index can be correspondingly adjusted according to changes of an object identification library, pinyin input habits and the like of the user, and the search experience of the user is improved. On the other hand, the pinyin database can collect and count search click behavior data of the user, and update the object identifier-pinyin index according to the calculated current probability, so that the automation of regular updating and maintenance of the pinyin database can be realized, and the efficiency of establishing the pinyin database is improved.
Fig. 2 shows a system architecture 200 of a web application, which may include one or more of terminal devices 201, 202, 203, a server 204 and a database 205. The method of the embodiment may be applied to the server 204, and the pinyin database which is indexed by using the usage preferences of all the users is established by collecting and counting the historical search click behavior data of all the users. Fig. 3 shows a system architecture 300 of a stand-alone application, which may install a client of the application on a terminal device 301, 302, 303, construct a pinyin database, update a pinyin index of an object identifier according to a use condition of the terminal user, and establish the pinyin database using the use preference of the terminal user as an index guide.
When the object identifier is a personal name, a polyphone often has a special pronunciation when used as a surname, so a higher initial probability can be set for that pronunciation. In an exemplary embodiment, the step of constructing a pinyin database that includes each pronunciation of the polyphones among the Chinese characters and its initial probability may include: dividing the object identifier into a first part and a second part; setting a first initial probability for a first pronunciation of a polyphone in the first part; setting a second initial probability for a second pronunciation of the polyphone in the first part; wherein the first initial probability is greater than the second initial probability, and the sum of the first initial probability and the second initial probability is a preset constant.
In the embodiment of the present disclosure, the first part may be the surname and the second part may be the given name. Taking the name "曾乐" as an example, "曾" is the first part and "乐" is the second part. "曾" has two pronunciations, "zeng" and "ceng", and as a surname it is usually read "zeng", so a higher initial probability may be set for "zeng", for example 90%, with 10% for "ceng". Typically the sum of the first initial probability and the second initial probability may be 1; it may also be a constant less than 1, such as 0.9, so that a small probability is reserved for indexing easily made misspellings such as "zwng" and "zemg".
It should be added that the above method is equally applicable to polyphones with a third or even a fourth pronunciation. For example, "宿" can be read "xiu" (third tone), "xiu" (fourth tone) and "su" (its reading as a surname); in the surname part a higher initial probability may be set for "su" and lower initial probabilities for the other pronunciations, that is, the first initial probability may be greater than the second initial probability, the third initial probability, and so on.
The pronunciations of a polyphone in the given-name part may in principle have the same initial probability. In an exemplary embodiment, the step of constructing the pinyin database may further include: setting the same third initial probability for each pronunciation of a polyphone in the second part; wherein the sum of the third initial probabilities of all pronunciations is the preset constant. Taking "曾乐" as an example, the initial probabilities of "乐" in the given name being read "le" and "yue" may both be 50%. The preset constant is the sum of the initial probabilities of all pronunciations of the polyphone, and may be 1 or less than 1.
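The division of a name into a first part and a second part may be sketched as follows. The function `name_initial_probabilities`, its parameters and the assumption that the surname is the first character are hypothetical choices for illustration only; compound surnames would need `surname_len` set accordingly.

```python
def name_initial_probabilities(name, readings, surname_reading,
                               first_prob=0.9, total=1.0, surname_len=1):
    """Return [(character, {pronunciation: initial probability}), ...].

    readings:        {character: [pronunciation, ...]}
    surname_reading: {character: its special pronunciation as a surname}
    first_prob:      first initial probability, given to the surname reading
    total:           preset constant that the probabilities sum to
    """
    result = []
    for pos, ch in enumerate(name):
        prons = readings[ch]
        special = surname_reading.get(ch)
        if pos < surname_len and special in prons and len(prons) > 1:
            # First part: the surname reading gets the larger probability,
            # the remaining pronunciations share what is left.
            other = (total - first_prob) / (len(prons) - 1)
            probs = {p: (first_prob if p == special else other) for p in prons}
        else:
            # Second part: every pronunciation gets the same probability.
            probs = {p: total / len(prons) for p in prons}
        result.append((ch, probs))
    return result


readings = {"曾": ["zeng", "ceng"], "乐": ["le", "yue"]}
probs = name_initial_probabilities("曾乐", readings, {"曾": "zeng"})
```

For "曾乐" this gives 0.9 for "zeng" and 0.1 for "ceng" in the surname part, and 0.5 each for "le" and "yue" in the given-name part.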
In an exemplary embodiment, referring to fig. 4, obtaining the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data may include the following steps: step S401, counting search requests whose user-entered keywords include pinyin; step S402, recording the clicked object identifier corresponding to each search request; step S403, obtaining the statistical probability of each pronunciation of the corresponding polyphone according to the search requests and the corresponding clicked object identifiers.
When a user performs a search, a keyword may generally be entered as Chinese characters or as pinyin. In this embodiment, only search requests whose keyword includes pinyin may be counted; the keyword may be pure pinyin or a mixture of pinyin and Chinese characters. The program records which object identifier the user clicks in the list of search results, forming a pinyin-object identifier search click record. For example, if the user searches for "zhangyuelei" and finally clicks a user whose name contains the polyphone "乐" and can be read either "zhangyuelei" or "zhanglelei", the program may record the pair consisting of the query "zhangyuelei" and the clicked object identifier. All search click records within a period of time are then collected and the statistical probability of each pronunciation is calculated. For example, if users searched for "zhanglelei" and finally clicked that name 400 times, and searched for "zhangyuelei" and finally clicked the same name 100 times, the statistical probability of "乐" being read "le" may be 400/(100+400) = 0.8, and the statistical probability of "乐" being read "yue" may be 100/(100+400) = 0.2.
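Steps S401 to S403 may be sketched as below, under the simplifying assumptions that each query is the full pinyin of the clicked identifier and that the click log is already aggregated into counts. The function name, the log format and the third character of the sample name (the source gives only its pinyin "lei") are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product


def statistical_probabilities(click_log, readings):
    """Return {polyphone: {pronunciation: statistical probability}}.

    click_log: iterable of (pinyin query, clicked identifier, click count).
    readings:  {character: [pronunciation, ...]}
    """
    counts = defaultdict(Counter)
    for query, identifier, times in click_log:
        options = [readings[ch] for ch in identifier]
        # Find the combination of pronunciations that spells the query.
        for combo in product(*options):
            if "".join(combo) == query:
                for ch, reading in zip(identifier, combo):
                    if len(readings[ch]) > 1:  # count polyphones only
                        counts[ch][reading] += times
                break
    return {
        ch: {r: n / sum(c.values()) for r, n in c.items()}
        for ch, c in counts.items()
    }


readings = {"张": ["zhang"], "乐": ["le", "yue"], "磊": ["lei"]}
log = [("zhanglelei", "张乐磊", 400), ("zhangyuelei", "张乐磊", 100)]
stats = statistical_probabilities(log, readings)
```

With the 400 and 100 clicks of the example above, `stats` assigns 0.8 to "le" and 0.2 to "yue". Partial or mixed pinyin and Chinese character keywords would need a more tolerant matching step than the exact comparison used here.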
To reduce the amount of data to be processed and to make the statistical result reflect users' latest search preferences, a statistical period may be set, for example one day, one week or one month. Each time, the search click behavior data of the latest statistical period is counted and the latest statistical probability is calculated; the current probability may then be calculated from the statistical probability of each statistical period and the initial probability.
In an exemplary embodiment, the current probability may be calculated as a weighted average of the statistical probability for each statistical period and the initial probability: for example, the initial probability of "le" reading "yue" and "le" is both 0.5; the statistical probability of "yue" in the first statistic is 0.8, and "le" is 0.2; the second statistic "yue" was 0.85 and "le" was 0.15. The current probability of "yue" can be calculated by:
[0.85×1+0.8×(1-0.1)+0.5×(1-0.1×2)]/(1+0.9+0.8)=0.73
the current probability of "le" may be:
[0.15×1+0.2×(1-0.1)+0.5×(1-0.1×2)]/(1+0.9+0.8)=0.27
Here, the weight coefficient of the statistical probability in the latest statistical period may be 1, and the weight coefficients of earlier statistical probabilities are attenuated by 0.1 per period, in order from the most recent to the oldest; the initial probability is weighted as the oldest term.
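The weighted average above can be reproduced with a short Python sketch. The function name and signature are illustrative assumptions; clamping the weights at 0 is also an assumption, chosen so that terms whose weight has fully decayed drop out of the average.

```python
def current_probability(initial, stats_oldest_first, decay=0.1):
    """Weighted average of per-period statistical probabilities and the initial probability.

    The newest period has weight 1, each older period loses `decay`,
    and the initial probability is treated as the oldest term.
    """
    # Order the values from newest to oldest, with the initial probability last.
    values = list(reversed(stats_oldest_first)) + [initial]
    # Linear decay, clamped at 0 (assumption) so old terms stop contributing.
    weights = [max(0.0, 1.0 - decay * k) for k in range(len(values))]
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


p_yue = current_probability(0.5, [0.8, 0.85])  # about 0.73
p_le = current_probability(0.5, [0.2, 0.15])   # about 0.27
```

The two results match the worked values 0.73 and 0.27 after rounding to two decimal places.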
With this calculation method, only the statistical probabilities of the latest 10 statistical periods participate in the calculation: once 10 or more statistical periods have elapsed, the weight coefficient of the initial probability has been attenuated to 0 and the initial probability no longer affects the current probability. Therefore, although an uncommon reading is given an initial probability, if no corresponding search click ever occurs in actual use, for example no user ever searches for "wax" by typing "xi", then after 10 statistical periods the current probability of "wax" being read as "xi" is 0, and the reading can be regarded as a redundant option and removed from the search results. In an exemplary embodiment, updating the object identifier-pinyin index based on the current probability of each pronunciation of the corresponding polyphone may include: removing the object identifier-pinyin index entries corresponding to any pronunciation of the corresponding polyphone whose current probability is 0. After the removal, when a user searches for "xi", "wax" may no longer appear in the search results.
It should be noted that, in the above embodiment, the amount by which the weight coefficient is attenuated each time may be adjusted. For example, when the statistical period is long and a higher relative weight is desired for recent statistical periods, the attenuation value may be set larger, such as 0.2 or 0.5; when the statistical period is short and more statistical periods should participate in the weighted average calculation of the current probability, the attenuation value may be set smaller, such as 0.05 or 0.02. With a different attenuation value, the time range over which the initial probability affects the current probability also differs: for example, with an attenuation value of 0.2 the initial probability may affect the current probability within 5 statistical periods, and with an attenuation value of 0.05 it may affect the current probability within 20 statistical periods.
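The relationship between the attenuation value and the number of statistical periods during which the initial probability still carries weight can be checked with a small sketch. The function name is hypothetical, and exact rational arithmetic is used only to avoid floating-point rounding when dividing by values such as 0.05.

```python
import math
from fractions import Fraction


def periods_until_initial_ignored(decay):
    """Smallest number of elapsed periods n for which the weight 1 - decay * n reaches 0."""
    step = Fraction(str(decay))  # exact value of the decimal literal, e.g. 1/20 for 0.05
    if step <= 0:
        raise ValueError("decay must be positive")
    return math.ceil(1 / step)


horizons = {d: periods_until_initial_ignored(d) for d in (0.1, 0.2, 0.05)}
```

For attenuation values 0.1, 0.2 and 0.05 this gives 10, 5 and 20 periods respectively, in line with the figures in the text.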
In an exemplary embodiment, to simplify the calculation of the current probability, the current probability may be calculated as a weighted average of the statistical probability of the latest statistical period and the current probability of the previous statistical period. For example, the initial probabilities of "le" being read as "yue" and as "le" are both 0.5; in the first statistical period the statistical probability of "yue" is 0.8 and that of "le" is 0.2. The current probability of "yue" at this time may then be:
[0.8×1+0.5×(1-0.2)]/(1+0.8)=0.67 (the calculation for "le" is omitted)
In the second statistical period the statistical probability of "yue" is 0.85 and that of "le" is 0.15. The current probability of "yue" may then be:
[0.85×1+0.67×(1-0.2)]/(1+0.8)=0.77 (the calculation for "le" is omitted)
Here the weight coefficient of the statistical probability in the latest statistical period may be 1, and the weight coefficient of the current probability of the previous statistical period is attenuated by 0.2. With this method the initial probability always continues to affect the current probability. Suppose the initial probability of "wax" being read as "xi" is 0.1 and no search click from "xi" to "wax" is ever counted in actual use; Table 1 shows the current probability of the uncommon reading "xi" in each statistical period for weight coefficient attenuation values of 0.2 and 0.1. It can be seen that the probability never falls to 0, no matter how many statistical periods pass. Thus, to remove redundant search results, in an exemplary embodiment, updating the object identifier-pinyin index based on the current probability of each pronunciation of the corresponding polyphone may include: removing the object identifier-pinyin index entries corresponding to any pronunciation of the corresponding polyphone whose current probability is lower than a probability threshold. The probability threshold is the lower limit of probability at which a pronunciation is still judged to be a common pronunciation; a pronunciation whose probability is below this limit can be judged to be uncommon. For example, when the weight coefficient is attenuated by 0.2, the probability threshold may be set to 3.0e-5; as shown in Table 1, after the 10th statistical period the current probability of the uncommon reading is lower than the probability threshold, and the corresponding object identifier-pinyin index entries may be removed. The probability threshold may be set to different values according to the calculation method of the current probability and the actual application.
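TABLE 1. Current probability of the uncommon reading "xi" in each statistical period, for weight coefficient attenuation values of 0.2 and 0.1 (provided as images BDA0001602413400000101 and BDA0001602413400000111 in the original publication)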
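The simplified recursive update and the threshold-based removal can be sketched as below. The function names, the dictionary layout, and the label "common_reading" are illustrative assumptions; the sketch does not reproduce the values of Table 1.

```python
def update_current(previous_current, latest_stat, decay=0.2):
    """One step of the simplified update: weight 1 for the latest statistical
    probability, weight (1 - decay) for the previous current probability."""
    w_prev = 1.0 - decay
    return (latest_stat * 1.0 + previous_current * w_prev) / (1.0 + w_prev)


def prune_readings(current_probs, threshold):
    """Keep only the readings whose current probability is not below the threshold."""
    return {r: p for r, p in current_probs.items() if p >= threshold}


# Reproduce the worked "yue" example: 0.5 -> about 0.67 -> about 0.77.
p1 = update_current(0.5, 0.8)
p2 = update_current(p1, 0.85)

# An uncommon reading that never receives a click decays but never reaches 0.
p_xi = 0.1
for _ in range(20):
    p_xi = update_current(p_xi, 0.0)

kept = prune_readings({"common_reading": 1.0 - p_xi, "xi": p_xi}, threshold=3.0e-5)
```

Because each step only multiplies the old value by a positive factor, the result stays strictly positive, which is why a threshold rather than a test for 0 is needed with this update rule.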
Exemplary embodiments of the present disclosure also provide a pinyin-based retrieval method, which may be used to retrieve, through pinyin, Chinese characters or Chinese character combinations that carry no semantic meaning, such as personal names, enterprise names, restaurant names, and residential community names. As shown in fig. 5, the method may include the following steps: step S510, receiving an input search request, wherein the search request comprises pinyin of a target object identifier; step S520, obtaining a search result according to the search request and an object identifier-pinyin index in a pre-established pinyin database, wherein the pinyin database comprises probability information of each pronunciation of the polyphones in the object identifier; step S530, sorting the search results according to the probability information; wherein the probability information is obtained according to the initial probability of each pronunciation of the polyphone and historical search click behavior data.
The search result refers to all Chinese character combinations mapped to the pinyin contained in the search request, and the results may be sorted according to the current probability that each Chinese character combination is read as the pinyin in the search request. For example, suppose a search for "yuelei" matches two users: one whose name has "yuelei" as its only reading, so that the probability of the name being read as "yuelei" is 1, and one whose name contains the polyphone "le", for which the probability of the reading "yue", as calculated from the statistical results in the foregoing embodiments, is 0.73 or 0.77. The first user may then be ranked ahead of the second in the search results.
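Steps S510 to S530 can be illustrated with a minimal Python sketch. The index layout (full pinyin string mapped to pairs of object identifier and probability), the substring matching, and the identifiers `user_a` and `user_b` are assumptions made for the example; the source does not fix these details.

```python
def search(index, query):
    """Return object identifiers whose indexed pinyin contains `query`,
    ordered by the probability of the matching reading (highest first)."""
    hits = {}
    for pinyin, entries in index.items():
        if query in pinyin:  # substring match (assumption)
            for object_id, prob in entries:
                # Keep the best probability if an object matches more than once.
                hits[object_id] = max(prob, hits.get(object_id, 0.0))
    return sorted(hits, key=lambda oid: hits[oid], reverse=True)


# Hypothetical index: user_a has a single reading, user_b contains a polyphone.
index = {
    "zhangyuelei": [("user_a", 1.0), ("user_b", 0.73)],
    "zhanglelei": [("user_b", 0.27)],
}
ranked = search(index, "yuelei")
```

Here `ranked` lists `user_a` before `user_b`, mirroring the ordering described in the text.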
It should be noted that the method of this embodiment may be applied to a server including a pinyin database, and provide a pinyin retrieval service by receiving a search request sent by a terminal device and sending a search result to the terminal device, and may also be applied to the terminal device, where the terminal device may directly obtain a search result from a built-in pinyin database, or may obtain a search result by sending a search request to the server.
The exemplary embodiment of the present disclosure further provides an apparatus for establishing a pinyin database, which may be applied to a server providing data interaction, or to a terminal device on which a client program is installed. As shown in fig. 6, the apparatus 600 for establishing a pinyin database may include: a database construction module 610, configured to construct a pinyin database, where the pinyin database includes each pronunciation of the polyphones among Chinese characters and the initial probability of each pronunciation; an index establishing module 620, configured to establish an object identifier-pinyin index according to the pinyin database; a probability statistics module 630, configured to obtain the statistical probability of each pronunciation of a corresponding polyphone according to historical search click behavior data; a probability obtaining module 640, configured to obtain a current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability; and an index updating module 650, configured to update the object identifier-pinyin index according to the current probability of each pronunciation of the corresponding polyphone.
In an exemplary embodiment, the database construction module may include: an object identifier dividing unit, configured to divide the object identifier into a first part and a second part; an initial probability setting unit for setting a first initial probability for a first reading of a polyphonic word in the first portion and a second initial probability for a second reading of the polyphonic word in the first portion; the first initial probability is greater than the second initial probability, and the sum of the first initial probability and the second initial probability is a preset constant.
In an exemplary embodiment, the initial probability setting unit may be further configured to set a same third initial probability for each reading of the polyphones in the second part; wherein the sum of the third initial probabilities for each pronunciation is the predetermined constant.
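The initial probability setting described for the first and second parts can be sketched as follows. The function name, the share 0.9 given to the first reading, and the constant 1.0 are hypothetical values chosen for illustration; the text only requires that the first initial probability exceed the second and that the probabilities sum to a preset constant.

```python
def initial_probabilities(readings, in_first_part, first_share=0.9, total=1.0):
    """Assign initial probabilities to the readings of one polyphone.

    In the first part, the first reading gets the larger share of `total`.
    In the second part, every reading gets the same share.
    """
    if not readings:
        raise ValueError("a polyphone needs at least one reading")
    if in_first_part and len(readings) == 2:
        if not 0.5 < first_share < 1.0:
            raise ValueError("the first reading must get the larger share")
        first = total * first_share
        return {readings[0]: first, readings[1]: total - first}
    # Second part (and, as an assumption, any other case): equal split.
    equal = total / len(readings)
    return {reading: equal for reading in readings}


first_part = initial_probabilities(["reading_1", "reading_2"], in_first_part=True)
second_part = initial_probabilities(["yue", "le"], in_first_part=False)
```

The equal split for the second part reproduces the initial probability of 0.5 used for "yue" and "le" in the method embodiments.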
In an exemplary embodiment, the probability statistics module may include: a search request counting unit, configured to count search requests in which the keyword input by a user includes pinyin; an object identifier recording unit, configured to record the object identifier clicked in response to each such search request; and a statistical probability determining unit, configured to obtain the statistical probability of each pronunciation of the corresponding polyphone according to the search requests and the correspondingly clicked object identifiers.
In an exemplary embodiment, the index updating module may include: an uncommon pronunciation cleaning unit, configured to remove the object identifier-pinyin index entries corresponding to any pronunciation of the corresponding polyphone whose current probability is lower than a probability threshold.
The exemplary embodiment of the present disclosure also provides a pinyin-based retrieval apparatus, which may be applied to a server providing data interaction, or may be applied to a terminal device installed with a client program. As shown in fig. 7, the pinyin-based retrieval apparatus 700 may include: a request receiving module 710, configured to receive an input search request, where the search request includes pinyin of a target object identifier; a search result obtaining module 720, configured to obtain a search result according to an object identifier-pinyin index in a pre-established pinyin database and the search request, where the pinyin database includes probability information of each pronunciation of a polyphone in the object identifier; and a result sorting module 730, configured to sort the search results according to the probability information. Wherein the probability information is obtained according to the initial probability of each pronunciation of the polyphone and the historical search click behavior data.
The details of each module/unit in the above apparatus for establishing a pinyin database and the apparatus based on pinyin retrieval have been described in detail in the embodiments of the corresponding method, and are not described herein again.
In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 800 according to this embodiment of the disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 8, electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, a bus 830 connecting different system components (including the memory unit 820 and the processing unit 810), and a display unit 840.
The storage unit stores program code that is executable by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification. For example, the processing unit 810 may perform the steps shown in fig. 1: step S110, constructing a pinyin database, wherein the pinyin database comprises each pronunciation of the polyphones among Chinese characters and the initial probability of each pronunciation; step S120, establishing an object identifier-pinyin index according to the pinyin database; step S130, obtaining the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data; step S140, obtaining the current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability; step S150, updating the object identifier-pinyin index according to the current probability of each pronunciation of the corresponding polyphone.
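As a rough illustration of steps S110 and S120, the sketch below builds an object identifier-pinyin index from per-character readings. The data layout, the placeholder character tokens, and the choice to multiply per-character probabilities (which assumes the readings are independent) are assumptions for this example and are not taken from the source.

```python
from itertools import product


def build_index(objects, readings):
    """Map every possible full pinyin string to (object_id, probability) pairs.

    `objects` maps an object identifier to the sequence of characters in its name;
    `readings` maps each character to {reading: probability}.
    """
    index = {}
    for object_id, name in objects.items():
        per_char = [readings[ch].items() for ch in name]
        # Enumerate every combination of one reading per character.
        for combo in product(*per_char):
            pinyin = "".join(reading for reading, _ in combo)
            prob = 1.0
            for _, p in combo:
                prob *= p  # independence of readings is an assumption
            index.setdefault(pinyin, []).append((object_id, prob))
    return index


# Placeholder tokens stand in for the Chinese characters of a name.
readings = {
    "<zhang>": {"zhang": 1.0},
    "<le/yue>": {"le": 0.27, "yue": 0.73},
    "<lei>": {"lei": 1.0},
}
index = build_index({"user_b": ["<zhang>", "<le/yue>", "<lei>"]}, readings)
```

A name with one two-reading polyphone produces two index entries, one per reading; step S150 would then rewrite or remove such entries as the current probabilities change.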
The storage unit 820 may include readable media in the form of volatile storage units, such as a random access storage unit (RAM)821 and/or a cache storage unit 822, and may further include a read only storage unit (ROM) 823.
Storage unit 820 may also include a program/utility 824 having a set (at least one) of program modules 825, such program modules 825 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 1000 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur via input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.
Referring to fig. 9, a program product 900 for implementing the above method according to an embodiment of the present disclosure is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present disclosure is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the Internet using an Internet service provider).
Furthermore, the above-described figures are merely schematic illustrations of processes included in methods according to exemplary embodiments of the present disclosure, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is to be limited only by the terms of the appended claims.

Claims (9)

1. A method for establishing a pinyin database, comprising:
constructing a pinyin database, wherein the pinyin database comprises each pronunciation of polyphones in Chinese characters and the initial probability thereof;
establishing an object identifier-pinyin index according to the pinyin database;
obtaining the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data;
obtaining the current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability, comprising:
setting a statistical period;
determining the weight coefficient of the statistical probability in the latest statistical period, wherein the weight coefficients of the previous statistical probabilities are attenuated one by one from the near to the far according to the time sequence;
adjusting the value of each attenuation of the weight coefficient;
calculating the current probability as a weighted average of the statistical probability of each statistical period and the initial probability;
updating the object identifier-pinyin index according to the current probability of each pronunciation of the corresponding polyphone;
and removing the object identification-pinyin index corresponding to the pronunciation of which the current probability is lower than a probability threshold value in the corresponding polyphone.
2. The method of claim 1, wherein the constructing a pinyin database that includes each pronunciation of a polyphonic character in a chinese character and its initial probability comprises:
dividing the object identification into a first part and a second part;
setting a first initial probability for a first pronunciation of a polyphonic word in the first portion;
setting a second initial probability for a second reading of the polyphonic word in the first portion;
the first initial probability is greater than the second initial probability, and the sum of the first initial probability and the second initial probability is a preset constant.
3. The method of claim 2, wherein the constructing a pinyin database that includes each pronunciation of a polyphonic character in a chinese character and its initial probability further comprises:
setting the same third initial probability for each pronunciation of the polyphones in the second part;
wherein the sum of the third initial probabilities for each pronunciation is the predetermined constant.
4. The method of claim 1, wherein the obtaining the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data comprises:
counting search requests in which the keyword input by a user includes pinyin;
recording the clicked object identification corresponding to the search request;
and obtaining the statistical probability of each pronunciation of the corresponding polyphone according to the search request and the corresponding clicked object identifier.
5. The method according to any one of claims 1-4, comprising:
receiving an input search request, wherein the search request comprises pinyin of a target object identifier;
obtaining a search result according to an object identifier-pinyin index in a pre-established pinyin database and the search request, wherein the pinyin database comprises probability information of each pronunciation of polyphone in the object identifier;
sorting the search results according to the probability information;
wherein the probability information is obtained according to the initial probability of each pronunciation of the polyphone and the historical search click behavior data.
6. An apparatus for building a pinyin database, comprising:
a database construction module, configured to construct a pinyin database, wherein the pinyin database comprises each pronunciation of polyphones in Chinese characters and the initial probability thereof;
an index establishing module, configured to establish an object identifier-pinyin index according to the pinyin database;
a probability statistics module, configured to obtain the statistical probability of each pronunciation of the corresponding polyphone according to historical search click behavior data;
a probability obtaining module, configured to obtain a current probability of each pronunciation of the corresponding polyphone according to the initial probability and the statistical probability, including:
setting a statistical period;
determining the weight coefficient of the statistical probability in the latest statistical period, wherein the weight coefficients of the previous statistical probabilities are attenuated one by one from the near to the far according to the time sequence;
adjusting the value of each attenuation of the weight coefficient;
calculating the current probability as a weighted average of the statistical probability of each statistical period and the initial probability;
the index updating module is used for updating the object identification-pinyin index according to the current probability of each pronunciation of the corresponding polyphone;
and removing the object identification-pinyin index corresponding to the pronunciation of which the current probability is lower than a probability threshold value in the corresponding polyphone.
7. The apparatus of claim 6, comprising:
the request receiving module is used for receiving an input search request, wherein the search request comprises pinyin of a target object identifier;
a search result acquisition module, configured to acquire a search result according to an object identifier-pinyin index and the search request in a pinyin database established in advance, where the pinyin database includes probability information of each pronunciation of a polyphone in the object identifier;
the result sorting module is used for sorting the search results according to the probability information;
wherein the probability information is obtained according to the initial probability of each pronunciation of the polyphone and the historical search click behavior data.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the program realizes the method steps of any of claims 1-5 when executed by the processor.
9. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 5.
CN201810229847.5A 2018-03-20 2018-03-20 Method and device for establishing pinyin database, electronic equipment and storage medium Active CN108416055B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810229847.5A CN108416055B (en) 2018-03-20 2018-03-20 Method and device for establishing pinyin database, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN108416055A CN108416055A (en) 2018-08-17
CN108416055B true CN108416055B (en) 2021-05-25

Family

ID=63133030

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810229847.5A Active CN108416055B (en) 2018-03-20 2018-03-20 Method and device for establishing pinyin database, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN108416055B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177317A (en) * 2019-12-20 2020-05-19 吕梁学院 Literature theory rapid retrieval query system and method
CN111078898B (en) * 2019-12-27 2023-08-08 出门问问创新科技有限公司 Multi-tone word annotation method, device and computer readable storage medium
CN111145724B (en) * 2019-12-31 2022-08-19 出门问问信息科技有限公司 Polyphone marking method and device and computer readable storage medium
CN112395844B (en) * 2020-11-16 2024-01-30 北京字节跳动网络技术有限公司 Pinyin generation method and device and electronic equipment
CN115273809A (en) * 2022-06-22 2022-11-01 北京市商汤科技开发有限公司 Training method of polyphone pronunciation prediction network, and speech generation method and device
CN115905297B (en) * 2023-01-04 2023-12-15 脉策(上海)智能科技有限公司 Method, apparatus and medium for retrieving data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682031A (en) * 2011-03-17 2012-09-19 新奥特(北京)视频技术有限公司 Method and system of Chinese Pin Yin search suggest based on relational database
CN104142909A (en) * 2014-05-07 2014-11-12 腾讯科技(深圳)有限公司 Method and device for phonetic annotation of Chinese characters
CN106201011A (en) * 2016-06-30 2016-12-07 北京奇虎科技有限公司 The search method of the communication information and device and terminal unit
CN107291730A (en) * 2016-03-31 2017-10-24 阿里巴巴集团控股有限公司 Method, device and the probabilistic dictionaries construction method of correction suggestion are provided query word
CN107515850A (en) * 2016-06-15 2017-12-26 阿里巴巴集团控股有限公司 Determine the methods, devices and systems of polyphone pronunciation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050163B (en) * 2013-03-11 2017-08-25 广州帷策智能科技有限公司 Content recommendation system

Also Published As

Publication number Publication date
CN108416055A (en) 2018-08-17

Similar Documents

Publication Publication Date Title
CN108416055B (en) Method and device for establishing pinyin database, electronic equipment and storage medium
CN110362372B (en) Page translation method, device, medium and electronic equipment
US7908287B1 (en) Dynamically autocompleting a data entry
KR101099278B1 (en) System and method for user modeling to enhance named entity recognition
US20060212433A1 (en) Prioritization of search responses system and method
RU2726728C2 (en) Identification of query templates and associated aggregate statistics among search queries
US20120166438A1 (en) System and method for recommending queries related to trending topics based on a received query
KR20160030943A (en) Performing an operation relative to tabular data based upon voice input
JP2008159044A (en) System and method for adaptive spell check
US20100161591A1 (en) System and method of geo-based prediction in search result selection
US9262446B1 (en) Dynamically ranking entries in a personal data book
WO2010144704A1 (en) Predictive searching and associated cache management
CN112269816B (en) Government affair appointment correlation retrieval method
CN112487150B (en) File management method, system, storage medium and electronic equipment
US11226972B2 (en) Ranking collections of document passages associated with an entity name by relevance to a query
KR20060116042A (en) Personalized search method using cookie information and system for enabling the method
TW201915777A (en) Financial analysis system and method for unstructured text data
US10073839B2 (en) Electronically based thesaurus querying documents while leveraging context sensitivity
CN114840671A (en) Dialogue generation method, model training method, device, equipment and medium
EP3961426A2 (en) Method and apparatus for recommending document, electronic device and medium
JP2018501540A (en) Stopword identification method and apparatus
CN111538815A (en) Text query method, device, equipment and storage medium
CN111435406A (en) Method and device for correcting database statement spelling errors
US20180307744A1 (en) Named entity-based category tagging of documents
CN111143556A (en) Software function point automatic counting method, device, medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant