CN102193920A - Name word stock generating method and device as well as text input system - Google Patents

Info

Publication number
CN102193920A
Authority
CN
China
Prior art keywords
name
thesaurus
word
people
user
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010101180249A
Other languages
Chinese (zh)
Other versions
CN102193920B (en)
Inventor
宋爱元
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201010118024.9A
Publication of CN102193920A
Application granted
Publication of CN102193920B
Legal status: Active
Anticipated expiration

Abstract

The invention belongs to the field of computer software and provides a name word stock generating method and device as well as a text input system. The method comprises the following steps: obtaining a user word stock; extracting a name corpus from the obtained user word stock; screening name word groups from the name corpus to generate a name word stock; and adjusting the word frequency of the name word groups in the name word stock. Because the name corpus is extracted from user word stocks rather than from a fixed source, it is not limited in size, so a better name word stock can be generated and the input method produces more reasonable and more accurate output when names are entered.

Description

A name word stock generating method and device, and a text input system
Technical Field
The invention belongs to the field of computer software, and in particular relates to a name word stock generating method and device and a text input system.
Background
An input method is an encoding scheme used to enter symbols into a computer or another device such as a mobile phone. Chinese character encoding schemes essentially associate the sound, shape, or meaning of a character with specific keys, and then combine those keys to enter different Chinese characters.
At present, many Chinese input methods support a dedicated name input mode. Such a mode is built by compiling all surnames that exist in China and extracting personal names from alumni address books, mailboxes, or other databases to form a name corpus. The name word stock is then trained on this corpus: characters and words that may serve as names are extracted together with their frequencies of occurrence, forming the name word stock of the input method. During input, the input method combines surnames and given names according to the pinyin string entered by the user and the data in the name word stock, and outputs the resulting names.
The approach above relies on name corpus statistics and can only draw on a limited corpus. Whether the source is an alumni address book, a mailbox, or another name corpus, the number of names that can be obtained is limited, which in turn limits the corpus training result, the name output, and the accuracy.
Summary of the Invention
An object of the embodiments of the invention is to provide a name word stock generating method that addresses the problem that existing name corpora are limited, which restricts the corpus training result, the name output, and the accuracy.
The embodiments of the invention are realized as a name word stock generating method comprising the following steps:
obtaining a user word stock;
extracting a name corpus from the obtained user word stock;
screening name word groups from the name corpus to generate a name word stock;
adjusting the word frequency of the name word groups in the name word stock.
Another object of the embodiments of the invention is to provide a name word stock generating device, the device comprising:
a user word stock acquiring unit, used to obtain a user word stock;
a name corpus extraction unit, used to extract a name corpus from the obtained user word stock;
a name word stock generation unit, used to screen name word groups from the name corpus to generate a name word stock;
a name word stock, used to store the name word groups; and
a word frequency adjustment unit, used to adjust the word frequency of the name word groups in the name word stock.
A further object of the embodiments of the invention is to provide a text input system comprising the above name word stock generating device.
The embodiments of the invention extract a name corpus from user word stocks and screen name word groups to build a name word stock. Because the name corpus is no longer limited to a fixed source, a better name word stock can be generated, improving the reasonableness and accuracy of the input method when it outputs names.
Description of the Drawings
Fig. 1 is a flow chart of the name word stock generating method provided by an embodiment of the invention;
Fig. 2 is a structural diagram of the name word stock generating device provided by an embodiment of the invention.
Detailed Description
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The embodiments of the invention extract a name corpus from the user word stocks of an input method, screen it to generate a name word stock, train and optimize the name word stock, and deliver the optimized name word stock as an update to users of the input method, improving the reasonableness and accuracy of name word groups.
Fig. 1 shows the flow of the name word stock generating method provided by an embodiment of the invention, described in detail as follows:
In step S101, a user word stock is obtained.
When a user enters text, the input method records the strings entered by the user locally, forming a user word stock. In the embodiments of the invention, the input method may report the user word stock to a background server automatically, or the user may actively report the local user word stock to the background server. The background server obtains the user word stocks reported by the input method or by users and extracts the name corpus from them.
In step S102, a name corpus is extracted from the obtained user word stock.
In the embodiments of the invention, forming name word groups first requires collecting surnames, so that it is possible to tell whether a word or phrase (2 to 4 characters) is a name. Because the set of Chinese surnames is limited, the surnames can readily be collected by hand on the basis of the Hundred Family Names to generate a surname dictionary.
After obtaining the user word stock, the embodiments of the invention extract the name corpus from it according to surname.
Surnames are either single-character or compound (two-character), and given names generally have one or two characters. With a single-character surname, a full name therefore generally has at most three characters and at least two; with a compound surname, a full name generally has at most four characters and at least three. When extracting the name corpus, the embodiments of the invention can apply these naming rules: the surnames in the surname dictionary are used as keywords to search the user word stock, and the one or two characters following a surname are extracted as name corpus that may exist in the user word stock. For example, if a string in the user word stock is "Zhang San has gone to school", the surname dictionary identifies "Zhang" as a surname, and the one character and the two characters that follow it, "San" and "San go", are extracted as name corpus.
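The extraction rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_name_candidates`, the tiny `SURNAMES` set, and the Chinese string 张三去上学了 (an assumed rendering of the "Zhang San has gone to school" example) are all hypothetical.

```python
# Hypothetical surname dictionary; a real one would be compiled by hand
# from a full surname list such as the Hundred Family Names.
SURNAMES = {"张", "李", "欧阳"}


def extract_name_candidates(entry, surnames=SURNAMES):
    """Return (surname, given_name) candidates found in one word stock entry."""
    candidates = []
    for start in range(len(entry)):
        # Try compound (two-character) surnames before single-character ones.
        for surname_len in (2, 1):
            surname = entry[start:start + surname_len]
            if len(surname) != surname_len or surname not in surnames:
                continue
            rest = entry[start + surname_len:]
            # Naming rule: a given name has one or two characters.
            for given_len in (1, 2):
                if len(rest) >= given_len:
                    candidates.append((surname, rest[:given_len]))
    return candidates


# Assumed rendering of the example sentence from the description.
print(extract_name_candidates("张三去上学了"))
```

The sketch deliberately over-generates: it returns both the one-character and the two-character continuation after each surname and leaves the decision about which ones are real names to the screening step.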
In step S103, name word groups are screened from the name corpus to generate the name word stock.
In the embodiments of the invention, the name corpus is obtained simply by taking the one or two characters after a surname in the user word stock, so it may contain data that are not name word groups, such as "San go" taken from "Zhang San go". The name corpus therefore needs to be screened to filter out data that are not name word groups.
First, single-character given names are treated directly as name word groups and written to the name word stock.
For name corpus consisting of a single-character given name, such as "San" in "Zhang San", the character can still form part of a name even if it is not a complete given name. Single-character name corpus can therefore be regarded as name word groups and written directly to the name word stock.
Some names stack the surnames of both parents and then add a one-character given name, for example "Yan Yangtian". Name corpus of this kind is likewise written directly to the name word stock as a single-character name word group.
Second, when a name that a user needs to enter frequently is not in the user word stock, the user often spells it out in full once and stores it in the user word stock for convenient later use. Name word groups that users have created themselves in the user word stock can therefore be written directly to the name word stock.
In addition, meaningful words can generally serve as two-character name word groups. In the embodiments of the invention, the two-character name corpus is compared with the words in the core word stock of the input method, or in another more accurate dictionary, to separate meaningful words from words that may be meaningless; meaningful words are essentially regarded as name word groups. For example, "San go" in "Zhang San go" is not a meaningful word and cannot be found in the core word stock of the input method, so it is held back for further screening. If a two-character candidate is "longevity", it can be found in a standard dictionary, so "longevity" is regarded as a name word group and written to the name word stock.
Because the capacity and accuracy of a standard dictionary are limited, a large share of the name corpus screened out as not being name word groups may in fact be name word groups. In the embodiments of the invention, this part of the name corpus can be screened manually, and meaningless words that can serve as name word groups are written to the name word stock, to ensure the precision of the name word groups in the word stock. For example, "Sanfeng" in "Zhang Sanfeng" is not a meaningful word; through manual screening it is accepted as a name word group and written to the name word stock.
Derogatory words or words with a bad meaning, such as "bad" or "wretch", generally do not appear as name word groups but may appear in the collected name corpus. They can be identified and removed through manual screening, or removed by collecting a database of such words.
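The screening rules of step S103 might be combined as in the sketch below. Everything here is an illustrative assumption: the function name `screen_candidates`, its parameters, and the sample data, where 长寿 and 坏蛋 are assumed renderings of "longevity" and "wretch" and 三去 is an assumed rendering of "San go".

```python
def screen_candidates(given_names, reference_words, blocklist,
                      user_made=frozenset()):
    """Split candidate given names into accepted ones and ones for manual review."""
    accepted, needs_review = set(), set()
    for given in given_names:
        if given in blocklist:
            # Derogatory words are dropped outright.
            continue
        if (len(given) == 1               # single-character given name
                or given in user_made     # created by a user
                or given in reference_words):  # meaningful word in a dictionary
            accepted.add(given)
        else:
            # Not found in the reference dictionary: left for manual screening.
            needs_review.add(given)
    return accepted, needs_review


accepted, needs_review = screen_candidates(
    ["三", "三去", "三丰", "长寿", "坏蛋"],
    reference_words={"长寿"},
    blocklist={"坏蛋"},
)
```

In this run "三" and "长寿" are accepted automatically, while "三去" and "三丰" are queued for review; a human reviewer would then keep "三丰" (Sanfeng) and discard "三去", as the description suggests.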
In step S104, the word frequency of the name word groups in the name word stock is adjusted.
In the embodiments of the invention, a relatively comprehensive name word stock can be obtained in the manner described above. The word frequency of each name word group in the name word stock is then adjusted according to the statistical probability with which the name word group occurs, to ensure the quality of the name word groups.
In a specific implementation, the background server segments the collected user word stocks into surnames and given names and counts the probability with which each name word group occurs across different user word stocks. This probability serves as the basis for the frequency of the name word group in the input method, and the word frequency in the name word stock is adjusted accordingly. For example, if a name word group occurs in many user word stocks, it is a high-frequency name word group that many users care about; it is then ranked near the front of the name word stock and is output first for the user to select during input.
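One plausible reading of "the probability with which a name word group occurs across different user word stocks" is the fraction of user word stocks that contain it. The sketch below follows that reading; the description does not give a formula, so the definition, the function names `name_frequencies` and `rank_by_frequency`, and the sample data are assumptions.

```python
from collections import Counter


def name_frequencies(user_stocks):
    """Map each name word group to the fraction of user word stocks containing it."""
    counts = Counter()
    for stock in user_stocks:
        # Count each name word group at most once per user.
        counts.update(set(stock))
    total = len(user_stocks)
    return {name: n / total for name, n in counts.items()}


def rank_by_frequency(freqs):
    """Order name word groups so the most widely used come first."""
    return sorted(freqs, key=lambda name: (-freqs[name], name))


stocks = [{"三丰", "长寿"}, {"三丰"}, {"长寿", "三丰"}, {"伟"}]
freqs = name_frequencies(stocks)
ranking = rank_by_frequency(freqs)
```

Counting per user rather than per occurrence keeps one user who types the same name many times from dominating the ranking, which matches the emphasis on names that "many users care about".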
In addition, naming habits can be taken into account. Because people generally consider how easily a name can be spoken aloud when choosing it, names whose characters all have the same tone, for example, are rarely used. Such data can serve only as a reference: when a word of this kind appears as a name, its frequency in the name word stock can be turned down appropriately.
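The same-tone heuristic could be expressed as a small adjustment function. This is only a sketch: the description does not say how far the frequency is turned down, so the `penalty` factor of 0.5 and the representation of tones as a tuple of integers (1 to 4, one per character) are assumptions, as is the function name `adjust_for_tones`.

```python
def adjust_for_tones(frequency, tones, penalty=0.5):
    """Lower the frequency of a name whose characters all share one tone."""
    if len(tones) > 1 and len(set(tones)) == 1:
        return frequency * penalty
    return frequency
```

A name with tones `(1, 1, 1)` would have its frequency halved, while one with tones `(1, 3, 1)` would be left unchanged. Obtaining the tones for each character would require a pronunciation table, which is outside the scope of this sketch.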
In the embodiments of the invention, once the word frequency adjustment of the name word stock is complete, the adjusted name word stock can be delivered to users as an update.
The name word stock can be provided as a separate file for users to download as an update. In a specific implementation, a version number can be written into the name word stock according to some rule. For example, it can be expressed in major.minor form, as a single number that is incremented continually, or as the generation date of the name word stock.
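The three version-number schemes mentioned above can all be reduced to values that compare correctly. The helper names `parse_version` and `date_version` below are hypothetical, and the `YYYYMMDD` layout for date-based versions is an assumption; the description only says the generation date may be used.

```python
from datetime import date


def parse_version(text):
    """Parse 'major.minor' or a plain integer string into a comparable tuple."""
    return tuple(int(part) for part in text.split("."))


def date_version(generated_on):
    """Build a version string from the generation date of the name word stock."""
    return generated_on.strftime("%Y%m%d")
```

Parsing into integer tuples avoids the pitfall of comparing version strings directly, where "1.10" would sort before "1.9". Versions are only comparable within one scheme; mixing major.minor and date-based versions in the same deployment would not be meaningful.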
After the input method starts, it calls its automatic update program to communicate with the background server. The background server checks the version number of the name word stock and, where necessary, judges whether other update conditions are met; for example, some versions of the input method may not need the update or may not be able to update the name word stock.
When the check between the automatic update program and the background server finds that an update is needed, the name word stock is downloaded from the background server to the local machine and overwrites the name word stock file from the original local installation package.
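The update decision and the overwrite step might look like the following sketch. The network exchange with the background server is omitted; the function names, the `unsupported_builds` parameter (standing in for "other update conditions"), and the use of tuples for versions are all assumptions made for illustration.

```python
import os


def needs_update(local_version, server_version, client_build,
                 unsupported_builds=frozenset()):
    """Decide whether the client should download a newer name word stock."""
    if client_build in unsupported_builds:
        # Some input method versions cannot or need not update the word stock.
        return False
    return server_version > local_version


def install_name_stock(downloaded_path, local_path):
    """Overwrite the local name word stock file with the downloaded one."""
    os.replace(downloaded_path, local_path)
```

`os.replace` is used here because it overwrites an existing destination file in a single step, so the input method is less likely to observe a half-written name word stock.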
Fig. 2 shows the structure of the name word stock generating device provided by an embodiment of the invention. For ease of explanation, only the parts relevant to the embodiment are shown.
The name word stock generating device can run on the background server of various text input systems. It extracts a name corpus from the local user word stocks of the input method, screens it to generate a name word stock, trains and optimizes the name word stock, and can deliver the optimized name word stock as an update to users of the input method, improving the reasonableness and accuracy of name word groups.
The user word stock acquiring unit 21 obtains the user word stock, which may be reported automatically by the input method or actively by the user.
The name corpus extraction unit 22 extracts the name corpus from the obtained user word stock.
In one embodiment of the invention, the surname dictionary 221 stores surname information, and the name corpus search and extraction module 222 searches the user word stock and extracts the name corpus according to the surnames in the surname dictionary 221 and the naming rules. The specific implementation is as described above and is not repeated here.
The name word stock generation unit 23 screens name word groups from the name corpus extracted by the name corpus extraction unit 22 and generates the name word stock 24. The specific implementation is as described above and is not repeated here.
The name word stock 24 stores the name word groups screened by the name word stock generation unit 23.
In one embodiment of the invention, the name word groups in the name word stock 24 include single-character given names, single-character given names following two stacked surnames, name word groups created by users, meaningful words, and meaningless words accepted through manual screening. The specific implementation is as described above and is not repeated here.
The word frequency adjustment unit 25 adjusts the word frequency of the name word groups in the name word stock 24.
In one embodiment of the invention, when the word frequency adjustment unit 25 adjusts the word frequency of the name word groups in the name word stock 24, the name segmentation module 251 segments surnames and given names according to the collected user word stocks and the surname dictionary 221.
The occurrence probability statistics module 252 counts the probability with which each name word group occurs across different user word stocks.
The word frequency adjusting module 253 adjusts the word frequency of the name word groups in the name word stock 24 according to the probability with which they occur across different user word stocks.
In one embodiment of the invention, once the word frequency adjustment of the name word stock is complete, the adjusted name word stock can be delivered to users as an update.
The name word stock updating unit 26 delivers the name word stock with adjusted word frequencies to users. The update process is as described above and is not repeated here.
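To show how the units of Fig. 2 fit together, the sketch below models the device as a small class whose units are injected as callables. The class name `NameWordStockGenerator`, its field names, and the toy callables in the usage example are hypothetical; only the unit numbers in the comments come from the description.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class NameWordStockGenerator:
    # Name corpus extraction unit 22: word stock entry -> candidate given names.
    extract: Callable[[str], List[str]]
    # Name word stock generation unit 23: corpus -> accepted name word groups.
    screen: Callable[[List[str]], Set[str]]
    # Word frequency adjustment unit 25: groups + per-user corpora -> frequencies.
    adjust: Callable[[Set[str], List[List[str]]], Dict[str, float]]
    # Name word stock 24: stores name word groups with their frequencies.
    name_word_stock: Dict[str, float] = field(default_factory=dict)

    def run(self, user_stocks: List[List[str]]) -> Dict[str, float]:
        """Process user word stocks already obtained by acquiring unit 21."""
        per_user = [[c for entry in stock for c in self.extract(entry)]
                    for stock in user_stocks]
        corpus = [c for names in per_user for c in names]
        groups = self.screen(corpus)
        self.name_word_stock = self.adjust(groups, per_user)
        return self.name_word_stock


# Toy units, for illustration only.
device = NameWordStockGenerator(
    extract=lambda e: [e[1:]] if e.startswith("张") else [],
    screen=lambda corpus: {c for c in corpus if len(c) == 1},
    adjust=lambda groups, per_user: {
        g: sum(g in names for names in per_user) / len(per_user)
        for g in groups
    },
)
result = device.run([["张三", "你好"], ["张三"], ["张伟"]])
```

Passing the units in as callables mirrors the modular structure of the device: each unit can be replaced or refined independently, for example by swapping in a stricter screening rule, without touching the rest of the pipeline.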
The embodiments of the invention extract a name corpus from user word stocks and screen name word groups to build a name word stock. Because the name corpus is no longer limited to a fixed source, a better name word stock can be generated, improving the reasonableness and accuracy of the input method when it outputs names. In addition, delivering the name word stock to users as an update makes it convenient to use.
The above are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (11)

1. A name word stock generating method, characterized in that the method comprises the following steps:
obtaining a user word stock;
extracting a name corpus from the obtained user word stock;
screening name word groups from the name corpus to generate a name word stock;
adjusting the word frequency of the name word groups in the name word stock.
2. The method of claim 1, characterized in that the step of extracting a name corpus from the obtained user word stock specifically comprises:
generating a surname dictionary;
searching the user word stock and extracting the name corpus according to the surnames in the surname dictionary and the naming rules.
3. The method of claim 1, characterized in that the step of screening name word groups from the name corpus to generate a name word stock specifically comprises:
when the name corpus contains a single-character given name, writing the single-character given name to the name word stock as a name word group;
when the name corpus contains a single-character given name following two stacked surnames, writing the single-character given name to the name word stock as a name word group;
when the name corpus contains a name word group created by a user, writing that name word group to the name word stock;
when the name corpus contains two-character candidates, screening out meaningful words, or meaningless words accepted through manual screening, and writing them to the name word stock as name word groups.
4. The method of claim 1, characterized in that the step of adjusting the word frequency of the name word groups in the name word stock specifically comprises:
segmenting surnames and given names according to the collected user word stocks;
counting the probability with which each name word group occurs across different user word stocks;
adjusting the word frequency of the name word group in the name word stock according to the probability with which it occurs across different user word stocks.
5. The method of claim 1, characterized in that the method further comprises the following step:
delivering the name word stock with adjusted word frequencies to users as an update.
6. A name word stock generating device, characterized in that the device comprises:
a user word stock acquiring unit, used to obtain a user word stock;
a name corpus extraction unit, used to extract a name corpus from the obtained user word stock;
a name word stock generation unit, used to screen name word groups from the name corpus to generate a name word stock;
a name word stock, used to store the name word groups; and
a word frequency adjustment unit, used to adjust the word frequency of the name word groups in the name word stock.
7. The device of claim 6, characterized in that the name corpus extraction unit comprises:
a surname dictionary, used to store surname information; and
a name corpus search and extraction module, used to search the user word stock and extract the name corpus according to the surnames in the surname dictionary and the naming rules.
8. The device of claim 6, characterized in that the name word groups in the name word stock comprise single-character given names, single-character given names following two stacked surnames, name word groups created by users, meaningful words, or meaningless words accepted through manual screening.
9. The device of claim 6, characterized in that the word frequency adjustment unit comprises:
a name segmentation module, used to segment surnames and given names according to the collected user word stocks;
an occurrence probability statistics module, used to count the probability with which each name word group occurs across different user word stocks; and
a word frequency adjusting module, used to adjust the word frequency of the name word group in the name word stock according to the probability with which it occurs across different user word stocks.
10. The device of claim 6, characterized in that the device further comprises:
a name word stock updating unit, used to deliver the name word stock with adjusted word frequencies to users as an update.
11. A text input system comprising the name word stock generating device of claim 6.
CN201010118024.9A 2010-03-04 2010-03-04 A kind of name word stock generating method, device and character input system Active CN102193920B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010118024.9A CN102193920B (en) 2010-03-04 2010-03-04 A kind of name word stock generating method, device and character input system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010118024.9A CN102193920B (en) 2010-03-04 2010-03-04 A kind of name word stock generating method, device and character input system

Publications (2)

Publication Number Publication Date
CN102193920A true CN102193920A (en) 2011-09-21
CN102193920B CN102193920B (en) 2016-01-20

Family

ID=44602003

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010118024.9A Active CN102193920B (en) 2010-03-04 2010-03-04 A kind of name word stock generating method, device and character input system

Country Status (1)

Country Link
CN (1) CN102193920B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722525A (en) * 2012-05-15 2012-10-10 北京百度网讯科技有限公司 Methods and systems for establishing language model of address book names and searching voice
CN103076894A (en) * 2012-12-31 2013-05-01 百度在线网络技术(北京)有限公司 Method and equipment for building input entries for object identity information according to object identity information
CN104023124A (en) * 2014-05-14 2014-09-03 上海卓悠网络科技有限公司 Method and device for automatically identifying and extracting a name in short message
CN106156051A (en) * 2015-03-27 2016-11-23 深圳市腾讯计算机系统有限公司 Build the method and device of name language material identification model
CN108399013A (en) * 2018-03-16 2018-08-14 北京搜狗科技发展有限公司 A kind of user's word adding method and device
CN109814732A (en) * 2018-12-29 2019-05-28 平安科技(深圳)有限公司 The treating method and apparatus and contacts list treating method and apparatus of dictionary
CN110781288A (en) * 2019-10-30 2020-02-11 安阳师范学院 Method and device for composing words by Chinese characters
CN110990521A (en) * 2019-12-05 2020-04-10 李城华 Name word bank generating method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2002117028A (en) * 2000-10-05 2002-04-19 Nippon Telegr & Teleph Corp <Ntt> Device and method for dictionary generation and recording medium with recorded dictionary generating program
CN1924858A (en) * 2006-08-09 2007-03-07 北京搜狗科技发展有限公司 Method and device for fetching new words and input method system
CN101359254A (en) * 2007-08-03 2009-02-04 北京搜狗科技发展有限公司 Character input method and system for enhancing input efficiency of name entry

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722525A (en) * 2012-05-15 2012-10-10 北京百度网讯科技有限公司 Methods and systems for establishing language model of address book names and searching voice
CN103076894A (en) * 2012-12-31 2013-05-01 百度在线网络技术(北京)有限公司 Method and equipment for building input entries for object identity information according to object identity information
CN103076894B (en) * 2012-12-31 2016-05-18 百度在线网络技术(北京)有限公司 A kind of for build the method and apparatus of input entry according to object id information
CN104023124A (en) * 2014-05-14 2014-09-03 上海卓悠网络科技有限公司 Method and device for automatically identifying and extracting a name in short message
CN106156051A (en) * 2015-03-27 2016-11-23 深圳市腾讯计算机系统有限公司 Build the method and device of name language material identification model
CN106156051B (en) * 2015-03-27 2019-08-13 深圳市腾讯计算机系统有限公司 Construct the method and device of name corpus identification model
CN108399013A (en) * 2018-03-16 2018-08-14 北京搜狗科技发展有限公司 A kind of user's word adding method and device
CN109814732A (en) * 2018-12-29 2019-05-28 平安科技(深圳)有限公司 The treating method and apparatus and contacts list treating method and apparatus of dictionary
CN110781288A (en) * 2019-10-30 2020-02-11 安阳师范学院 Method and device for composing words by Chinese characters
CN110990521A (en) * 2019-12-05 2020-04-10 李城华 Name word bank generating method

Also Published As

Publication number Publication date
CN102193920B (en) 2016-01-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHENZHEN SHIJI LIGHT SPEED INFORMATION TECHNOLOGY

Free format text: FORMER OWNER: TENGXUN SCI-TECH (SHENZHEN) CO., LTD.

Effective date: 20131101

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518044 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20131101

Address after: A Tencent Building in Shenzhen Nanshan District City, Guangdong streets in Guangdong province science and technology 518057 16

Applicant after: Shenzhen Shiji Guangsu Information Technology Co., Ltd.

Address before: Shenzhen Futian District City, Guangdong province 518044 Zhenxing Road, SEG Science Park 2 East Room 403

Applicant before: Tencent Technology (Shenzhen) Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant