CN105488030A - Method and device for obtaining positive Chinese characters - Google Patents

Method and device for obtaining positive Chinese characters

Info

Publication number
CN105488030A
Authority
CN
China
Prior art keywords
chinese character
positive polarity
name class
name
class vocabulary
Prior art date
Legal status
Pending
Application number
CN201510873465.2A
Other languages
Chinese (zh)
Inventor
徐戈
关胤
吴拥民
刘德建
陈宏展
Current Assignee
Fujian TQ Digital Co Ltd
Original Assignee
Fujian TQ Digital Co Ltd
Application filed by Fujian TQ Digital Co Ltd
Priority to CN201510873465.2A
Publication of CN105488030A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a method for obtaining positive-polarity Chinese characters. The method comprises the following steps: obtaining all person-name words from a given text; removing common words from the obtained person-name words; and collecting the remaining Chinese characters that occur with high frequency into a positive-polarity character set. The invention further provides a device for obtaining positive-polarity Chinese characters that implements the method. With this technical scheme, Chinese characters that carry positive polarity, whether explicit or implicit, can be found quickly, providing a high-quality sentiment resource for Chinese characters.

Description

Method and device for obtaining positive-polarity Chinese characters
Technical field
The present invention relates to the field of software, and in particular to a method and device for obtaining positive-polarity Chinese characters from text material.
Background art
Sentiment resources for Chinese characters are an important part of natural language processing. Because the number of Chinese characters (especially commonly used ones) is not very large, manual annotation can be considered for labeling their sentiment. It is generally accepted that descriptive characters (such as "beautiful" and "kind") and some verb characters (such as those meaning "love") carry positive polarity. However, some characters that appear neutral can also carry implicit positive polarity, for example "sea", "sky", and "fly". An ordinary human annotator has difficulty perceiving such subtle sentiment from a single character and therefore cannot meet the annotation requirements.
The prior art mainly relies on manual annotation to obtain positive-polarity Chinese characters. However, Chinese characters have developed and evolved over several thousand years, and some conceptual differences are very subtle, so ordinary manual annotation cannot achieve the required quality. For example, two characters that both translate as "jump" are semantically similar, yet one carries positive polarity while the other is a neutral concept.
Summary of the invention
Therefore, a technical scheme is needed that can quickly find Chinese characters with explicit or implicit positive polarity, so as to provide a usable, high-quality sentiment resource for Chinese characters.
To achieve the above object, the inventors provide a method for obtaining positive-polarity Chinese characters, comprising the steps of:
obtaining all person-name words from a given text;
removing common words from the obtained person-name words;
collecting the high-frequency characters among the remaining Chinese characters into a positive-polarity character set.
Further, in the method for described acquisition positive polarity Chinese character, step " is removed public words " and is specifically comprised from the name class vocabulary obtained: remove the surname in name class vocabulary.
Further, in the method for described acquisition positive polarity Chinese character, step " obtains all name class vocabulary " and specifically comprises from given written material: to given written material, with part of speech annotation tool, participle and part-of-speech tagging are carried out to it, and according to part-of-speech tagging result acquisition name class vocabulary wherein.
Further, in the method for described acquisition positive polarity Chinese character, " frequency of occurrences height person in remaining Chinese character is collected into positive polarity character set " in step specifically to comprise: add up the frequency of occurrences of each Chinese character in remaining Chinese character and by it by sorting from high to low, positive polarity character set listed in the Chinese character frequency of occurrences being positioned at front preset ratio.
The inventors also provide a device for obtaining positive-polarity Chinese characters, comprising a name acquiring unit, a common word removal unit, and a statistic unit;
the name acquiring unit is used to obtain all person-name words from a given text;
the common word removal unit is used to remove common words from the obtained person-name words;
the statistic unit is used to count the high-frequency characters among the remaining Chinese characters and collect them into a positive-polarity character set.
Further, in the device for obtaining positive-polarity Chinese characters, the removal of common words from the obtained person-name words by the common word removal unit specifically comprises removing the surnames from the person-name words.
Further, in the device for obtaining positive-polarity Chinese characters, the obtaining of all person-name words from a given text by the name acquiring unit specifically comprises: performing word segmentation and part-of-speech tagging on the given text with a part-of-speech tagging tool, and obtaining the person-name words according to the tagging result.
Further, in the device for obtaining positive-polarity Chinese characters, the counting and collecting performed by the statistic unit specifically comprises: counting the frequency of occurrence of each remaining Chinese character, sorting the characters from highest to lowest frequency, and adding the characters ranked within a preset top proportion to the positive-polarity character set.
Unlike the prior art, the above technical scheme can find, from any given fragment of text, positive-polarity Chinese characters whose subtle sentiment an ordinary human annotator could hardly perceive from a single character and therefore could not annotate to the required standard, so that they can be further used as a high-quality data resource.
Brief description of the drawings
Fig. 1 is a flowchart of the method for obtaining positive-polarity Chinese characters according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the device for obtaining positive-polarity Chinese characters according to an embodiment of the present invention.
Description of reference numerals:
1 - name acquiring unit
2 - common word removal unit
3 - statistic unit
Detailed description of the embodiments
To explain in detail the technical content, structural features, objects achieved, and effects of the technical scheme, a detailed description is given below with reference to specific embodiments and the accompanying drawings.
Referring to Fig. 1, which is a flowchart of the method for obtaining positive-polarity Chinese characters according to an embodiment of the present invention, the method comprises the following steps:
S1: obtaining all person-name words from a given text;
S2: removing common words from the obtained person-name words;
S3: collecting the high-frequency characters among the remaining Chinese characters into a positive-polarity character set.
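Steps S1 to S3 can be sketched end to end as follows. This is a minimal illustration rather than the patented implementation: it assumes step S1 has already produced a list of name strings, and the function name `collect_positive_characters`, the sample names, and the tiny surname set are all hypothetical.

```python
from collections import Counter

def collect_positive_characters(names, surnames, top_n):
    """Return the top_n most frequent given-name characters.

    names:    name strings already extracted from the text (step S1)
    surnames: set of single-character surnames to strip (step S2)
    top_n:    how many high-frequency characters to keep (step S3)
    """
    counts = Counter()
    for name in names:
        # S2: drop the leading character when it is a known surname.
        given = name[1:] if name[:1] in surnames else name
        # Count each remaining character of the given name.
        counts.update(given)
    # S3: keep the characters with the highest frequency.
    return [ch for ch, _ in counts.most_common(top_n)]

# Hypothetical sample data, for illustration only.
sample_names = ["张伟", "李伟", "王芳", "张芳", "李伟明"]
sample_surnames = {"张", "李", "王"}
top = collect_positive_characters(sample_names, sample_surnames, top_n=2)
```

This simplified version handles only one-character surnames; compound surnames and title suffixes need the extra handling described in the following paragraphs.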
The common words in step S2 can generally be regarded as mainly comprising common prefixes and common suffixes. In person-name words, the most frequent case is the common prefix, namely the surname. For example, when the obtained person-name words are "Zhang XX", "Li XX", "Ouyang XX", and so on, the surnames "Zhang", "Li", and "Ouyang" are clearly common prefixes and need to be removed. Specifically, the surname in each name is removed according to an existing surname list in a database, so that only the given name remains and the influence of the surname is eliminated.
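Surname removal against a surname list might look like the sketch below. The longest-match rule is an assumption added here so that a two-character surname such as 欧阳 (Ouyang) is removed whole; the source only says that surnames are removed according to a list. The helper name `strip_surname` and the sample list are hypothetical.

```python
def strip_surname(name, surnames):
    """Remove the longest known surname from the start of `name`.

    Longer surnames are tried first so that a compound surname such as
    "欧阳" is removed whole instead of being cut after its first character.
    """
    for surname in sorted(surnames, key=len, reverse=True):
        # Only strip when something is left over to serve as the given name.
        if name.startswith(surname) and len(name) > len(surname):
            return name[len(surname):]
    return name  # no known surname found; leave the name unchanged

# Hypothetical surname list; a real one would come from a database.
SURNAMES = {"张", "李", "欧", "欧阳"}
given_name = strip_surname("欧阳修", SURNAMES)
```

Sorting on every call keeps the sketch short; a real implementation would likely sort the list once up front.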
In some embodiments, the step of "obtaining all person-name words from a given text" specifically comprises: performing word segmentation and part-of-speech tagging on the given text with a part-of-speech tagging tool, and obtaining the person-name words according to the tagging result. In other embodiments, other common ways of obtaining person-name words from a given text can also be used. Whether person-name words are obtained by word segmentation and part-of-speech tagging as in these embodiments or by other means, a certain level of accuracy is required. For example, "Mr. Zhang" is not a typical person-name word usable for collecting positive-polarity Chinese characters in the present invention. When the results of the name-acquisition means are not accurate enough in this sense, the common-word removal of step S2 needs to remove not only common prefixes (surnames) but also common suffixes (forms of address). In this case, the common suffixes can be removed using a common suffix list, which may contain words that commonly follow a surname, a given name, or a full name, such as "Mr.", "Miss", and "Teacher".
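The sketch below illustrates filtering tagger output for names and then stripping a trailing title. It assumes the tagger returns `(word, tag)` pairs and marks person names with the tag `nr`, a convention found in several Chinese part-of-speech tag sets; the tag used by any particular tool may differ. The function names, the title list, and the sample tokens are hypothetical, and no real tagger is called.

```python
# Assumed tag for person names; check the tag set of the tagger actually used.
PERSON_TAG = "nr"

# Hypothetical list of titles that may follow a surname or a name.
TITLE_SUFFIXES = ("先生", "小姐", "老师")

def extract_person_names(tagged_tokens, person_tag=PERSON_TAG):
    """Keep the words whose part-of-speech tag marks them as person names."""
    return [word for word, tag in tagged_tokens if tag == person_tag]

def strip_title_suffix(name, suffixes=TITLE_SUFFIXES):
    """Remove one trailing title such as "先生" if present."""
    for suffix in suffixes:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name

# Made-up tagger output, for illustration only.
tagged = [("张先生", "nr"), ("昨天", "t"), ("见到", "v"), ("李晓明", "nr")]
names = [strip_title_suffix(n) for n in extract_person_names(tagged)]
```

Here "张先生" is reduced to the bare surname "张", so once surnames are removed it contributes no characters to the count, which matches the intent of treating such phrases as inaccurate name candidates.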
Further, in some embodiments, "collecting the high-frequency characters among the remaining Chinese characters into a positive-polarity character set" in step S3 specifically comprises: counting the frequency of occurrence of each remaining Chinese character, sorting the characters from highest to lowest frequency, and adding the characters ranked within a preset top proportion to the positive-polarity character set. For example, with a preset proportion of 10%, the top 10% of characters after sorting by frequency from highest to lowest are added to the positive-polarity character set. Alternatively, in some embodiments, a fixed number of positive-polarity characters can be preset; for example, the first 100 characters after sorting by frequency from highest to lowest are added to the positive-polarity character set.
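Both selection rules, a preset proportion and a preset count, can be sketched with one helper. The name `select_top_characters` and the sample frequencies are hypothetical, and rounding the proportion to the nearest whole number of characters is an assumption, since the source does not say how to round.

```python
from collections import Counter

def select_top_characters(counts, ratio=None, count=None):
    """Pick the highest-frequency characters by proportion or by fixed count.

    counts: Counter mapping each character to its frequency
    ratio:  keep the top fraction of distinct characters, e.g. 0.10
    count:  keep a fixed number of characters, e.g. 100
    Exactly one of `ratio` and `count` must be given.
    """
    if (ratio is None) == (count is None):
        raise ValueError("give exactly one of ratio or count")
    if ratio is not None:
        # Rounding to the nearest integer is an assumption of this sketch.
        count = round(len(counts) * ratio)
    # most_common sorts from highest to lowest frequency.
    return {ch for ch, _ in counts.most_common(count)}

# Made-up frequencies, for illustration only.
sample_counts = Counter({"明": 5, "华": 3, "伟": 2, "之": 1})
by_ratio = select_top_characters(sample_counts, ratio=0.5)
by_count = select_top_characters(sample_counts, count=1)
```

Characters with equal frequency at the cutoff are kept or dropped according to their order in the counter; a stricter tie rule would need to be chosen separately.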
Referring to Fig. 2, which is a schematic structural diagram of the device for obtaining positive-polarity Chinese characters according to an embodiment of the present invention, the device comprises a name acquiring unit 1, a common word removal unit 2, and a statistic unit 3;
the name acquiring unit 1 is used to obtain all person-name words from a given text;
the common word removal unit 2 is used to remove common words from the obtained person-name words;
the statistic unit 3 is used to count the high-frequency characters among the remaining Chinese characters and collect them into a positive-polarity character set.
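One way to picture how the three units fit together is a small container that chains them. This is only a structural sketch: modeling each unit as a plain callable is an assumption, the class name `PositiveCharacterDevice` is hypothetical, and the three toy functions are stand-ins rather than realistic unit implementations.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set

@dataclass
class PositiveCharacterDevice:
    """Chains the three units; each unit is modeled as a plain callable."""
    name_acquiring_unit: Callable[[str], List[str]]        # unit 1
    common_word_removal_unit: Callable[[str], str]         # unit 2
    statistic_unit: Callable[[Iterable[str]], Set[str]]    # unit 3

    def run(self, text: str) -> Set[str]:
        names = self.name_acquiring_unit(text)
        stripped = [self.common_word_removal_unit(n) for n in names]
        return self.statistic_unit(stripped)

# Toy stand-ins for the three units, for illustration only.
def toy_acquire(text):
    return text.split()  # pretends the text is a space-separated name list

def toy_remove(name):
    return name[1:]      # pretends every name has a one-character surname

def toy_statistics(given_names):
    counts = Counter("".join(given_names))
    return {ch for ch, _ in counts.most_common(1)}

device = PositiveCharacterDevice(toy_acquire, toy_remove, toy_statistics)
result = device.run("张伟 李伟 王芳")
```

Keeping the units separate lets a different name extractor or a different cutoff rule be swapped in without touching the other two.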
The common words to be removed by the common word removal unit 2 can generally be regarded as mainly comprising common prefixes and common suffixes. In person-name words, the most frequent case is the common prefix, namely the surname. For example, when the obtained person-name words are "Zhang XX", "Li XX", "Ouyang XX", and so on, the surnames "Zhang", "Li", and "Ouyang" are clearly common prefixes and need to be removed. Specifically, the surname in each name is removed according to an existing surname list in a database, so that only the given name remains and the influence of the surname is eliminated.
In some embodiments, the obtaining of all person-name words from a given text by the name acquiring unit 1 specifically comprises: performing word segmentation and part-of-speech tagging on the given text with a part-of-speech tagging tool, and obtaining the person-name words according to the tagging result. In other embodiments, other common ways of obtaining person-name words from a given text can also be used. Whether person-name words are obtained by word segmentation and part-of-speech tagging as in these embodiments or by other means, a certain level of accuracy is required. For example, the phrase "Mr. Zhang" is not a typical person-name word usable for collecting positive-polarity Chinese characters in the present invention. When the results of the name-acquisition means are not accurate enough in this sense, the common word removal unit 2 needs to remove not only common prefixes (surnames) but also common suffixes (forms of address). In this case, the common suffixes can be removed using a common suffix list, which may contain words that commonly follow a surname, a given name, or a full name, such as "Mr.", "Miss", and "Teacher".
Further, in some embodiments, the collecting of high-frequency characters into a positive-polarity character set by the statistic unit 3 specifically comprises: counting the frequency of occurrence of each remaining Chinese character, sorting the characters from highest to lowest frequency, and adding the characters ranked within a preset top proportion to the positive-polarity character set. For example, with a preset proportion of 10%, the statistic unit 3 sorts the characters by frequency from highest to lowest and adds the top 10% to the positive-polarity character set. Alternatively, in some embodiments, a fixed number of positive-polarity characters can be preset; for example, the first 100 characters after sorting by frequency from highest to lowest are added to the positive-polarity character set.
Unlike the prior art, the above technical scheme can find, from any given fragment of text, positive-polarity Chinese characters whose subtle sentiment an ordinary human annotator could hardly perceive from a single character and therefore could not annotate to the required standard, so that they can be further used as a high-quality data resource.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that comprises a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the statement "comprising ..." or "including ..." does not exclude the existence of additional elements in the process, method, article, or terminal device that comprises the element. In addition, in this document, "greater than", "less than", "exceeding", and the like are understood as excluding the stated number; "above", "below", "within", and the like are understood as including the stated number.
Those skilled in the art should understand that the above embodiments can be provided as a method, a device, or a computer program product. These embodiments can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. All or some of the steps in the methods of the above embodiments can be completed by a program instructing the relevant hardware; the program can be stored in a storage medium readable by a computer device and is used to perform all or some of the steps described in the methods of the above embodiments. The computer device includes but is not limited to: personal computers, servers, general-purpose computers, special-purpose computers, network devices, embedded devices, programmable devices, intelligent mobile terminals, smart home devices, wearable smart devices, in-vehicle smart devices, and the like. The storage medium includes but is not limited to: RAM, ROM, magnetic disks, magnetic tape, optical discs, flash memory, USB flash drives, portable hard drives, memory cards, memory sticks, network server storage, network cloud storage, and the like.
The above embodiments are described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a computer device to produce a machine, so that the instructions executed by the processor of the computer device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be stored in a computer-device-readable memory that can direct the computer device to operate in a specific manner, so that the instructions stored in that memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions can also be loaded onto a computer device, so that a series of operational steps is performed on the computer device to produce a computer-implemented process, whereby the instructions executed on the computer device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the above embodiments have been described, those skilled in the art, once they have learned the basic inventive concept, can make other changes and modifications to these embodiments. The foregoing is therefore only a description of embodiments of the present invention and does not limit the scope of patent protection of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (8)

1. A method for obtaining positive-polarity Chinese characters, characterized by comprising the steps of:
obtaining all person-name words from a given text;
removing common words from the obtained person-name words;
collecting the high-frequency characters among the remaining Chinese characters into a positive-polarity character set.
2. The method for obtaining positive-polarity Chinese characters according to claim 1, characterized in that the step of "removing common words from the obtained person-name words" specifically comprises: removing the surnames from the person-name words.
3. The method for obtaining positive-polarity Chinese characters according to claim 1 or 2, characterized in that the step of "obtaining all person-name words from a given text" specifically comprises: performing word segmentation and part-of-speech tagging on the given text with a part-of-speech tagging tool, and obtaining the person-name words according to the tagging result.
4. The method for obtaining positive-polarity Chinese characters according to claim 1 or 2, characterized in that the step of "collecting the high-frequency characters among the remaining Chinese characters into a positive-polarity character set" specifically comprises: counting the frequency of occurrence of each remaining Chinese character, sorting the characters from highest to lowest frequency, and adding the characters ranked within a preset top proportion to the positive-polarity character set.
5. A device for obtaining positive-polarity Chinese characters, characterized by comprising a name acquiring unit, a common word removal unit, and a statistic unit;
the name acquiring unit is used to obtain all person-name words from a given text;
the common word removal unit is used to remove common words from the obtained person-name words;
the statistic unit is used to count the high-frequency characters among the remaining Chinese characters and collect them into a positive-polarity character set.
6. The device for obtaining positive-polarity Chinese characters according to claim 5, characterized in that the removal of common words from the obtained person-name words by the common word removal unit specifically comprises removing the surnames from the person-name words.
7. The device for obtaining positive-polarity Chinese characters according to claim 5 or 6, characterized in that the obtaining of all person-name words from a given text by the name acquiring unit specifically comprises: performing word segmentation and part-of-speech tagging on the given text with a part-of-speech tagging tool, and obtaining the person-name words according to the tagging result.
8. The device for obtaining positive-polarity Chinese characters according to claim 5 or 6, characterized in that the counting and collecting performed by the statistic unit specifically comprises: counting the frequency of occurrence of each remaining Chinese character, sorting the characters from highest to lowest frequency, and adding the characters ranked within a preset top proportion to the positive-polarity character set.
CN201510873465.2A 2015-12-02 2015-12-02 Method and device for obtaining positive Chinese characters Pending CN105488030A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510873465.2A CN105488030A (en) 2015-12-02 2015-12-02 Method and device for obtaining positive Chinese characters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510873465.2A CN105488030A (en) 2015-12-02 2015-12-02 Method and device for obtaining positive Chinese characters

Publications (1)

Publication Number Publication Date
CN105488030A true CN105488030A (en) 2016-04-13

Family

ID=55675014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510873465.2A Pending CN105488030A (en) 2015-12-02 2015-12-02 Method and device for obtaining positive Chinese characters

Country Status (1)

Country Link
CN (1) CN105488030A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104035975A (en) * 2014-05-23 2014-09-10 华东师范大学 Method utilizing Chinese online resources for supervising extraction of character relations remotely

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘晨曦: "21世纪初大学生人名研究", 《太原师范学院学报(社会科学版)》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20160413