Detailed Description of the Embodiments
Embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the following embodiments are not intended to limit the present invention, and that not every combination of the aspects described in the following embodiments is necessarily required for the means by which the present invention solves the problem. For brevity, identical structural parts or steps are given identical reference signs or numerals, and their description is not repeated.
[Hardware configuration of the information processing apparatus]
First, the hardware configuration of the information processing apparatus 1000 is described with reference to Fig. 1. The following configuration is described as an example in the present embodiment, but the information processing apparatus of the present invention is not limited to the configuration shown in Fig. 1.
Fig. 1 is a diagram showing the hardware configuration of the information processing apparatus 1000 in the present embodiment. In the present embodiment, a smartphone is described as an example of the information processing apparatus. Note that although a mobile terminal (including but not limited to a smartphone, a smart watch, a smart band, or a music player) is illustrated as the information processing apparatus 1000 in the present embodiment, the invention is of course not limited to this. The information processing apparatus of the present invention may be any of various devices, such as a notebook computer, a tablet computer, a PDA (personal digital assistant), a PC, or an Internet-connected appliance with a touch display screen and an information processing function (for example, a digital camera, a refrigerator, or a television set).
As shown in Fig. 1, the information processing apparatus 1000 (2000, 3000) includes an input interface 102, a CPU 103, a ROM 104, a RAM 105, an external memory 106, an output interface 107, a display 108, a communication unit 109, and a short-range wireless communication unit 110, which are connected to one another via a system bus. The input interface 102 is an interface for receiving data input by the user and instructions to execute functions; it receives the data and operation instructions that the user inputs via an operating unit (not shown) such as keys, buttons, or a touch screen. Note that the display 108 described later and the operating unit may be at least partially integrated; for example, a configuration may be used in which screen output and reception of user operations take place on the same screen.
The CPU 103 is a system control unit and controls the information processing apparatus 1000 as a whole. For example, the CPU 103 performs display control of the display 108 of the information processing apparatus 1000. The ROM 104 stores fixed data such as data tables, as well as the control programs and the operating system (OS) program executed by the CPU 103. In the present embodiment, each control program stored in the ROM 104 performs software execution control such as scheduling, task switching, and interrupt handling under the management of the OS stored in the ROM 104.
The RAM 105 (internal storage unit) is constructed of, for example, an SRAM (static random access memory) that requires a backup power supply, a DRAM, or the like. In this case, the RAM 105 can store important data such as program control variables in a non-volatile manner. A storage area for setting information of the information processing apparatus 1000, management data of the information processing apparatus 1000, and the like is also provided in the RAM 105. The RAM 105 also serves as the working memory and main memory of the CPU 103.
The external memory 106 stores, for example, a predefined dictionary, a sequence labeling model, and an application program for executing the word segmentation processing method according to the present invention. The external memory 106 also stores various programs, such as an information transmission/reception control program for transmitting to and receiving from a communication device (not shown) via the communication unit 109, and the various kinds of information used by these programs.
The output interface 107 is an interface for controlling the display 108 so that it displays information and the display screens of application programs. The display 108 is constructed of, for example, an LCD (liquid crystal display). By arranging on the display 108 a soft keyboard with keys such as numeric input keys, a mode setting key, an enter key, a cancel key, and a power key, input from the user can be received via the display 108.
The information processing apparatus 1000 performs data communication with an external device (not shown) via the communication unit 109, using a wireless communication method such as Wi-Fi (Wireless Fidelity) or Bluetooth.
The information processing apparatus 1000 can also connect wirelessly to an external device or the like within a short range via the short-range wireless communication unit 110 and perform data communication. The short-range wireless communication unit 110 communicates using a communication method different from that of the communication unit 109. For example, Bluetooth Low Energy (BLE), whose communication range is shorter than that of the communication method of the communication unit 109, can be used as the communication method of the short-range wireless communication unit 110. NFC (near-field communication) or Wi-Fi Aware, for example, can also be used as the communication method of the short-range wireless communication unit 110.
[First embodiment]
Next, the software configuration of the information processing apparatus according to the first embodiment is described with reference to Fig. 2.
As shown in Fig. 2, the information processing apparatus 1000 includes: a selecting unit 1101, which segments a segmentation object (for example, a sentence input by the user via the touch screen) and obtains a word segmentation result expressed as a combination of multiple words; a first splicing unit 1102, which splices adjacent words in the combination; a sequence labeling unit 1103, which uses a sequence labeling model to perform sequence labeling on each word in the combination spliced by the first splicing unit 1102 and merges words in the combination according to the result of the sequence labeling; and a second splicing unit 1104, which splices, according to a predefined rule, the words in the combination merged by the sequence labeling unit 1103.
The sequence labeling unit 1103 includes: an extraction section 11031, which extracts words of a predefined type from the words in the combination spliced by the first splicing unit 1102; a prediction section 11032, which predicts word segmentation results for the extracted words according to the predefined type; a selection section 11033, which selects a word segmentation result from the predicted word segmentation results; and a merging section 11034, which is configured to merge words in the combination according to the word segmentation result selected by the selection section 11033.
The word segmentation processing method according to the first embodiment of the present invention is described below with reference to Fig. 3.
As shown in Fig. 3, the word segmentation processing method may include the following steps S101 to S104:
In step S101, the sentence to be segmented is obtained and matched against the words in a segmentation dictionary, and all the matched word combinations are taken out. The score of each combination under a segmentation strategy is then calculated, and the combination with the highest score is selected. The segmentation strategy includes a term weight and a language model score.
Next, the method proceeds to step S102. In step S102, adjacent words in the word segmentation result are spliced together; if the spliced result appears in the dictionary, it replaces the original words in the word segmentation result.
Then, the method proceeds to step S103. In step S103, the word segmentation result produced in the previous step is fed into the sequence labeling model. After sequence labeling is performed, the results of the sequence labeling model are screened, and fragments of the word segmentation result from the previous step are merged according to the screened results. The method then proceeds to step S104.
In step S104, common collocations in the word segmentation result produced in the previous step, such as numeral-classifier phrases, dates, times, and letter expressions, are spliced, and the spliced result is taken as the final word segmentation result.
The word segmentation process described above is illustrated below by segmenting the sentence "On 29 January 2016, district leaders Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, and Sun Qijun visited central units stationed in the district, such as the Ministry of Foreign Affairs and the People's Daily agency."
In step S101, basic segmentation is performed: the segmentation object is split into multiple words in different ways, forming multiple combinations of words. Each word in each combination is matched against the words in the segmentation dictionary, and all the matched word combinations are taken out.
For example, the sentence is split into the following combinations (each comma-separated item is one word):
(1) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visit, foreign affairs, ministry, People's Daily, agency, etc., stationed in district, central, unit;
(2) 2016, year, 1, month, 29, day, district, leader, Wu Gui, Ying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu, Junsheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People, Daily agency, etc., stationed in district, central unit;
(3) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hong, Zhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People, Daily agency, etc., stationed in district, central, unit.
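The way step S101 forms several candidate combinations by dictionary matching can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `enumerate_segmentations`, the `max_len` limit, the rule that a single character is always an admissible word, and the toy dictionary are all hypothetical and are not taken from the embodiment.

```python
from typing import List, Set


def enumerate_segmentations(sentence: str, dictionary: Set[str],
                            max_len: int = 4) -> List[List[str]]:
    """Return every split of `sentence` into dictionary words.

    Assumption: a single character is always accepted as a word, so at
    least one combination exists for any input.
    """
    results: List[List[str]] = []

    def walk(start: int, path: List[str]) -> None:
        if start == len(sentence):
            results.append(path.copy())
            return
        last = min(len(sentence), start + max_len)
        for end in range(start + 1, last + 1):
            piece = sentence[start:end]
            if len(piece) == 1 or piece in dictionary:
                path.append(piece)
                walk(end, path)
                path.pop()

    walk(0, [])
    return results


# Toy input: every combination that the dictionary allows is produced.
toy_dictionary = {"ab", "bc", "abc"}
combinations = enumerate_segmentations("abc", toy_dictionary)
```

Exhaustive enumeration grows quickly with sentence length, so a practical implementation would presumably prune or score candidates while enumerating; the sketch only shows where the multiple combinations come from.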
The score of each combination under the segmentation strategy is then calculated, and the combination with the highest score is selected. The segmentation strategy includes, but is not limited to, a term weight and a language model score. For the segmentation vocabulary used in this step, when the vocabulary is built, the frequency with which each word occurs in the corpus is stored along with the word itself. The term weight is the sum of the word frequencies of all the words in the current segmentation combination.
The score calculation is illustrated below with a simpler example. In the combination "I", "love", "Beijing Tiananmen", the word frequency of "I" is 130132, that of "love" is 74150, and that of "Beijing Tiananmen" is 5924, so the term weight of the combination is 130132 + 74150 + 5924 = 210206. The term weight scores of the multiple combinations are then normalized: for each combination, the highest term weight among all combinations is used as the denominator and the current term weight as the numerator. Next, the bigram language model probability of the whole combination is used as the language model score. Finally, the language model score is multiplied by the term weight score to give the final score.
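The term weight calculation and its normalization can be sketched as follows. The word frequencies are the ones given in the example above; the function names and the second weight passed to `normalize` are hypothetical and serve only to show the division by the highest weight.

```python
from typing import Dict, List


def term_weight(words: List[str], word_freq: Dict[str, int]) -> int:
    """Sum the corpus frequencies of all words in one combination."""
    return sum(word_freq.get(word, 0) for word in words)


def normalize(weights: List[int]) -> List[float]:
    """Divide each term weight by the highest term weight of all combinations."""
    top = max(weights)
    if top == 0:
        return [0.0 for _ in weights]
    return [weight / top for weight in weights]


# Frequencies from the example in the text.
word_freq = {"I": 130132, "love": 74150, "Beijing Tiananmen": 5924}
combo_weight = term_weight(["I", "love", "Beijing Tiananmen"], word_freq)

# The second weight is a made-up value for a competing combination.
normalized = normalize([combo_weight, 105103])
```

Because the highest weight is the denominator, the best combination under this strategy always receives a term weight score of 1.0.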
When combinations (1) to (3) are scored using the term weight strategy, the scores are as follows:
Combination (1): 0.7411.
Combination (2): 1.0.
Combination (3): 0.8951.
When the combinations are scored using the language model strategy, the scores are as follows:
Combination (1): 0.9013.
Combination (2): 0.7542.
Combination (3): 0.9631.
The final scores of combinations (1) to (3) are 0.6680, 0.7542, and 0.8620 respectively. The combination with the highest score, combination (3), is selected and passed on to the next step.
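The selection of the highest-scoring combination can be reproduced with a few lines. The sketch below multiplies the two scores listed above for each combination; the variable names are hypothetical, and the products agree with the final scores stated in the text only up to rounding.

```python
# Term weight scores and language model scores from the text,
# keyed by combination number.
term_scores = {1: 0.7411, 2: 1.0, 3: 0.8951}
lm_scores = {1: 0.9013, 2: 0.7542, 3: 0.9631}

# Final score = term weight score x language model score.
final_scores = {k: term_scores[k] * lm_scores[k] for k in term_scores}

# The combination with the highest final score goes on to step S102.
best = max(final_scores, key=final_scores.get)
```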
In step S102, adjacent words in the word segmentation result of combination (3) are spliced together. For example, "People" and "Daily agency" are spliced into "People's Daily agency". The word segmentation result after splicing is as follows:
2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
If a spliced result appears in the dictionary, the spliced result replaces the original words in the word segmentation result.
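The first splicing step can be sketched as a single left-to-right pass that joins two adjacent words whenever their concatenation is a dictionary entry. This is a simplified illustration, not the implementation of the embodiment: the function name `splice_adjacent` is hypothetical, only pairs of words are considered, and the Chinese tokens are an assumed rendering of "Ministry of Foreign Affairs", "People", "Daily agency", and "etc." from the example.

```python
from typing import List, Set


def splice_adjacent(words: List[str], dictionary: Set[str]) -> List[str]:
    """Join neighbouring words when the joined string is in the dictionary."""
    spliced: List[str] = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and words[i] + words[i + 1] in dictionary:
            # The spliced word replaces the two original words.
            spliced.append(words[i] + words[i + 1])
            i += 2
        else:
            spliced.append(words[i])
            i += 1
    return spliced


# Assumed tokens; the dictionary holds the full name of the newspaper agency.
tokens = ["外交部", "人民", "日报社", "等"]
result = splice_adjacent(tokens, {"人民日报社"})
```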
Step S103 includes steps S1031 to S1034, as shown in Fig. 4.
In step S1031, the words related to personal names, namely "Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun", are extracted from the word segmentation result produced in the previous step, namely "2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit".
In step S1032, the word segmentation results of the extracted words are predicted according to whether the extracted words are related to personal names:
Wu, Guiying: Wu Guiying
Wang, Hao: Wang Hao
Chen, Tao: Chen Tao
Gan, Jing, Zhong: Gan Jingzhong
Liu Jun, Sheng: Liu Junsheng
Sun, Qi, Jun: Sun Qijun.
In step S1033, the results of the sequence labeling model are screened to remove results that are clearly not personal names. For some segmentation objects, for example, the sequence labeling result may mistakenly label a phrase such as "the king's man" as a personal name, so the labeling results need further screening.
In step S1034, fragments of the word segmentation result from the previous step are merged according to the screened results. The word segmentation result obtained after merging is as follows:
2016, year, 1, month, 29, day, district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
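Steps S1033 and S1034 can be sketched as merging runs of labeled words while letting a screening function veto implausible names. The sketch rests on assumptions that the text does not confirm: a B/I/O label scheme (B starts a name, I continues it, O is outside any name), a caller-supplied `is_plausible` check, and romanized tokens joined with a space.

```python
from typing import Callable, List


def merge_by_labels(words: List[str], labels: List[str],
                    is_plausible: Callable[[str], bool] = lambda s: True,
                    sep: str = " ") -> List[str]:
    """Merge each B, I, I... run into one word unless screening rejects it."""
    merged: List[str] = []
    group: List[str] = []

    def flush() -> None:
        if not group:
            return
        candidate = sep.join(group)
        if len(group) > 1 and is_plausible(candidate):
            merged.append(candidate)   # screened result accepted: merge
        else:
            merged.extend(group)       # rejected or single word: keep as is
        group.clear()

    for word, label in zip(words, labels):
        if label == "B":
            flush()
            group.append(word)
        elif label == "I" and group:
            group.append(word)
        else:
            flush()
            merged.append(word)
    flush()
    return merged


names = merge_by_labels(
    ["Wu", "Guiying", "Wang", "Hao", "Chen Hongzhi", "visited"],
    ["B", "I", "B", "I", "O", "O"],
)
```

A run that the screening function rejects is left as separate words, which corresponds to removing a labeling result that is clearly not a personal name.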
In step S104, common collocations in the word segmentation result produced in the previous step are spliced; for example, "2016, year, 1, month, 29, day" is spliced into "29 January 2016". Common collocations include numeral-classifier phrases, dates, times, letter expressions, and the like. The spliced result is taken as the final word segmentation result. For this example, the spliced result is: 29 January 2016, district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
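One possible form of the predefined rule for dates is sketched below: a run of alternating number and unit tokens is joined into a single word. The rule, the function name, and the unit set are assumptions made for illustration; the Chinese tokens are an assumed rendering of "2016, year, 1, month, 29, day, district, leader".

```python
from typing import FrozenSet, List

# Hypothetical unit set: year, month, day.
DATE_UNITS: FrozenSet[str] = frozenset({"年", "月", "日"})


def splice_numeric_collocations(words: List[str],
                                units: FrozenSet[str] = DATE_UNITS) -> List[str]:
    """Join consecutive (number, unit) pairs into one word."""
    out: List[str] = []
    i = 0
    while i < len(words):
        j = i
        parts: List[str] = []
        # Consume as many number + unit pairs as follow one another.
        while j + 1 < len(words) and words[j].isdigit() and words[j + 1] in units:
            parts.append(words[j] + words[j + 1])
            j += 2
        if parts:
            out.append("".join(parts))
            i = j
        else:
            out.append(words[i])
            i += 1
    return out


final_words = splice_numeric_collocations(
    ["2016", "年", "1", "月", "29", "日", "区", "领导"]
)
```

Times, numeral-classifier phrases, and letter expressions would need their own unit sets or patterns; the sketch covers only the date case from the example.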
According to the present invention, splicing the word segmentation result, labeling it, and splicing it again makes it possible to increase the granularity of the word segmentation result.
[Second embodiment]
In the first embodiment, the dictionary and the sequence labeling model are read by loading the program into the RAM 105 and running it there. In the second embodiment, by contrast, the sequence labeling unit performs the sequence labeling processing in the external memory 106.
An information processing apparatus in the prior art generally includes internal memory (an internal storage unit) such as RAM, and external storage (an external memory) such as an SD card or a hard disk. RAM is generally used to run application programs, whereas external storage is generally used to store databases and application programs. With the usual technique, the sequence labeling model is stored in external storage and the program corresponding to the model runs in internal memory. As a result, a mobile phone uses a large amount of internal memory when performing word segmentation, and processing is slow. In the present embodiment, the sequence labeling model is likewise stored in external storage, but the corresponding program is also run in external storage.
The word segmentation processing performed in the second embodiment is described below with reference to Fig. 5 and Fig. 6. Fig. 5 is a functional block diagram of the information processing apparatus according to the second embodiment of the present invention.
As shown in Fig. 5, the information processing apparatus 2000 includes: a segmentation unit 2102, which segments a segmentation object and obtains a word segmentation result expressed as a combination of multiple words; and a sequence labeling unit 2103, which performs the sequence labeling processing in the external memory. For the word segmentation result obtained by segmenting the segmentation object and expressed as a combination of multiple words, the sequence labeling unit 2103 uses the sequence labeling model to perform sequence labeling on the words in the combination and merges the words in the combination according to the result of the sequence labeling.
The sequence labeling unit 2103 includes: a storage section 21031, which stores the emission probabilities and state probabilities of the sequence labeling model in a first file in the external memory; a calculation section 21032, which performs a hash operation on the feature functions of the words in the combination and stores, in a second file and keyed by hash value, each feature function together with the storage location of the emission probability or state probability corresponding to that feature function; an extraction section 21033, which extracts, from the storage locations stored by the calculation section, the probability that adjacent words in the combination form one joint word; and a merging section 21034, which is configured to splice the words in the combination according to the extracted probabilities.
Fig. 6 is a flowchart of the word segmentation processing method according to the second embodiment of the present invention.
Referring to Fig. 6, the word segmentation processing method according to the second embodiment of the present invention is described by segmenting the segmentation object "I love Beijing Tiananmen".
In step S201, the sentence is divided into single characters: I, love, north, capital, day, peace, door.
In step S202, sequence labeling is performed on the word segmentation result of step S201. The sequence labeling processing includes steps S2021 to S2024 shown in Fig. 7.
In storing step S2021, the original model parameters of the sequence labeling model are divided into three parts for storage: the feature functions (second parameters); the emission probabilities and state probabilities (first parameters); and the feature templates and other parameters (third parameters). The emission probabilities and state probabilities are stored as a separate file (the first file).
In calculation step S2022, a hash operation is performed on the feature functions using the Blizzard hash algorithm, and each feature function, together with the storage location (value) of the emission probability or state probability corresponding to it, is then stored by hash value in another binary file (the second file). The feature templates and other parameters are stored as a third file.
Specifically, when the sequence labeling model is positioned at the character "north", one of the feature templates is U06:%x[0,0]/%x[1,0], which denotes the co-occurrence of the current character and the character in the next position, that is, U06:north/capital. The string "U06:north/capital" is passed as the variable to the Blizzard hash function, which yields three hash values: a main hash value M, a left hash value L, and a right hash value R. A binary shift operation is applied to the main hash value to obtain the storage position (that is, the specific address of the feature, such as the address of "north" and "capital" within the file), and the left and right hash values obtained are compared with the left and right hash values previously stored there. If they are the same, the feature is determined to be stored in the second file at the position given by the main hash value: if the entry is valid (true), the storage location of the emission probability (or state probability) stored in it is taken out; if it is not (false), -1 is returned. If they are not equal, 1 is added to M and the above lookup-and-compare operation is repeated.
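The lookup described above can be sketched with an in-memory table. The hash routine follows the widely circulated Blizzard "one-way hash" (a precomputed table plus two rolling seeds), but it is a sketch rather than the embodiment's code: keys are hashed as UTF-8 bytes without upper-casing, the table lives in a Python list instead of the second file, and `FeatureIndex`, `bits`, and the stored location 128 are hypothetical.

```python
from typing import List, Optional, Tuple

MASK32 = 0xFFFFFFFF


def build_crypt_table() -> List[int]:
    """Precompute the 0x500-entry table used by the hash."""
    table = [0] * 0x500
    seed = 0x00100001
    for index1 in range(0x100):
        index2 = index1
        for _ in range(5):
            seed = (seed * 125 + 3) % 0x2AAAAB
            high = (seed & 0xFFFF) << 16
            seed = (seed * 125 + 3) % 0x2AAAAB
            table[index2] = high | (seed & 0xFFFF)
            index2 += 0x100
    return table


CRYPT_TABLE = build_crypt_table()


def blizzard_hash(key: str, hash_type: int) -> int:
    """hash_type 0, 1, 2 give the main, left and right hash values."""
    seed1, seed2 = 0x7FED7FED, 0xEEEEEEEE
    for byte in key.encode("utf-8"):
        seed1 = CRYPT_TABLE[(hash_type << 8) + byte] ^ ((seed1 + seed2) & MASK32)
        seed2 = (byte + seed1 + seed2 + (seed2 << 5) + 3) & MASK32
    return seed1


class FeatureIndex:
    """Open-addressing table: feature string -> storage location."""

    def __init__(self, bits: int = 10) -> None:
        self.bits = bits
        self.size = 1 << bits
        self.slots: List[Optional[Tuple[int, int, int]]] = [None] * self.size

    def _probe(self, feature: str) -> Tuple[int, int, int]:
        main = blizzard_hash(feature, 0)
        # Binary shift of the main hash gives the starting slot.
        return (main >> (32 - self.bits),
                blizzard_hash(feature, 1), blizzard_hash(feature, 2))

    def put(self, feature: str, location: int) -> None:
        pos, left, right = self._probe(feature)
        for _ in range(self.size):
            slot = self.slots[pos]
            if slot is None or slot[:2] == (left, right):
                self.slots[pos] = (left, right, location)
                return
            pos = (pos + 1) % self.size
        raise RuntimeError("feature index is full")

    def get(self, feature: str) -> int:
        pos, left, right = self._probe(feature)
        for _ in range(self.size):
            slot = self.slots[pos]
            if slot is None:
                return -1                 # empty slot: feature not stored
            if slot[0] == left and slot[1] == right:
                return slot[2]            # left and right hashes match
            pos = (pos + 1) % self.size   # mismatch: try the next slot
        return -1


index = FeatureIndex()
index.put("U06:north/capital", 128)
```

Comparing the left and right hash values instead of the feature strings themselves means the strings never need to be stored, which keeps the second file compact.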
In extraction step S2023, the probability that adjacent words form one joint word is extracted using the storage locations, stored in step S2022, of the feature functions and the emission probabilities or state probabilities corresponding to them.
Specifically, the return value (address) of the repeated lookup-and-compare operation in step S2022 is used as the position of the emission probability (or state probability) in the first file, and the values at that position are read. The number of weights or probabilities read equals the number of labels used in sequence labeling, and each weight represents, for a given label of the current character such as B, the probability that "U06:north/capital" occurs jointly. For example, the probability that "Beijing" occurs as a joint word is 98%, and the probability that "Tiananmen" occurs as a joint word is 95%.
In merging step S2024, the word segmentation result of step S201 is spliced according to the probabilities obtained in step S2023.
Specifically, the word segmentation result of step S201 is: I, love, north, capital, day, peace, door. According to the probabilities obtained in step S2023, the probability that "Beijing" is one joint word is 98% and the probability that "Tiananmen" is one joint word is 95%. It is therefore determined that "north" and "capital" are spliced into "Beijing", and that "day", "peace", and "door" are spliced into "Tiananmen". The word segmentation result finally obtained in step S2024 is: I, love, Beijing, Tiananmen.
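The merging in step S2024 can be sketched as follows, using the two probabilities from the example. The threshold of 0.9, the representation of candidates as (start, end) spans, and the romanized syllables standing in for the single characters are assumptions; the text does not state how the probabilities are turned into a merge decision.

```python
from typing import Dict, List, Tuple


def merge_spans(words: List[str],
                span_probs: Dict[Tuple[int, int], float],
                threshold: float = 0.9) -> List[str]:
    """Join words[start:end] when its joint-word probability passes the threshold."""
    out: List[str] = []
    i = 0
    while i < len(words):
        best_end = None
        for (start, end), prob in span_probs.items():
            if start == i and end > i and prob >= threshold:
                if best_end is None or end > best_end:
                    best_end = end   # prefer the longest qualifying span
        if best_end is None:
            out.append(words[i])
            i += 1
        else:
            out.append("".join(words[i:best_end]))
            i = best_end
    return out


# "Bei"/"jing" and "Tian"/"an"/"men" stand for the characters glossed as
# north/capital and day/peace/door in the text.
characters = ["I", "love", "Bei", "jing", "Tian", "an", "men"]
span_probs = {(2, 4): 0.98, (4, 7): 0.95}
segmented = merge_spans(characters, span_probs)
```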
According to the second embodiment of the present invention, sequence labeling processing is performed in external storage rather than in internal memory, which reduces the use of the internal memory of the information processing apparatus and improves its running speed.
[Third embodiment]
The hardware configuration of the information processing apparatus of the third embodiment of the present invention is the same as that of the first and second embodiments. The technical solution of the third embodiment is a combination of those of the first and second embodiments. That is, the information processing apparatus of the third embodiment includes the selecting unit, first splicing unit, and second splicing unit of the first embodiment, and the external memory and sequence labeling unit of the second embodiment.
Fig. 8 is a functional block diagram of the information processing apparatus according to the third embodiment of the present invention.
As shown in Fig. 8, the information processing apparatus 3000 includes: a selecting unit 3101, which segments a segmentation object (for example, a sentence input by the user via the touch screen) and obtains a word segmentation result expressed as a combination of multiple words; a first splicing unit 3102, which splices adjacent words in the combination; a sequence labeling unit 3103, which uses a sequence labeling model to perform sequence labeling on each word in the combination spliced by the first splicing unit 3102 and merges words in the combination according to the result of the sequence labeling; and a second splicing unit 3104, which splices, according to a predefined rule, the words in the combination merged by the sequence labeling unit 3103.
The sequence labeling unit 3103 includes: a storage section 31031, which stores the emission probabilities and state probabilities of the sequence labeling model in a first file in the external memory; a calculation section 31032, which performs a hash operation on the feature functions of the words in the combination and stores, in a second file and keyed by hash value, each feature function together with the storage location of the emission probability or state probability corresponding to that feature function; an extraction section 31033, which extracts, from the storage locations stored by the calculation section, the probability that adjacent words in the combination form one joint word; and a merging section 31034, which is configured to splice the words in the combination according to the extracted probabilities.
The word segmentation processing method of the third embodiment includes the selecting step, the first splicing step, and the second splicing step of the first embodiment, while the sequence labeling step between the first splicing step and the second splicing step is the sequence labeling step of the second embodiment.
According to the third embodiment of the present invention, an information processing apparatus can be obtained that has a large segmentation granularity, uses little internal memory, and processes quickly.
The information processing apparatus of the present invention achieves the following technical effect: common collocations and semantically meaningful combinations are, as far as possible, kept together as single segments, so that more meaningful fragments can conveniently be extracted from the word segmentation result.
Although the present invention has been described above with reference to exemplary embodiments, the above embodiments merely illustrate the technical concept and features of the present invention and are not intended to limit its scope of protection. Any equivalent change or modification made in accordance with the spirit of the present invention shall fall within the scope of protection of the present invention.