CN107291695A - Information processing apparatus and word segmentation processing method thereof - Google Patents

Information processing apparatus and word segmentation processing method thereof

Publication number
CN107291695A
Authority
CN
China
Prior art keywords
word
word segmentation
sequence labeling
combination
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710505392.0A
Other languages
Chinese (zh)
Other versions
CN107291695B (en)
Inventor
侯兴林
亓超
王卓然
马宇驰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Triangle Animal (beijing) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Triangle Animal (beijing) Technology Co Ltd
Priority to CN201811400632.1A (CN109492228B)
Priority to CN201710505392.0A (CN107291695B)
Publication of CN107291695A
Application granted
Publication of CN107291695B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention provides an information processing apparatus and a word segmentation processing method thereof. The information processing apparatus includes: a selecting unit configured to segment a segmentation target and obtain a segmentation result expressed as a combination of multiple words; a first splicing unit configured to splice adjacent words in the combination; a sequence labeling unit configured to use a sequence labeling model to label each word in the combination after splicing by the first splicing unit, and to merge words in the combination according to the labeling result; and a second splicing unit configured to splice, according to a predetermined rule, the words merged by the sequence labeling unit.

Description

Information processing apparatus and word segmentation processing method thereof
Technical Field
The present invention relates to an information processing apparatus capable of performing word segmentation, and to a word segmentation processing method thereof.
Background
Existing word segmentation methods fall mainly into three kinds: methods based on string matching, methods based on understanding, and methods based on statistics. For example, the prior art (Chinese patent application publication No. CN104462051A) describes a statistics-based segmentation method that includes: obtaining the number of times a word was searched in different search fields over a period of time, and calculating a statistical score for the word from that count; calculating a length score for the word from its length; obtaining the word's score from its statistical score and length score, and generating a segmentation dictionary from the words and their scores; and obtaining a sentence to be segmented, matching it against the words in the segmentation dictionary to obtain multiple segmentation results, calculating the score of each segmentation result, and taking the highest-scoring result as the segmentation result of the sentence.
However, in the segmentation technique disclosed in the above publication, the segmentation result depends too heavily on the segmentation dictionary. When the technique is used on an information processing apparatus such as a mobile phone or tablet computer, a very large dictionary cannot be used, so the granularity of the segmentation result is too fine. In addition, because the program must run in internal memory, it occupies excessive memory resources and the system runs slowly.
Summary of the Invention
The present invention is proposed in view of the above problems of the prior art, in order to solve all or at least one of them. An object of the invention is to provide a word segmentation technique with coarse segmentation granularity and fast processing speed.
According to one aspect of the present invention, there is provided an information processing apparatus capable of performing word segmentation, characterized in that the information processing apparatus includes: a selecting unit configured to segment a segmentation target and obtain a segmentation result expressed as a combination of multiple words; a first splicing unit configured to splice adjacent words in the combination; a sequence labeling unit configured to use a sequence labeling model to label each word in the combination after splicing by the first splicing unit, and to merge words in the combination according to the labeling result; and a second splicing unit configured to splice, according to a predetermined rule, the words merged by the sequence labeling unit.
The technical solution of the first aspect of the invention realizes an information processing apparatus with coarse segmentation granularity.
According to another aspect of the present invention, there is provided an information processing apparatus capable of performing word segmentation, the information processing apparatus including an external memory that stores a sequence labeling model, characterized in that the information processing apparatus includes: a segmentation unit configured to segment a segmentation target and obtain a segmentation result expressed as a combination of multiple words; and a sequence labeling unit configured to perform, on the segmentation result obtained by segmenting the segmentation target and expressed as a combination of multiple words, sequence labeling processing on the words in the combination using the sequence labeling model, and to merge words in the combination according to the labeling result, wherein the sequence labeling unit performs the sequence labeling processing in the external memory.
The technical solution of the second aspect of the invention realizes an information processing apparatus that occupies little internal memory and processes quickly.
According to another aspect of the present invention, there is provided a word segmentation processing method for an information processing apparatus, the method comprising the following steps: a selecting step of segmenting a segmentation target and obtaining a segmentation result expressed as a combination of multiple words; a first splicing step of splicing adjacent words in the combination; a sequence labeling step of using a sequence labeling model to label each word in the combination after splicing in the first splicing step, and merging words in the combination according to the labeling result; and a second splicing step of splicing, according to a predetermined rule, the words in the combination merged in the sequence labeling step.
According to another aspect of the present invention, there is provided a word segmentation processing method for an information processing apparatus that includes an external memory storing a sequence labeling model, the method comprising the following steps: a segmentation step of segmenting a segmentation target and obtaining a segmentation result expressed as a combination of multiple words; and a sequence labeling step of performing, on the segmentation result obtained by segmenting the segmentation target and expressed as a combination of multiple words, sequence labeling processing on the words in the combination using the sequence labeling model, and merging words in the combination according to the labeling result, wherein in the sequence labeling step the sequence labeling processing is performed in the external memory.
The information processing apparatus and word segmentation processing method of the present invention perform segmentation at a coarser granularity and occupy fewer memory resources, thereby speeding up processing by the information processing apparatus.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some of the embodiments described in the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a block diagram illustrating the hardware configuration of the information processing apparatus according to the present invention.
Fig. 2 is a functional block diagram illustrating the information processing apparatus according to the first embodiment of the present invention.
Fig. 3 is a flowchart illustrating the word segmentation processing method according to the first embodiment of the present invention.
Fig. 4 is a flowchart illustrating the method of performing sequence labeling processing according to the first embodiment of the present invention.
Fig. 5 is a functional block diagram illustrating the information processing apparatus according to the second embodiment of the present invention.
Fig. 6 is a flowchart illustrating the word segmentation processing method according to the second embodiment of the present invention.
Fig. 7 is a flowchart illustrating the method of performing sequence labeling processing according to the second embodiment of the present invention.
Fig. 8 is a functional block diagram illustrating the information processing apparatus according to the third embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings. It should be understood that the following embodiments are not intended to limit the invention, and that not all combinations of the aspects described in the following embodiments are necessarily required for the means by which the invention solves the problem. For brevity, identical reference signs or numerals are used for identical structural parts or steps, and their description is omitted.
[Hardware configuration of the information processing apparatus]
First, the hardware configuration of the information processing apparatus 1000 is described with reference to Fig. 1. The following configuration is described as an example in the present embodiment, but the information processing apparatus of the present invention is not limited to the configuration shown in Fig. 1.
Fig. 1 shows the hardware configuration of the information processing apparatus 1000 in the present embodiment. In the present embodiment, a smartphone is described as an example of the information processing apparatus. Note that although a mobile terminal (including but not limited to a smartphone, smart watch, smart band, or music player) is illustrated as the information processing apparatus 1000 in this embodiment, the invention is of course not limited to this. The information processing apparatus of the invention may be any of various devices, such as a notebook computer, tablet computer, PDA (personal digital assistant), PC, or an internet device with a touch display and an information processing function (for example, a digital camera, refrigerator, or television).
As shown in Fig. 1, the information processing apparatus 1000 (2000, 3000) includes an input interface 102, a CPU 103, a ROM 104, a RAM 105, an external memory 106, an output interface 107, a display 108, a communication unit 109 and a short-range wireless communication unit 110, connected to one another via a system bus. The input interface 102 is an interface for receiving data input by the user and instructions to execute functions; it receives data and operation instructions from the user via an operating unit (not shown) such as keys, buttons or a touch screen. Note that the display 108 described later and the operating unit may be at least partially integrated; for example, screen output and reception of user operations may be performed on the same screen.
The CPU 103 is the system control unit and controls the information processing apparatus 1000 as a whole. For example, the CPU 103 controls display on the display 108 of the information processing apparatus 1000. The ROM 104 stores fixed data such as data tables, together with the control programs and operating system (OS) program executed by the CPU 103. In the present embodiment, each control program stored in the ROM 104 performs software execution control such as scheduling, task switching and interrupt handling under the management of the OS stored in the ROM 104.
The RAM 105 (internal storage unit) is formed of, for example, SRAM (static random access memory) requiring a backup power supply, DRAM, or the like. In this case, the RAM 105 can store important data such as program control variables in a non-volatile manner. A storage area for setting information of the information processing apparatus 1000, management data of the information processing apparatus 1000, and the like is also provided in the RAM 105. The RAM 105 also serves as the working memory and main memory of the CPU 103.
The external memory 106 stores, for example, a predefined dictionary, the sequence labeling model, and an application program for executing the word segmentation processing method according to the invention. The external memory 106 also stores various programs, such as an information transmission/reception control program for transmitting to and receiving from a communication device (not shown) via the communication unit 109, and the various information used by these programs.
The output interface 107 is an interface for controlling the display 108 to display information and application screens. The display 108 is formed of, for example, an LCD (liquid crystal display). By arranging on the display 108 a soft keyboard with keys such as numeric input keys, mode setting keys, an enter key, a cancel key and a power key, input from the user can be received via the display 108.
The information processing apparatus 1000 performs data communication with an external device (not shown) via the communication unit 109, using a wireless communication method such as Wi-Fi (Wireless Fidelity) or Bluetooth.
The information processing apparatus 1000 can also connect wirelessly to an external device or the like within a short range and perform data communication via the short-range wireless communication unit 110. The short-range wireless communication unit 110 communicates by a communication method different from that of the communication unit 109. For example, Bluetooth Low Energy (BLE), whose communication range is shorter than that of the communication method of the communication unit 109, may be used as the communication method of the short-range wireless communication unit 110. NFC (near-field communication) or Wi-Fi Aware, for example, may also be used as the communication method of the short-range wireless communication unit 110.
[First Embodiment]
Next, the software configuration of the information processing apparatus according to the first embodiment is described with reference to Fig. 2.
As shown in Fig. 2, the information processing apparatus 1000 includes: a selecting unit 1101, which segments a segmentation target (for example, a sentence the user inputs via the touch screen) and obtains a segmentation result expressed as a combination of multiple words; a first splicing unit 1102, which splices adjacent words in the combination; a sequence labeling unit 1103, which uses a sequence labeling model to label each word in the combination after splicing by the first splicing unit, and merges words in the combination according to the labeling result; and a second splicing unit 1104, which splices, according to a predetermined rule, the words in the combination merged by the sequence labeling unit. The sequence labeling unit 1103 includes: an extraction section 11031, which extracts words of a predefined type from the words in the combination after splicing by the first splicing unit 1102; a prediction section 11032, which predicts segmentation results for the extracted words according to the predefined type; a selection section 11033, which selects a segmentation result from the predicted segmentation results; and a merging section 11034, which is configured to merge words in the combination according to the segmentation result selected by the selection section.
Below, the word segmentation processing method according to the first embodiment of the present invention is described with reference to Fig. 3.
As shown in Fig. 3, the word segmentation processing method may include the following steps S101 to S104:
In step S101, the sentence to be segmented is obtained and matched against the words in the segmentation dictionary; all matched word combinations are then taken out, and the combination with the highest score under the segmentation strategy is determined. The segmentation strategy includes term weight and language model score.
Next, in step S102, adjacent words in the segmentation result are spliced together; if the spliced result occurs in the dictionary, it replaces the original segmentation result.
Then, in step S103, the segmentation result produced in the previous step is fed into the sequence labeling model. After sequence labeling, the model's results are screened, and fragments of the previous step's segmentation result are merged according to the screened results. The process then proceeds to step S104.
In step S104, certain common collocations in the segmentation result produced in the previous step are spliced, for example numeral-classifier phrases, dates, times and alphabetic expressions, and the spliced result is taken as the final segmentation result.
Below, the above segmentation process is illustrated by segmenting the sentence "On January 29, 2016, district leaders Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng and Sun Qijun visited the Ministry of Foreign Affairs, the People's Daily agency and other central units stationed in the district".
In step S101, basic segmentation is performed: the characters of the obtained sentence are split into words in different ways, forming multiple combinations. Each word in each combination is matched against the words in the segmentation dictionary, and all matched word combinations are then taken out.
For example, the sentence is split into the following combinations (the Chinese words are shown here in romanized or glossed form):
(1) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visit, foreign affairs, ministry, People's Daily, agency, etc., stationed in district, central, unit;
(2) 2016, year, 1, month, 29, day, district, leader, Wu Gui, Ying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu, Junsheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs People, Daily agency etc., stationed in district, central unit;
(3) 2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hong, Zhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People, Daily agency, etc., stationed in district, central, unit.
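The candidate combinations of step S101 can be produced by enumerating every way of cutting the sentence into dictionary words. The following is a minimal sketch under stated assumptions: the function name `all_segmentations`, the toy dictionary over Latin letters, and the rule that a single character is always allowed as a fallback are all illustrative and are not taken from the patent.

```python
from functools import lru_cache


def all_segmentations(sentence, dictionary):
    """Return every way to split `sentence` into dictionary words.

    Assumption: a single character is always accepted, so that at least
    one segmentation exists even for out-of-vocabulary text.
    """

    @lru_cache(maxsize=None)
    def split_from(start):
        if start == len(sentence):
            return [[]]  # one way to segment the empty remainder
        results = []
        for end in range(start + 1, len(sentence) + 1):
            piece = sentence[start:end]
            if end - start == 1 or piece in dictionary:
                for rest in split_from(end):
                    results.append([piece] + rest)
        return results

    return split_from(0)


# Toy dictionary over Latin letters, standing in for a Chinese word list.
toy_dictionary = frozenset({"ab", "bc", "abc"})
combos = all_segmentations("abc", toy_dictionary)
for combo in combos:
    print(combo)
```

Exhaustive enumeration grows quickly with sentence length, so a practical system would likely prune or score candidates while building them; the sketch only shows where the multiple combinations come from.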
The score of each combination under the segmentation strategy is then calculated, and the highest-scoring combination is selected. The segmentation strategy includes, but is not limited to, term weight and language model score. For the segmentation vocabulary used in this step, when the vocabulary is built, the frequency with which each word occurs in the corpus is stored in addition to the word itself. The term weight is the sum of the frequencies of the words in the current combination.
The score calculation is illustrated below with a simpler example. In the combination "I", "love", "Beijing Tian'anmen", the frequency of "I" is 130132, the frequency of "love" is 74150, and the frequency of "Beijing Tian'anmen" is 5924, so the term weight of the combination is 130132 + 74150 + 5924 = 210206. The term weights of the multiple combinations are then normalized: the highest term weight among all combinations is the denominator, and the current term weight is the numerator. Next, the bigram language model probability of the entire combination is used as the language model score. Finally, the language model score is multiplied by the term weight score to give the final score.
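The term-weight arithmetic described above can be sketched as follows. The function names `term_weight` and `normalise` are illustrative, and the second weight passed to `normalise` (105103) is a made-up value used only to show the division by the maximum; the three word frequencies are the ones given in the text.

```python
def term_weight(words, frequency):
    """Sum of the corpus frequencies of the words in one combination."""
    return sum(frequency[word] for word in words)


def normalise(weights):
    """Divide each term weight by the highest term weight of all combinations."""
    top = max(weights)
    return [weight / top for weight in weights]


# Frequencies quoted in the text for the simpler example.
frequency = {"I": 130132, "love": 74150, "Beijing Tian'anmen": 5924}
weight = term_weight(["I", "love", "Beijing Tian'anmen"], frequency)
print(weight)

# 105103 is a hypothetical weight of a second combination.
print(normalise([weight, 105103]))
```

After normalization the best combination always scores 1.0 on term weight, so the language model score alone decides how far it is discounted.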
Calculating combinations (1) to (3) under the term weight strategy gives the following scores:
Combination (1) scores 0.7411.
Combination (2) scores 1.0.
Combination (3) scores 0.8951.
Calculating each combination under the language model strategy gives the following scores:
Combination (1) scores 0.9013.
Combination (2) scores 0.7542.
Combination (3) scores 0.9631.
The final scores of combinations (1) to (3) are 0.6680, 0.7542 and 0.8620 respectively. The highest-scoring combination, namely combination (3), is selected for the next step.
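The final scores can be checked by multiplying the two partial scores quoted above for each combination. This is a small sketch using only the figures in the text; the variable names are illustrative, and the products agree with the quoted final scores to within rounding of the fourth decimal place.

```python
# Normalized term-weight scores and language-model scores from the text.
term_scores = {1: 0.7411, 2: 1.0, 3: 0.8951}
lm_scores = {1: 0.9013, 2: 0.7542, 3: 0.9631}

# Final score = term-weight score multiplied by language-model score.
final = {k: term_scores[k] * lm_scores[k] for k in term_scores}

# The combination with the highest final score goes on to step S102.
best = max(final, key=final.get)
print(final, best)
```

Combination (2) has the highest term weight but the lowest language model score, which is why multiplying the two favors combination (3).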
In step S102, adjacent words in the segmentation result of combination (3) are spliced together. For example, "People" and "Daily agency" are spliced together, and the result after splicing is as follows:
2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
If the spliced word is present in the dictionary, the spliced segmentation result replaces the original segmentation result.
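The first splicing step can be sketched as a single left-to-right pass, following the rule stated for step S102 that a spliced word is kept only when the dictionary contains it. The function name `splice_adjacent`, the `joiner` parameter and the English stand-in tokens are assumptions for illustration; Chinese text would use an empty joiner.

```python
def splice_adjacent(words, dictionary, joiner=""):
    """Join neighbouring words whenever the joined form is in the dictionary."""
    result = []
    for word in words:
        candidate = result[-1] + joiner + word if result else None
        if candidate is not None and candidate in dictionary:
            result[-1] = candidate  # replace the previous word with the spliced one
        else:
            result.append(word)
    return result


# English stand-ins for the tokens of combination (3).
words = ["visited", "Ministry of Foreign Affairs", "People's", "Daily agency", "etc."]
dictionary = {"People's Daily agency"}
spliced = splice_adjacent(words, dictionary, joiner=" ")
print(spliced)
```

Because the spliced word stays at the end of `result`, a further neighbor can be attached to it in the same pass if the longer form is also a dictionary word.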
Step S103 includes steps S1031 to S1034, as shown in Fig. 4.
In step S1031, the name-related words are extracted from the segmentation result produced in the previous step, that is, from "2016, year, 1, month, 29, day, district, leader, Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit". The extracted words are "Wu, Guiying, Wang, Hao, Chen Hongzhi, Chen, Tao, Gan, Jing, Zhong, Liu Jun, Sheng, Sun, Qi, Jun".
In step S1032, the segmentation results of the extracted words are predicted according to whether the extracted words are name-related:
Wu, Guiying: Wu Guiying
Wang, Hao: Wang Hao
Chen, Tao: Chen Tao
Gan, Jing, Zhong: Gan Jingzhong
Liu Jun, Sheng: Liu Junsheng
Sun, Qi, Jun: Sun Qijun.
In step S1033, the results of the sequence labeling model are screened, and results that are clearly not names are removed. For example, for some segmentation targets the labeling result may wrongly mark a phrase such as "the king's man" as a name, so the labeling results need further screening.
In step S1034, fragments of the previous step's segmentation result are merged according to the screened results. The segmentation result obtained after merging is as follows:
2016, year, 1, month, 29, day, district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
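Merging fragments according to the labeling result can be sketched as below. The patent text does not specify the tag set at this point, so the B/I/O scheme (B begins a name, I continues it, O is any other word), the function name `merge_by_labels` and the romanized tokens are all assumptions made for illustration.

```python
def merge_by_labels(words, labels, joiner=""):
    """Merge each run of words labelled B, I, I, ... into a single word.

    Assumed tag set: "B" begins a name, "I" continues it, "O" is outside.
    """
    merged, inside = [], False
    for word, label in zip(words, labels):
        if label == "I" and inside:
            merged[-1] += joiner + word  # extend the name that is currently open
        else:
            merged.append(word)
            inside = label == "B"  # only a B tag opens a new name
    return merged


words = ["leader", "Wu", "Guiying", "Wang", "Hao", "visited"]
labels = ["O", "B", "I", "B", "I", "O"]
merged = merge_by_labels(words, labels, joiner=" ")
print(merged)
```

An I tag that does not follow an open name is treated as an ordinary word, so a stray label left after screening does not glue unrelated words together.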
In step S104, certain common collocations in the segmentation result produced in the previous step are spliced; for example, "2016, year, 1, month, 29, day" is spliced into "January 29, 2016". Common collocations include numeral-classifier phrases, dates, times, alphabetic expressions and the like, and the spliced result is taken as the final segmentation result. For this example, the spliced result is: January 29, 2016, district, leader, Wu Guiying, Wang Hao, Chen Hongzhi, Chen Tao, Gan Jingzhong, Liu Junsheng, Sun Qijun, visited, Ministry of Foreign Affairs, People's Daily agency, etc., stationed in district, central, unit.
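One possible form of the predetermined rule for dates is to merge any run of alternating number and unit tokens. The sketch below makes that assumption explicit: the unit words "year", "month" and "day" are English stand-ins for the Chinese date characters, and the function name `splice_number_unit_runs` is hypothetical; the patent does not spell out the rule itself.

```python
UNITS = {"year", "month", "day"}  # stand-ins for the Chinese date units


def splice_number_unit_runs(words, units=UNITS, joiner=" "):
    """Merge each maximal run of (number, unit) token pairs into one word."""
    out, i = [], 0
    while i < len(words):
        j = i
        # Advance over consecutive pairs such as "2016" + "year".
        while j + 1 < len(words) and words[j].isdigit() and words[j + 1] in units:
            j += 2
        if j > i:
            out.append(joiner.join(words[i:j]))
            i = j
        else:
            out.append(words[i])
            i += 1
    return out


tokens = ["2016", "year", "1", "month", "29", "day", "district", "leader"]
final_tokens = splice_number_unit_runs(tokens)
print(final_tokens)
```

Times and numeral-classifier phrases could be handled the same way by extending the unit set, which keeps this second splicing step purely rule-based.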
According to the present invention, splicing the segmentation result, labeling it, and splicing it again increases the granularity of the segmentation result.
[Second Embodiment]
In the first embodiment, the dictionary and the sequence labeling model are read by loading the program into the RAM 105 and running it there. In the second embodiment, by contrast, the sequence labeling unit performs the sequence labeling processing in the external memory 106.
A prior-art information processing apparatus generally includes internal memory (an internal storage unit) such as RAM, and external storage (an external memory) such as an SD card or hard disk. RAM is typically used to run application programs, while external storage is typically used to store databases and application programs. With common techniques, the sequence labeling model may be stored in external storage while the program corresponding to the model runs in internal memory. This causes a mobile phone to occupy more internal memory during segmentation and makes processing slower. In the present embodiment, the sequence labeling model is likewise stored in external storage, but the corresponding program is also run in external storage.
The word segmentation processing performed in the second embodiment is described below with reference to Fig. 5 and Fig. 6. Fig. 5 is a functional block diagram illustrating the information processing apparatus according to the second embodiment of the present invention.
As shown in Fig. 5, the information processing apparatus 2000 includes: a segmentation unit 2102, which segments a segmentation target and obtains a segmentation result expressed as a combination of multiple words; and a sequence labeling unit 2103, which performs the sequence labeling processing in the external memory. For the segmentation result obtained by segmenting the segmentation target and expressed as a combination of multiple words, the sequence labeling unit labels the words in the combination using the sequence labeling model and merges words in the combination according to the labeling result. The sequence labeling unit 2103 includes: a storage section 21031, which stores the emission probabilities and state probabilities of the sequence labeling model in a first file of the external memory; a calculating section 21032, which performs a hash operation on the feature functions of the words in the combination and stores each feature function, together with the storage position of the emission probability or state probability corresponding to that feature function, in a second file as hash values; an extraction section 21033, which extracts, from the storage positions stored by the calculating section, the probability that adjacent words in the combination form one joint word; and a merging section 21034, which is configured to splice the words in the combination according to the extracted probabilities.
Fig. 6 is a flowchart illustrating the word segmentation processing method according to the second embodiment of the present invention.
Referring to Fig. 6, the word segmentation processing method according to the second embodiment is described by segmenting the segmentation target "I love Beijing Tian'anmen".
In step s 201, the sentence is divided into:My love, north, capital, day, peace, door.
In step S202, sequence labelling is carried out to the word segmentation result in step S201, sequence labelling processing includes such as Fig. 7 Shown step S2021 to S2024.
In storing step S2021, the grand master pattern shape parameter of sequence criteria model is divided into the storage of three parts, is characterized respectively Function (the second parameter), emission probability and state probability (the first parameter), feature templates and other specification (the 3rd parameter).Its In, emission probability and state probability are stored (the first file) as a unique file.
In calculation procedure S2022, Hash operation is carried out to characteristic function using severe snow hash algorithm, then by each feature The storage location (value) of function and the emission probability corresponding with this feature function or state probability, is stored in another with cryptographic Hash In individual binary file (the second file).Storage feature templates and other specification are stored as the 3rd file.
Specifically, when the sequence labelling model is positioned at the word "north", the feature template U06:%x[0,0]/%x[1,0] is interpreted as the co-occurrence of the current word and the word in the next position, i.e. U06:north/capital. "U06:north/capital" is passed as the variable to the Blizzard hash function, which yields three hash values: a main hash value M, a left hash value L and a right hash value R. A binary shift operation is applied to the main hash value to obtain the storage slot (that is, the specific address of the corresponding feature, such as the address of "north" and "capital" in the file), and the left and right hash values obtained are compared with the left and right hash values stored there in advance. If they are identical, the storage location is determined to be stored in the second file under this main hash value. If the result is true, the stored storage location of the emission probability (or state probability) is then taken out; if it is false, -1 is returned. If the hash values are not equal, M is incremented by 1 and the above lookup and comparison are repeated.
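The template expansion in this paragraph can be sketched as follows. This is a minimal illustration, assuming macros of the form %x[row,col] as in the template quoted in the text (the notation resembles CRF++-style feature templates). The function name `expand_template`, the single-column input and the boundary marker `_B` are hypothetical and are not taken from the patent.

```python
import re

# Matches macros such as %x[0,0] or %x[-1,0]:
# a relative row offset followed by a column index.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")

def expand_template(template, rows, pos):
    """Expand one feature template at position `pos`.

    `rows` is a list of token rows; each row is a list of columns
    (here only column 0, the token itself). Offsets that fall outside
    the sentence are replaced by the hypothetical marker "_B".
    """
    def substitute(match):
        offset, col = int(match.group(1)), int(match.group(2))
        idx = pos + offset
        if 0 <= idx < len(rows):
            return rows[idx][col]
        return "_B"
    return MACRO.sub(substitute, template)

tokens = [["I love"], ["north"], ["capital"], ["day"], ["peace"], ["door"]]
feature = expand_template("U06:%x[0,0]/%x[1,0]", tokens, 1)
print(feature)  # U06:north/capital
```

The expanded string is what would then be passed to the hash function as the lookup key.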
In the extraction step S2023, the probability that adjacent words form one joint word is extracted using the feature functions and the storage locations of the corresponding emission probabilities or state probabilities stored in step S2022.
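Before a probability can be extracted, its storage location has to be found through the hash lookup of step S2022. The sketch below illustrates that lookup as an open-addressing table: the main hash value selects a slot, the left and right hash values verify the entry, and a mismatch moves on to M + 1. It is a sketch under stated assumptions, not the patent's implementation: the three hash values are derived here from salted BLAKE2b digests as a stand-in for the hash function named in the text, the slot is chosen by modulo rather than by a binary shift, and the names `three_hashes` and `FeatureIndex` are hypothetical.

```python
import hashlib

def three_hashes(key, table_size):
    """Stand-in for the three-value hash: returns (M, L, R).

    Assumption: three independent values are derived from BLAKE2b with
    different salts; the hash function of the patent is not reproduced.
    """
    def h(salt):
        digest = hashlib.blake2b(key.encode("utf-8"),
                                 digest_size=8, salt=salt).digest()
        return int.from_bytes(digest, "big")
    return h(b"main") % table_size, h(b"left"), h(b"right")

class FeatureIndex:
    """Maps a feature string to a storage location without storing the string."""

    def __init__(self, size=1024):
        self.size = size
        self.slots = [None] * size  # each slot: (L, R, location) or None

    def put(self, feature, location):
        m, l, r = three_hashes(feature, self.size)
        for step in range(self.size):
            i = (m + step) % self.size
            if self.slots[i] is None or self.slots[i][:2] == (l, r):
                self.slots[i] = (l, r, location)
                return
        raise RuntimeError("index is full")

    def get(self, feature):
        m, l, r = three_hashes(feature, self.size)
        for step in range(self.size):
            slot = self.slots[(m + step) % self.size]
            if slot is None:
                return -1        # empty slot: the feature was never stored
            if slot[:2] == (l, r):
                return slot[2]   # check hashes match: return the location
            # mismatch: a different feature occupies this slot, try M + 1
        return -1

index = FeatureIndex(size=8)
index.put("U06:north/capital", 0)
index.put("U06:day/peace", 16)
print(index.get("U06:north/capital"))  # 0
```

Comparing two extra hash values instead of the feature string keeps each entry at a fixed size, which is what allows such a table to be laid out in a binary file.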
Specifically, the return value (address) of the lookup-and-compare operation repeated in step S2022 is used as the position of the emission probability (or state probability) in the first file, and the values at that position are read. The number of weights or probabilities read equals the number of sequence labelling labels; each weight represents, for a given label of the current word (for example B), the probability that "U06:north/capital" occurs jointly. For example, the probability that "Beijing" occurs jointly is 98%, and the probability that "Tian An-men" occurs jointly is 95%.
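Reading the weights at a known position of the first file can be sketched as follows. This is a minimal illustration, assuming a hypothetical label set B/M/E/S and one little-endian float32 weight per label; the file layout, the function names and the sample values are assumptions made for the example, not details from the patent.

```python
import os
import struct
import tempfile

LABELS = ["B", "M", "E", "S"]   # hypothetical label set
RECORD = struct.Struct("<4f")   # one float32 weight per label

def write_weights(path, weight_rows):
    """Write rows of per-label weights; return the byte offset of each row."""
    offsets = []
    with open(path, "wb") as f:
        for row in weight_rows:
            offsets.append(f.tell())
            f.write(RECORD.pack(*row))
    return offsets

def read_weights(path, offset):
    """Seek to `offset` and read one weight per label.

    Only RECORD.size bytes are read, so the file never has to be
    loaded into memory as a whole.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        values = RECORD.unpack(f.read(RECORD.size))
    return dict(zip(LABELS, values))

path = os.path.join(tempfile.mkdtemp(), "first_file.bin")
offsets = write_weights(path, [(0.5, 0.25, 0.125, 0.125),
                               (0.75, 0.0, 0.25, 0.0)])
weights = read_weights(path, offsets[1])
print(weights["B"])  # 0.75
```

The number of values read per position equals the number of labels, which matches the description that one weight is taken out per label.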
In the merging step S2024, the word segmentation result of step S201 is spliced according to the probabilities obtained in step S2023.
Specifically, the word segmentation result of step S201 is: "I love", "north", "capital", "day", "peace", "door". According to the probabilities obtained in step S2023, the probability of "Beijing" as one joint word is 98% and the probability of "Tian An-men" as one joint word is 95%; it is therefore determined that "north" and "capital" are spliced into "Beijing", and "day", "peace" and "door" are spliced into "Tian An-men". The word segmentation result finally obtained in step S2024 is: "I love", "Beijing", "Tian An-men".
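The merging in this example can be sketched as a greedy pass over the tokens. This is a minimal illustration: the text does not state a decision threshold or a maximum span length, so the `threshold` of 0.5 and `max_len` of 3 are assumptions, as is the function name `merge_tokens`. Merged spans are returned as tuples of the original tokens.

```python
def merge_tokens(tokens, joint_prob, threshold=0.5, max_len=3):
    """Greedily splice adjacent tokens whose joint probability exceeds `threshold`.

    `joint_prob` maps a tuple of adjacent tokens to the probability that
    they form one joint word. Longer candidates are tried first.
    """
    merged, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            span = tuple(tokens[i:i + n])
            if joint_prob.get(span, 0.0) > threshold:
                merged.append(span)
                i += n
                break
        else:
            # No multi-token candidate qualified: keep the token on its own.
            merged.append((tokens[i],))
            i += 1
    return merged

tokens = ["I love", "north", "capital", "day", "peace", "door"]
joint_prob = {
    ("north", "capital"): 0.98,       # "Beijing"
    ("day", "peace", "door"): 0.95,   # "Tian An-men"
}
result = merge_tokens(tokens, joint_prob)
print(result)
# [('I love',), ('north', 'capital'), ('day', 'peace', 'door')]
```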
According to the second embodiment of the present invention, the sequence labelling processing is carried out in external memory rather than in main memory, which reduces the memory footprint on the information processor and improves its running speed.
[3rd embodiment]
The hardware configuration of the information processor of the third embodiment of the present invention is the same as that of the information processors of the first and second embodiments. The technical solution of the third embodiment is a combination of the technical solutions of the first and second embodiments. That is, the information processor of the third embodiment includes the selecting unit, the first concatenation unit and the second concatenation unit of the first embodiment, and the external memory and the sequence labelling unit of the second embodiment.
Fig. 8 illustrates a functional block diagram of the information processor according to the third embodiment of the present invention.
As shown in Fig. 8, the information processor 3000 includes: a selecting unit 3101, which segments the text to be segmented (for example, a sentence that the user inputs via a touch screen) and obtains a word segmentation result expressed as a combination of multiple words; a first concatenation unit 3102, which splices adjacent words in the combination; a sequence labelling unit 3103, which uses a sequence labelling model to label each word in the combination after splicing by the first concatenation unit and merges the words in the combination according to the labelling result; and a second concatenation unit 3104, which splices, according to a predefined rule, the words in the combination merged by the sequence labelling unit.
The sequence labelling unit 3103 includes: a storage part 31031, which stores the emission probabilities and state probabilities of the sequence labelling model in a first file in the external memory; a calculation part 31032, which hashes the feature functions of the words in the combination and stores, in a second file and keyed by hash value, each feature function together with the storage location of the emission probability or state probability corresponding to that feature function; an extraction part 31033, which extracts, from the storage location stored by the calculation part, the probability that adjacent words in the combination form one joint word; and a merging part 31034, which is configured to splice the words in the combination according to the extracted probabilities.
The word segmentation processing method of the third embodiment includes the selection step, the first splicing step and the second splicing step of the first embodiment, with a sequence labelling step between the first splicing step and the second splicing step; this sequence labelling step is the sequence labelling step of the second embodiment.
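The order of the four steps can be sketched as a simple pipeline. Every stage below is a toy stand-in written for illustration only: the whitespace split, the pair tables and the number-plus-unit rule are assumptions and do not reproduce the selection, labelling or rule logic of the embodiments.

```python
def select(text):
    # Stand-in for the selection step: split on whitespace.
    return text.split()

def splice_pairs(pairs):
    """Build a stage that joins adjacent words listed in `pairs`."""
    def stage(words):
        out = []
        for w in words:
            if out and (out[-1], w) in pairs:
                out[-1] = out[-1] + " " + w
            else:
                out.append(w)
        return out
    return stage

def rule_splice(words, units=("kg", "km")):
    # Stand-in for the rule-based second splicing: number followed by a unit.
    out = []
    for w in words:
        if out and out[-1].isdigit() and w in units:
            out[-1] = out[-1] + " " + w
        else:
            out.append(w)
    return out

def run_pipeline(text, select_step, stages):
    """Apply the selection step, then each later stage in order."""
    words = select_step(text)
    for stage in stages:
        words = stage(words)
    return words

stages = [
    splice_pairs({("New", "York")}),        # first splicing step
    splice_pairs({("New York", "City")}),   # sequence labelling merge
    rule_splice,                            # second splicing step
]
print(run_pipeline("send 5 kg to New York City", select, stages))
# ['send', '5 kg', 'to', 'New York City']
```

Each stage takes and returns a list of words, so the sequence labelling stage of one embodiment can be slotted between the splicing stages of another without changing them.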
According to the third embodiment of the present invention, an information processor can be obtained that has a coarse segmentation granularity, occupies little memory and processes quickly.
The information processor of the present invention can achieve the following technical effect: common collocations and semantically meaningful combinations are cut out as far as possible, so that more meaningful fragments can be conveniently extracted from the word segmentation result.
Although the present invention has been described above with reference to exemplary embodiments, the above embodiments merely illustrate the technical concept and features of the present invention and are not intended to limit its scope. Any equivalent variation or modification made in accordance with the spirit of the present invention shall fall within the scope of the present invention.

Claims (22)

1. An information processor capable of word segmentation processing, characterized in that the information processor includes:
A selecting unit, which is configured to segment the text to be segmented and obtain a word segmentation result expressed as a combination of multiple words;
A first concatenation unit, which is configured to splice adjacent words in the combination;
A sequence labelling unit, which is configured to use a sequence labelling model to label each word in the combination after splicing by the first concatenation unit, and to merge the words in the combination according to the result of the sequence labelling; and
A second concatenation unit, which is configured to splice, according to a predefined rule, the words merged by the sequence labelling unit.
2. The information processor according to claim 1, wherein the predefined rule includes splicing adjacent words that may relate to an event, a date, a numeral-classifier compound or an alphabetic expression.
3. The information processor according to claim 1, wherein the sequence labelling unit includes:
An extraction part, which is configured to extract words of a predefined type from the segmented words in the combination after splicing by the first concatenation unit;
A prediction part, which is configured to predict, according to the predefined type, the word segmentation results corresponding to the extracted words;
A selection part, which is configured to select a word segmentation result from the predicted word segmentation results; and
A merging part, which is configured to merge the words in the combination according to the word segmentation result selected by the selection part.
4. The information processor according to claim 2, wherein the predefined type includes person names, place names and organization names.
5. The information processor according to claim 1, wherein the selecting unit calculates a score for each of multiple combinations according to a word segmentation strategy, and selects the highest-scoring combination from the multiple combinations.
6. The information processor according to claim 4, wherein the word segmentation strategy includes term weights and language model scores.
7. An information processor capable of word segmentation processing, the information processor including an external memory that stores a sequence labelling model, characterized in that the information processor includes:
A word segmentation unit, which is configured to segment the text to be segmented and obtain a word segmentation result expressed as a combination of multiple words; and
A sequence labelling unit, which is configured to carry out, for the word segmentation result obtained by segmenting the text and expressed as a combination of multiple words, sequence labelling processing on the words in the combination using the sequence labelling model, and to merge the words in the combination according to the result of the sequence labelling,
Wherein the sequence labelling unit carries out the sequence labelling processing in the external memory.
8. The information processor according to claim 7, wherein, when carrying out the sequence labelling processing, the sequence labelling unit calculates the address of the sequence labelling model in the external memory and obtains from that address the corresponding information of the sequence labelling model in the external memory, in order to use the sequence labelling model.
9. The information processor according to claim 7, wherein the external memory is a hard disk.
10. The information processor according to claim 7, wherein the sequence labelling unit includes:
A storage part, which is configured to store the emission probabilities and state probabilities of the sequence labelling model in a first file in the external memory;
A calculation part, which is configured to hash the feature functions of the words in the combination and to store, in a second file and keyed by hash value, each feature function together with the storage location of the emission probability or state probability corresponding to that feature function;
An extraction part, which is configured to extract, from the storage location stored by the calculation part, the probability that adjacent words in the combination form one joint word;
A merging part, which is configured to splice the words in the combination according to the extracted probabilities.
11. The information processor according to claim 10,
Wherein the calculation part obtains a main hash value, a left hash value and a right hash value of the feature function by hashing the feature function,
Wherein the storage location is stored in the second file under the main hash value, and
The left hash value and the right hash value are used to determine whether the storage location has been stored.
12. A word segmentation processing method for an information processor, the word segmentation processing method comprising the following steps:
A selection step of segmenting the text to be segmented and obtaining a word segmentation result expressed as a combination of multiple words;
A first splicing step of splicing adjacent words in the combination;
A sequence labelling step of using a sequence labelling model to label each word in the combination after splicing in the first splicing step, and merging the words in the combination according to the result of the sequence labelling; and
A second splicing step of splicing, according to a predefined rule, the words in the combination merged in the sequence labelling step.
13. The word segmentation processing method according to claim 12, wherein the predefined rule includes splicing adjacent words that may relate to an event, a date, a numeral-classifier compound or an alphabetic expression.
14. The word segmentation processing method according to claim 12, wherein the sequence labelling step includes:
An extraction step of extracting words of a predefined type from the segmented words in the combination after splicing in the first splicing step;
A prediction step of predicting, according to the predefined type, the word segmentation results corresponding to the extracted words;
A selecting step of selecting a word segmentation result from the predicted word segmentation results; and
A merging step of merging the words in the combination according to the word segmentation result selected in the selecting step.
15. The word segmentation processing method according to claim 12, wherein the predefined type includes person names, place names and organization names.
16. The word segmentation processing method according to claim 12, wherein, in the selection step, a score is calculated for each of multiple combinations according to a word segmentation strategy, and the highest-scoring combination is selected from the multiple combinations.
17. The word segmentation processing method according to claim 12, wherein the word segmentation strategy includes term weights and language model scores.
18. A word segmentation processing method for an information processor, the information processor including an external memory that stores a sequence labelling model, the word segmentation processing method comprising the following steps:
A word segmentation step of segmenting the text to be segmented and obtaining a word segmentation result expressed as a combination of multiple words;
A sequence labelling step of carrying out, for the word segmentation result obtained by segmenting the text and expressed as a combination of multiple words, sequence labelling processing on the words in the combination using the sequence labelling model, and merging the words in the combination according to the result of the sequence labelling,
Wherein, in the sequence labelling step, the sequence labelling processing is carried out in the external memory.
19. The word segmentation processing method according to claim 18, wherein, when the sequence labelling processing is carried out, the address of the sequence labelling model in the external memory is calculated and the corresponding information of the sequence labelling model in the external memory is obtained from that address, in order to use the sequence labelling model.
20. The word segmentation processing method according to claim 18, wherein the external memory is a hard disk.
21. The word segmentation processing method according to claim 18, wherein the sequence labelling step includes:
A storing step of storing the emission probabilities and state probabilities of the sequence labelling model in a first file;
A calculation step of hashing the feature functions of the words in the combination and storing, in a second file and keyed by hash value, each feature function together with the storage location of the emission probability or state probability corresponding to that feature function;
An extraction step of extracting, from the storage location stored in the calculation step, the probability that adjacent words in the combination form one joint word;
A merging step of splicing the words in the combination according to the extracted probabilities.
22. The word segmentation processing method according to claim 21, wherein, in the calculation step, a main hash value, a left hash value and a right hash value of the feature function are obtained by hashing the feature function,
Wherein the storage location is stored in the second file under the main hash value, and
The left hash value and the right hash value are used to determine whether the storage location has been stored.
CN201710505392.0A 2017-06-28 2017-06-28 Information processing unit and its participle processing method Active CN107291695B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811400632.1A CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof
CN201710505392.0A CN107291695B (en) 2017-06-28 2017-06-28 Information processing unit and its participle processing method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710505392.0A CN107291695B (en) 2017-06-28 2017-06-28 Information processing unit and its participle processing method

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN201811400632.1A Division CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof

Publications (2)

Publication Number Publication Date
CN107291695A (en) 2017-10-24
CN107291695B (en) 2019-01-11

Family

ID=60098659

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201710505392.0A Active CN107291695B (en) 2017-06-28 2017-06-28 Information processing unit and its participle processing method
CN201811400632.1A Active CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201811400632.1A Active CN109492228B (en) 2017-06-28 2017-06-28 Information processing apparatus and word segmentation processing method thereof

Country Status (1)

Country Link
CN (2) CN107291695B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766539A (en) * 2018-11-30 2019-05-17 平安科技(深圳)有限公司 Standard dictionary segmenting method, device, equipment and computer readable storage medium
CN115497465A (en) * 2022-09-06 2022-12-20 平安银行股份有限公司 Voice interaction method and device, electronic equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339250B (en) * 2020-02-20 2023-08-18 北京百度网讯科技有限公司 Mining method for new category labels, electronic equipment and computer readable medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060155782A1 (en) * 2005-01-11 2006-07-13 Viktors Berstis Systems, methods, and media for aggregating electronic document usage information
CN103309852A (en) * 2013-06-14 2013-09-18 瑞达信息安全产业股份有限公司 Method for discovering compound words in specific field based on statistics and rules
CN103646088A (en) * 2013-12-13 2014-03-19 合肥工业大学 Product comment fine-grained emotional element extraction method based on CRFs and SVM
CN103984735A (en) * 2014-05-21 2014-08-13 北京京东尚科信息技术有限公司 Method and device for generating recommended delivery place name
CN104469002A (en) * 2014-12-02 2015-03-25 科大讯飞股份有限公司 Mobile phone contact person determination method and device
CN105095391A (en) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 Device and method for identifying organization name by word segmentation program
CN105718586A (en) * 2016-01-26 2016-06-29 中国人民解放军国防科学技术大学 Word division method and device
CN106021229A (en) * 2016-05-19 2016-10-12 苏州大学 Chinese event co-reference resolution method and system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7393665B2 (en) * 2005-02-10 2008-07-01 Population Genetics Technologies Ltd Methods and compositions for tagging and identifying polynucleotides
CN102360383B (en) * 2011-10-15 2013-07-31 西安交通大学 Method for extracting text-oriented field term and term relationship
CN103823862B (en) * 2014-02-24 2017-02-15 西安交通大学 Cross-linguistic electronic text plagiarism detection system and detection method

Also Published As

Publication number Publication date
CN109492228B (en) 2020-01-14
CN107291695B (en) 2019-01-11
CN109492228A (en) 2019-03-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information
Inventor after: Wang Zhuoran, Qi Chao, Ma Yuchi, Hou Xinglin
Inventor before: Hou Xinglin, Qi Chao, Wang Zhuoran, Ma Yuchi
GR01 Patent grant
TR01 Transfer of patent right
Effective date of registration: 20200727
Address after: 518000 Nanshan District science and technology zone, Guangdong, Zhejiang Province, science and technology in the Tencent Building on the 1st floor of the 35 layer
Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
Address before: 100029, Beijing, Chaoyang District new East Street, building No. 2, -3 to 25, 101, 8, 804 rooms
Patentee before: Tricorn (Beijing) Technology Co.,Ltd.