CN104462058B - Character string identification method and device - Google Patents
- Publication number: CN104462058B (application CN201410579684.5A)
- Authority: CN (China)
- Prior art keywords: substring, character string, type, word, identification
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Landscapes: Machine Translation (AREA)
Abstract
The present invention relates to a character string identification method and device. In one embodiment, the method comprises the following steps: obtaining a character string composed of substrings of multiple types; segmenting the character string according to the substring types of the multiple types of substrings and their combinations, thereby dividing the character string into at least one substring; judging whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs; if the substring is judged not to be a vocabulary word, performing identification processing on the at least one substring; and synthesizing all identified substrings into continuous speech. The method and device of the embodiments of the present invention can accurately identify the meaning of a character string.
Description
Technical Field
The present invention relates to the field of computer technology, and in particular to a character string identification method and device.
Background
With the development of computer technology, speech synthesis has emerged. Speech synthesis converts arbitrary text into standard, fluent speech in real time. In terms of content, storage, transmission, convenience, and timeliness, this makes it easier for users to send and read messages. However, many character strings have several possible pronunciations, and different pronunciations carry different meanings; only the correct pronunciation conveys the intended meaning once the speech has been synthesized. Accurately identifying the meaning of a character string is therefore particularly important in speech synthesis.
Summary of the Invention
In view of this, the present invention provides a character string identification method and device that can accurately identify the meaning of a character string.
A character string identification method, comprising the following steps:
obtaining a character string, the character string being composed of substrings of multiple types;
segmenting the character string according to the substring types of the multiple types of substrings and their combinations, thereby dividing the character string into at least one substring;
judging whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
if the substring is judged not to be a vocabulary word, performing identification processing on the at least one substring; and
synthesizing all identified substrings into continuous speech.
A character string identification device, comprising the following modules:
an acquisition module, for obtaining a character string, the character string being composed of substrings of multiple types;
a word segmentation module, for segmenting the character string according to the substring types of the multiple types of substrings and their combinations, thereby dividing the character string into at least one substring;
a judgment module, for judging whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
a processing module, for performing identification processing on the at least one substring if the substring is judged not to be a vocabulary word; and
a synthesis module, for synthesizing all identified substrings into continuous speech.
According to the method and device of the above embodiments, the character string is segmented according to the types of its substrings and then identified substring by substring, which improves the accuracy of character string identification.
To make the above and other objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
Fig. 1 is a structural block diagram of an electronic device.
Fig. 2 is a flow chart of the character string identification method provided by the first embodiment.
Fig. 3 is a flow chart of the character string identification method provided by the second embodiment.
Fig. 4 is a flow chart of the character string identification method provided by the third embodiment.
Fig. 5 is a flow chart of the character string identification method provided by the fourth embodiment.
Fig. 6 is a flow chart of the character string identification method provided by the fifth embodiment.
Fig. 7 is a structural block diagram of the character string identification device provided by the sixth embodiment.
Fig. 8 is a structural block diagram of the character string identification device provided by the seventh embodiment.
Fig. 9 is a structural block diagram of the character string identification device provided by the eighth embodiment.
Fig. 10 is a structural block diagram of the character string identification device provided by the ninth embodiment.
Fig. 11 is a structural block diagram of the character string identification device provided by the tenth embodiment.
Detailed Description
To further explain the technical means adopted by the present invention to achieve its intended purpose, and their effects, the specific implementations, structures, features, and effects of the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
The character string identification method and device of the embodiments of the present invention can be used to identify character strings during speech synthesis and, specifically, can be used in an electronic device.
Fig. 1 is a structural block diagram of the above electronic device. As shown in Fig. 1, the electronic device 100 includes one or more processors 102 (only one is shown in the figure), a memory 104, an RF (Radio Frequency) module 106, a network module 108, an audio module 110, an input module 112, and a display module 114. Those of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the electronic device 100. For example, the electronic device 100 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Specific examples of the electronic device 100 include, but are not limited to, handheld computers, mobile phones, media players, in-vehicle devices, personal digital assistants, and various combinations of the foregoing devices.
Those of ordinary skill in the art will understand that, relative to the processor 102, all other components are peripherals, and the processor 102 is coupled to these peripherals through multiple peripheral interfaces 124. The peripheral interfaces 124 may be implemented based on the following standards: Universal Asynchronous Receiver/Transmitter (UART), General Purpose Input Output (GPIO), Serial Peripheral Interface (SPI), and Inter-Integrated Circuit (I2C), but are not limited to these standards. In some examples, the peripheral interface 124 may include only a bus; in other examples, the peripheral interface 124 may also include other elements, such as one or more controllers, for example a display controller for connecting a liquid crystal display panel or a storage controller 122 for connecting the memory. In addition, these controllers may also be separated from the peripheral interface 124 and integrated in the processor 102 or in the corresponding peripheral.
The memory 104 can be used to store software programs and modules, such as the program instructions/modules corresponding to the character string identification method and device in the embodiments of the present invention. By running the software programs and modules stored in the memory 104, the processor 102 executes various functional applications and data processing, that is, implements the above character string identification method. The memory 104 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102; such remote memory may be connected to the electronic device 100 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The RF module 106 is used to receive and transmit electromagnetic waves and to convert between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices. The RF module 106 may include various existing circuit elements for performing these functions, for example an antenna, an RF transceiver, a digital signal processor, an encryption/decryption chip, a subscriber identity module (SIM) card, memory, and so on. The RF module 106 can communicate with various networks such as the Internet, an intranet, or a wireless network, or communicate with other devices through a wireless network. The wireless network may include a cellular telephone network, a wireless local area network, or a metropolitan area network. The wireless network may use various communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communication (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Wireless Fidelity (WiFi) (such as the Institute of Electrical and Electronics Engineers standards IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n), Voice over Internet Protocol (VoIP), Worldwide Interoperability for Microwave Access (Wi-Max), other protocols for mail, instant messaging, and short messages, and any other suitable communication protocol, possibly including protocols that have not yet been developed.
The network module 108 is used to receive and send network signals. The network signals may include wireless signals or wired signals. In one example, the network signal is a WiFi signal. Because the operating frequency of WiFi also lies within the radio frequency range, the network module may then have a hardware structure similar to that of the RF module 106, that is, it may include elements such as an antenna, an RF transceiver, a digital signal processor, and an encryption/decryption chip. In another example, the network signal is a wired network signal. In this case, the network module 108 may include elements such as a processor, random access memory, a converter, and a crystal oscillator.
The audio circuit 110, a speaker, an audio jack, and a microphone together provide an audio interface between the user and the electronic device 100. Specifically, the audio circuit 110 receives sound data from the processor 102, converts the sound data into an electrical signal, and transmits the electrical signal to the speaker. The speaker 101 converts the electrical signal into sound waves audible to the human ear. The audio circuit 110 also receives electrical signals from the microphone, converts them into sound data, and transmits the sound data to the processor 102 for further processing. Audio data can be obtained from the memory 104 or through the RF module 106 or the network module 108. In addition, audio data can also be stored in the memory 104 or sent through the RF module 106 and the network module 108.
The input unit 112 can be used to receive input character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, the input unit 112 may include buttons and a touch surface. The buttons may include, for example, character keys for entering characters and control buttons for triggering control functions. Examples of control buttons include a "return to home screen" button, a power on/off button, a camera button, and so on. The touch surface collects the user's touch operations on or near it (such as operations performed by the user on or near the touch surface with a finger, a stylus, or any other suitable object or accessory) and drives the corresponding connected device according to a preset program. Optionally, the touch surface may include a touch detection device and a touch controller. The touch detection device detects the position of the user's touch and the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 102, and can receive and execute commands sent by the processor 102. The touch surface can be implemented using various technologies such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch surface, the input unit 112 may also include other input devices, including but not limited to one or more of a physical keyboard, a trackball, a mouse, a joystick, and the like.
The display module 114 is used to display information entered by the user, information provided to the user, and the various graphical user interfaces of the electronic device 100. These graphical user interfaces may be composed of graphics, text, icons, video, and any combination thereof. In one example, the display module 114 includes a display panel. The display panel may be, for example, a liquid crystal display (LCD) panel, an organic light-emitting diode (OLED) display panel, an electrophoretic display (EPD) panel, and so on. Further, the touch surface may be arranged on the display panel so as to form a single unit with it. In other embodiments, the display module 114 may also include other types of display devices, for example a projection display device. Compared with an ordinary display panel, a projection display device also needs to include components for projection, such as a lens group.
First embodiment
Fig. 2 is a flow chart of the character string identification method provided by this embodiment. As shown in Fig. 2, the method of this embodiment includes the following steps:
Step S101: obtain a character string, the character string being composed of substrings of multiple types.
The character string may be one entered by the user in real time, or one that already exists in the current electronic device. In one example, the method of this embodiment is used in an instant messaging tool in which a first user terminal and a second user terminal send character strings to each other; the obtained character string may be one received in the current interface or one in the messaging tool's history. In another example, the method of this embodiment is used in translation software, and the character string may be one that the electronic device receives as user input.
It can be understood that character strings come in many types, for example Arabic, Chinese, English, digits, symbols, and any combination of these. Each of the multiple string types is also matched to a corresponding configuration file, which records, for each pre-stored string type, the target types it may correspond to. For example, the type "Number2Punction2Number" (digits plus symbol plus digits) can represent a decimal, a telephone number, a numerical value, and so on, as in "2.13" and "010-88888888". The corresponding configuration is "Number2Punction2Number: Decimal, Telephone". Further, the configuration file can be modified and extended with new meanings for a string type. For example, the character string "3,247" belongs to the "Number2Punction2Number" type but denotes a numerical value, which is not among the target types set in the configuration file. The target type "Numerical" can then be added to the configuration file for that string type.
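The configuration mapping described above can be sketched as a small lookup table. This is only an illustration under stated assumptions: the in-memory dictionary form and the helper name `add_target_type` are hypothetical, and the patent does not specify how the configuration file is stored.

```python
# Hypothetical in-memory form of the configuration file: each combined
# string type maps to the target types it may denote.
TYPE_CONFIG = {
    "Number2Punction2Number": ["Decimal", "Telephone"],
}


def add_target_type(config, string_type, target_type):
    """Register an extra target type for a string type, avoiding duplicates."""
    targets = config.setdefault(string_type, [])
    if target_type not in targets:
        targets.append(target_type)
    return targets


# "3,247" has the Number2Punction2Number shape but denotes a numerical value,
# so the configuration is extended with the target type "Numerical".
add_target_type(TYPE_CONFIG, "Number2Punction2Number", "Numerical")
```

Keeping the targets in a list preserves the order in which they were configured, which may be convenient if earlier entries are treated as more likely readings.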
Step S102: segment the character string according to the substring types of the multiple types of substrings and their combinations, dividing the character string into at least one substring.
In one embodiment, character strings are divided into four major classes: English (English), Chinese characters (Kanji), symbols (Punctuation), and digits (Number). The four classes can also be combined arbitrarily, for example: English2Number, the type of English followed by digits, with type length 2, such as "CA1419"; Number2Punctuation2Number, the type of digits followed by a symbol followed by digits, with type length 3, such as "010-88888888"; and Number2Kanji, the type of digits followed by Chinese characters, with type length 2, such as "2014" followed by the Chinese character for "year". Segmentation can be performed according to English (English), Chinese characters (Kanji), symbols (Punctuation), digits (Number), and their combinations.
In one example, the sentence "China Mobile (0941) announced its fiscal year 2005 operating results in Hong Kong on March 16" is segmented as "China / Mobile / ( / 0941 / ) / March / 16th / in / Hong Kong / announced / 2005 / fiscal year / operating / results". Further, each substring is also tagged with its type during segmentation. For example, "China" is tagged "Kanji" and "March" is tagged "Number2Kanji". The type tag of a substring can serve as reference information for the preceding and following substrings during identification processing.
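A minimal sketch of how a type tag such as "Number2Kanji" could be derived from a substring is given below. The functions `char_class` and `type_label` are hypothetical names, and the character ranges used to recognise each class are assumptions; the patent does not say how the classes are detected, and real word segmentation of Chinese text would additionally need a lexicon.

```python
import itertools


def char_class(ch):
    """Map one character to one of the four basic classes."""
    if ch.isascii() and ch.isalpha():
        return "English"
    if ch.isdigit():
        return "Number"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block (assumption)
        return "Kanji"
    return "Punctuation"  # everything else, including spaces


def type_label(substring):
    """Join the classes of consecutive character runs with '2'."""
    runs = [cls for cls, _ in itertools.groupby(substring, key=char_class)]
    return "2".join(runs)


def type_length(substring):
    """Number of basic classes in the combined type, e.g. 3 for '010-88888888'."""
    return sum(1 for _ in itertools.groupby(substring, key=char_class))
```

For instance, `type_label("CA1419")` yields `"English2Number"` and `type_length("010-88888888")` yields `3`, matching the type lengths given in the text.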
Step S103: judge whether the at least one substring is a vocabulary word.
A vocabulary word is a word that has a unique meaning in the language to which the substring belongs, and therefore has only one pronunciation when output as speech. For example, if the substring is "China" written in Chinese, it has a unique meaning in Chinese, so the character string can be judged to be a vocabulary word. Likewise, the substring "China" has a unique meaning in English, so "China" can also be judged to be a vocabulary word.
In the embodiments of the present invention, based on the four classes of character strings described above, if the substring is a Chinese or English word, its meaning can generally be identified directly without ambiguity. For example, "China" can be read out directly, in order, during speech synthesis. It is therefore judged whether the substring is English or Chinese; a Chinese or English word can be read directly, with no further identification of its meaning. If the substring is not a Chinese or English vocabulary word, its ambiguity needs to be resolved. For example, "2001" followed by the character for "year" can be read digit by digit as a year ("two zero zero one") or as a cardinal number ("two thousand and one").
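The routing decision of step S103 can be sketched as follows. The lexicon lookup is an assumption made for illustration: the patent only states that Chinese and English vocabulary words are read directly, not how membership is tested, and the names `is_vocabulary_word` and `route` are hypothetical.

```python
def is_vocabulary_word(substring, lexicon):
    """Hypothetical check: a substring counts as a vocabulary word when it
    appears in a lexicon of unambiguous Chinese and English words."""
    return substring.lower() in lexicon


def route(substrings, lexicon):
    """Split substrings into those read directly and those that still
    need identification processing (step S104)."""
    direct, to_identify = [], []
    for s in substrings:
        if is_vocabulary_word(s, lexicon):
            direct.append(s)
        else:
            to_identify.append(s)
    return direct, to_identify
```

With a lexicon containing "china" and "mobile", the substrings `["China", "2001", "Mobile"]` would be split into `["China", "Mobile"]` for direct reading and `["2001"]` for further identification.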
Step S104: if the substring is judged not to be a vocabulary word, perform identification processing on the at least one substring.
In one embodiment, the method of this embodiment is used for speech synthesis. Speech synthesis, also known as text-to-speech (Text to Speech) technology, converts arbitrary text into standard, fluent speech in real time, which is equivalent to fitting the machine with an artificial mouth. To synthesize speech, besides relying on various rules, including semantic, lexical, and phonetic rules, the content of the text must also be well understood, which involves natural language understanding.
Character strings of the four major classes and their combinations may be ambiguous, since a string of a single type can represent several kinds of content. The meaning of each substring in the current character string therefore needs to be identified.
For example, "120" can denote the ambulance telephone number, read digit by digit ("one two zero"), or a numerical value, read as "one hundred and twenty". It can be identified from the meaning of the preceding and following substrings. In one example, in "dial the 120 ambulance telephone", "120" can be judged to be a telephone number from the following character string "ambulance telephone".
The "Number2Punction2Number" type can represent a decimal, a telephone number, a numerical value, and so on. For example, in "2014 / China / Mobile / revenue / 3,247 / hundred million yuan / RMB", "3,247" can be judged to be a numerical value from the adjacent character string "hundred million yuan". As another example, "010-88888888" is also of the "Number2Punction2Number" type and denotes a telephone number. As a further example, "2014" in the above example can denote either a count of 2,014 years or the year 2014. A matching model can then be built from the preceding and following character string information, and the output of the model is taken as the final identification result. In one example, a conditional random field (CRF) model can be used. A conditional random field is an undirected graphical model in which the vertices represent random variables and the edges between vertices represent dependencies between random variables. In a conditional random field, the distribution of the random variable Y is a conditional probability, conditioned on the observed random variable X. In principle, the graph layout of a conditional random field can be arbitrary; the most common layout is a linear chain. The "2014" in the above example can be judged to be the year 2014, rather than a count of 2,014 years, from the subsequent character strings "China / Mobile / revenue".
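For reference, the linear-chain layout mentioned above is usually written in the following standard textbook form. It is not a formula taken from the patent: X stands for the sequence of segmented substrings, Y for the sequence of target types assigned to them, and the feature functions f_k and weights λ_k are generic symbols introduced here for illustration.

```latex
p(Y \mid X) = \frac{1}{Z(X)}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \Big),
\qquad
Z(X) = \sum_{Y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \Big)
```

Under this form, a feature function could, for example, fire when the current substring is of type Number and the next substring is "China", which would raise the score of labelling "2014" as a year.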
For example, a number followed by a percent sign has an exact meaning: it denotes a percentage. Such strings are identified by matching a general rule.
For example, character strings such as "jpg" and "gif" denote picture types. A default rule can then be set: when "BMP", "JPG", "GIF", or "PNG" appears, it is identified as a picture format and can be read out directly, letter by letter and digit by digit, in order.
Step S105: synthesize all identified substrings into continuous speech.
The identified character string is converted into intelligible, fluent spoken output.
Further, the method of this embodiment may also synthesize the identified character string into speech.
According to the method of this embodiment, the character string to be identified is segmented and the resulting substrings are then identified separately, which improves identification accuracy.
Second embodiment
This embodiment provides a character string identification method. It is similar to the first embodiment; the difference, as shown in Fig. 3, is that step S104 specifically includes:
Step S201: identify the substring according to the content of the character strings preceding and following it.
Step S202: synthesize the identified substring into speech.
The method of this embodiment can identify a substring from the substrings before or after it. For some character strings, the preceding and following substrings leave no ambiguity, so a result can be obtained directly.
For example, "120" can denote the ambulance telephone number, read digit by digit ("one two zero"), or a numerical value, read as "one hundred and twenty". It can be identified from the meaning of the preceding and following substrings. In one example, in "dial the 120 ambulance telephone", "120" can be judged to be a telephone number from the following character string "ambulance telephone". As another example, in "2014 / China / Mobile / revenue / 3,247 / hundred million yuan / RMB", "3,247" can be judged to be a numerical value from the adjacent character string "hundred million yuan". Once an accurate result has been identified, the currently identified substring is synthesized into speech.
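The neighbour-based identification of step S201 can be sketched with a small cue table. The cue words, the English glosses of the tokens, and the function name `identify_by_context` are all assumptions made for illustration; the patent does not list the cues it uses.

```python
# Hypothetical cue words (lower case); a real system would load these
# from configuration rather than hard-code them.
CONTEXT_CUES = {
    "Telephone": {"dial", "telephone"},
    "Numerical": {"yuan", "rmb"},
}


def identify_by_context(tokens, index, default="Unknown"):
    """Pick a target type for tokens[index] from its immediate neighbours."""
    neighbours = tokens[max(0, index - 1):index] + tokens[index + 1:index + 2]
    for target, cues in CONTEXT_CUES.items():
        for neighbour in neighbours:
            if any(cue in neighbour.lower() for cue in cues):
                return target
    return default
```

Only the directly adjacent substrings are inspected here; when they carry no cue the function returns the default, which is the situation where a statistical matching model would be the natural fallback.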
According to the method for the present embodiment, when processing is identified to substring, pass through the information of front and back substring
The meaning for identifying substring, avoids the character string of more meanings from interfering, realizes higher accuracy rate.
Third embodiment
This embodiment provides a character string identification method. It is similar to the first embodiment; the difference, as shown in Fig. 4, is that step S104 specifically includes:
Step S301: build a character string matching model and identify the meaning of the substring according to the matching model.
Step S302: synthesize the identified substring into speech.
Each of the multiple substring types is also matched to a corresponding configuration file, which records, for each pre-stored string type, the target types it may correspond to. For example, the type "Number2Punction2Number" (digits plus symbol plus digits) can represent a decimal, a telephone number, a numerical value, and so on. The corresponding configuration is "Number2Punction2Number: Decimal, Telephone, Numerical". When a character string is identified, the configuration file matched to the substring's type can be used for identification.
For example, in "2014 / China / Mobile / revenue / 3,247 / hundred million yuan / RMB", "3,247" can be judged to be a numerical value from the adjacent character string "hundred million yuan". As another example, "010-88888888" is also of the "Number2Punction2Number" type and denotes a telephone number. As a further example, "2014" in the above example can denote either a count of 2,014 years or the year 2014. A matching model can then be built from the preceding and following character string information, and the output of the model is taken as the final identification result. In one example, a conditional random field (CRF) model can be used. A conditional random field is an undirected graphical model in which the vertices represent random variables and the edges between vertices represent dependencies between random variables. In a conditional random field, the distribution of the random variable Y is a conditional probability, conditioned on the observed random variable X. In principle, the graph layout of a conditional random field can be arbitrary; the most common layout is a linear chain. The "2014" in the above example can be judged to be the year 2014, rather than a count of 2,014 years, from the subsequent character strings "China / Mobile / revenue". It can be understood that the matching model may also be another statistical model, such as a hidden Markov model (HMM), a conditional random field (CRF) model, or a maximum entropy (ME) model. Finally, the identified character string is synthesized into speech.
According to the method of this embodiment, some character strings may still be ambiguous when only the adjacent information is considered; by building a matching model and comparing the contextual character string information, the meaning of the current substring is identified, which further improves the accuracy of character string identification.
Fourth embodiment
This embodiment provides a character string identification method. It is similar to the first embodiment; the difference, as shown in Fig. 5, is that step S104 specifically includes:
Step S401: identify the substring directly according to its meaning.
Step S402: synthesize the identified substring into speech.
For example, a number followed by a percent sign has an exact meaning: it denotes a percentage. Such strings are identified by matching a general rule.
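A general rule of this kind can be sketched with a regular expression. The pattern and the function name `identify_percentage` are assumptions for illustration; the patent does not give the rule's exact form.

```python
import re

# A number (optionally with a decimal part) immediately followed by "%".
PERCENT = re.compile(r"^(\d+(?:\.\d+)?)%$")


def identify_percentage(substring):
    """Return ('Percentage', number_text) for strings such as '12.5%', else None."""
    match = PERCENT.match(substring)
    if match:
        return ("Percentage", match.group(1))
    return None
```

Because the rule needs no context, it can be applied before any context-based or model-based identification, which is where the saving in processing resources comes from.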
According to the method of this embodiment, character strings with a direct, clear meaning are identified directly, which saves processing resources while also achieving high accuracy.
Fifth embodiment
This embodiment provides a character string identification method. It is similar to the first embodiment; the difference, as shown in Fig. 6, is that step S104 specifically includes:
Step S501: identify the recognizable character strings in the substring according to a default type.
Step S502: synthesize the identified substring into speech.
Some character strings have a corresponding default meaning, so a default recognition rule can be set for them.
For example, character strings such as "jpg" and "gif" denote picture types. A default rule can then be set: when "BMP", "JPG", "GIF", or "PNG" appears, it is identified as a picture format and can be read out directly, letter by letter and digit by digit, in order. When the speech is synthesized, the sounds of the letters and digits in the character string are synthesized directly in order.
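The default rule for picture formats can be sketched as a set lookup that returns the characters to be read one by one. The function name `spell_out_if_picture_format` is hypothetical, and the case-insensitive matching is an assumption based on the text mentioning both "jpg" and "JPG".

```python
PICTURE_FORMATS = {"BMP", "JPG", "GIF", "PNG"}


def spell_out_if_picture_format(substring):
    """Return the characters to read in order, or None if no default rule applies."""
    upper = substring.upper()
    if upper in PICTURE_FORMATS:
        return list(upper)
    return None
```

Further default rules could be added by extending the set or by keeping one set per default type.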
According to the method for the present embodiment, part special string can be directly identified according to the rule of acquiescence, it can
Special rules is defined, the recognition accuracy of character string is improved.
Sixth embodiment
The present embodiment provides a kind of character string identification devices, as shown in fig. 7, the device of the present embodiment includes:Acquisition module
601, word-dividing mode 602, judgment module 603, processing module 604 and synthesis module 605.
Acquisition module 601, for obtaining character string, the character string is made of multiple types substring.
It can be appreciated that there are many types of character string, for example Arabic, Chinese, English, digits, symbols, and any combination thereof. Each of the multiple character string types is also matched to a corresponding configuration file, which records, for a pre-stored character string type, the target types to which it may correspond. For example, the digit-plus-symbol-plus-digit type "Number2Punction2Number" can represent a decimal, a telephone number, a numerical value, and so on, for example "2.13" or "010-88888888". The corresponding configuration is: "Number2Punction2Number:Decimal, Telephone". Further, the configuration file can be modified and extended to change the meanings defined for a character string. For example, the character string "3,247" belongs to the above digit-plus-symbol-plus-digit type "Number2Punction2Number", but it does not belong to any target type set in the configuration file; it is a numerical value. The target type "Numerical" can then be added to the configuration file for this character string type.
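The configuration mechanism described above can be illustrated with a short sketch. The in-memory dictionary, the one-line "Type:Target, Target" format, and the helper names `parse_config_line` and `add_target_type` are assumptions made for illustration; the text does not specify the file format beyond the single example entry. The key "Number2Punction2Number" is copied verbatim from that entry.

```python
# Assumed in-memory form of the configuration file: string type -> target types.
config = {
    "Number2Punction2Number": ["Decimal", "Telephone"],
}


def parse_config_line(line):
    """Parse one 'Type:Target1, Target2' line into (type_name, [targets])."""
    type_name, _, targets = line.partition(":")
    return type_name.strip(), [t.strip() for t in targets.split(",") if t.strip()]


def add_target_type(cfg, type_name, target):
    """Add a target type to a string type, creating the entry if it is missing."""
    targets = cfg.setdefault(type_name, [])
    if target not in targets:
        targets.append(target)
    return cfg


if __name__ == "__main__":
    # "3,247" is a numerical value, so the new target type is registered.
    add_target_type(config, "Number2Punction2Number", "Numerical")
    print(config)
```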
Word segmentation module 602, configured to segment the character string according to the substring types of the multiple types of substrings and combinations thereof, dividing the character string into at least one substring.
In one embodiment, character strings are divided into four major classes: English (English), Chinese characters (Kanji), symbols (Punctuation), and digits (Number). These four classes can also be combined arbitrarily, for example: English2Number, the English-plus-digit type, with type length 2, such as "CA1419"; Number2Punctuation2Number, the digit-plus-symbol-plus-digit type, with type length 3, such as "010-88888888"; and Number2Kanji, the digit-plus-Chinese-character type, with type length 2, such as "2014年" (the year 2014). Segmentation can be performed according to English (English), Chinese characters (Kanji), symbols (Punctuation), digits (Number), and combinations thereof.
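The type names above can be derived mechanically from the characters of a substring. The following is a minimal sketch under stated assumptions: the functions `char_class` and `type_tag` are hypothetical, Chinese characters are detected with the basic CJK Unified Ideographs range only, and every character that is not an ASCII letter, an ASCII digit, or a Chinese character is treated as Punctuation.

```python
def char_class(ch):
    """Classify one character into one of the four classes named in the text."""
    if ch.isascii() and ch.isalpha():
        return "English"
    if ch.isascii() and ch.isdigit():
        return "Number"
    if "\u4e00" <= ch <= "\u9fff":  # assumption: basic CJK block only
        return "Kanji"
    return "Punctuation"


def type_tag(substring):
    """Join the classes of consecutive character runs with '2', e.g. 'English2Number'."""
    classes = []
    for ch in substring:
        cls = char_class(ch)
        if not classes or classes[-1] != cls:
            classes.append(cls)
    return "2".join(classes)


if __name__ == "__main__":
    for s in ("CA1419", "010-88888888", "2014年"):
        print(s, type_tag(s))
```

In this sketch the type length mentioned in the text is simply the number of class names joined in the tag.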
In one example, the sentence "China Mobile (0941) announced its fiscal year 2005 operating results in Hong Kong on March 16" is segmented as "China / Mobile / ( / 0941 / ) / March / 16th / in / Hong Kong / announced / 2005 / fiscal year / operating / results". Further, each substring is also tagged with a part of speech during segmentation. For example, "China" is tagged "Kanji", and "March" (a digit followed by the Chinese character for "month") is tagged "Number2Kanji". The part-of-speech tag of a substring can be used during identification processing as reference information about the preceding and following substrings.
Judgment module 603, configured to judge whether the at least one substring is a vocabulary word.
A vocabulary word is a word that has a unique meaning in the language to which the substring belongs.
In embodiments of the present invention, based on the four classes of character string described above, it can be appreciated that if the substring is a Chinese or English word, its meaning can be identified directly and, under normal circumstances, there is no ambiguity. For example, "China" can be read out directly, in order, during speech synthesis. It is judged whether the substring is English or Chinese; if it is a Chinese or English word, it can be read directly and its meaning need not be identified again. If it is not a Chinese or English word, the ambiguity must be resolved. For example, the year expression "2001年" can be read either as a quantity ("two thousand and one") or digit by digit ("two zero zero one").
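The judgment described above can be sketched as a character-class check. This is a deliberate simplification: the text defines a vocabulary word by its unique meaning, which in practice would presumably involve a lexicon, whereas the hypothetical functions `is_vocabulary_word` and `route` below only test whether a substring is purely English letters or purely Chinese characters.

```python
def is_vocabulary_word(substring):
    """True if the substring is entirely ASCII letters or entirely Chinese characters."""
    if not substring:
        return False
    all_english = all(ch.isascii() and ch.isalpha() for ch in substring)
    all_chinese = all("\u4e00" <= ch <= "\u9fff" for ch in substring)
    return all_english or all_chinese


def route(substrings):
    """Pair each substring with the path it would take after the judgment step."""
    return [
        (s, "read_directly" if is_vocabulary_word(s) else "needs_identification")
        for s in substrings
    ]


if __name__ == "__main__":
    print(route(["中国", "China", "2001年", "0941"]))
```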
Processing module 604, configured to perform identification processing on the at least one substring if the substring is judged not to be a vocabulary word.
Substrings that may be ambiguous are identified so that an accurate result is obtained.
Synthesis module 605, configured to synthesize all identified substrings into continuous speech. The identified character string is thereby converted into fluent spoken output that a listener can understand.
According to the device of the present embodiment, the character string to be identified is segmented and each substring is identified separately, which improves the accuracy of identification.
Seventh embodiment
The present embodiment provides a character string identification device. It is similar to the sixth embodiment; the difference is that, as shown in Fig. 8, the device further includes:
First recognition unit 6041, configured to identify the substring according to the content of the character strings preceding and following the substring;
Speech synthesis unit 6042, configured to synthesize the identified substring into speech.
For other details of the device of the present embodiment, reference may be made to the second embodiment; they are not repeated here.
According to the device of the present embodiment, when a substring is identified, its meaning is determined from the information of the preceding and following substrings, which avoids interference from character strings with multiple meanings and achieves a higher accuracy rate.
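Identification from neighbouring substrings can be illustrated with a small rule function. Everything in this sketch is an assumption made for illustration: the function `identify_number`, the labels "year", "code", and "cardinal", and the Chinese cue words for "year" and "fiscal year" are not taken from the text, which does not list concrete rules.

```python
# Assumed cue words: the Chinese characters for "year" and "fiscal year".
YEAR_MARKERS = ("年", "财年")
# Both ASCII and full-width parentheses are accepted.
OPEN_PARENS = ("(", "（")
CLOSE_PARENS = (")", "）")


def identify_number(prev_sub, current, next_sub):
    """Label a digit substring using the substrings immediately before and after it."""
    if not current.isdigit():
        return "unknown"
    if next_sub and next_sub.startswith(YEAR_MARKERS):
        return "year"  # would be read digit by digit
    if prev_sub in OPEN_PARENS and next_sub in CLOSE_PARENS:
        return "code"  # hypothetical label for a bracketed number
    return "cardinal"  # would be read as a quantity


if __name__ == "__main__":
    print(identify_number("公布", "2005", "财年"))
    print(identify_number("(", "0941", ")"))
```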
Eighth embodiment
The present embodiment provides a character string identification device. It is similar to the seventh embodiment; the difference is that, as shown in Fig. 9, the device further includes:
Second recognition unit 6043, configured to establish a character string matching model and identify the meaning of the substring according to the matching model.
Speech synthesis unit 6042, configured to synthesize the identified substring into speech after the substring in the character string has been identified.
For other details of the device of the present embodiment, reference may be made to the third embodiment; they are not repeated here.
According to the device of the present embodiment, some character strings may still be ambiguous when only the preceding and following information is used. By establishing a matching model and comparing the contextual character string information, the meaning of the current substring is identified, which further improves the accuracy of character string identification.
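One possible reading of a matching model is a table of context patterns from which the most specific match is selected. The sketch below is a toy stand-in, not the model of the embodiment: the pattern table `PATTERNS`, the function `match_meaning`, the meaning labels, and the specificity score are all hypothetical.

```python
# Hypothetical patterns: (previous substring, type tag of current, next substring) -> meaning.
# None acts as a wildcard for the previous or next substring.
PATTERNS = [
    (("(", "Number", ")"), "code"),
    ((None, "Number", "财年"), "year"),
    ((None, "Number", None), "cardinal"),
    ((None, "Number2Punctuation2Number", None), "telephone"),
]


def match_meaning(prev_sub, tag, next_sub, patterns=PATTERNS):
    """Return the meaning of the most specific matching pattern, or None."""
    best, best_score = None, -1
    for (p_prev, p_tag, p_next), meaning in patterns:
        if p_tag != tag:
            continue
        if p_prev is not None and p_prev != prev_sub:
            continue
        if p_next is not None and p_next != next_sub:
            continue
        # Specificity: number of context slots the pattern actually constrains.
        score = (p_prev is not None) + (p_next is not None)
        if score > best_score:
            best, best_score = meaning, score
    return best


if __name__ == "__main__":
    print(match_meaning("(", "Number", ")"))
    print(match_meaning("公布", "Number", "财年"))
```

Preferring the pattern that constrains more context slots lets a generic fallback such as "cardinal" coexist with narrower rules in the same table.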
Ninth embodiment
The present embodiment provides a character string identification device. It is similar to the seventh embodiment; the difference is that, as shown in Fig. 10, the device further includes:
Third recognition unit 6044, configured to identify the substring directly according to its meaning.
Speech synthesis unit 6042, configured to synthesize the identified substring into speech.
For other details of the device of the present embodiment, reference may be made to the fourth embodiment; they are not repeated here.
According to the device of the present embodiment, character strings that have a direct, clear meaning are identified directly, which saves processing resources while also achieving a higher accuracy rate.
Tenth embodiment
The present embodiment provides a character string identification device. It is similar to the seventh embodiment; the difference is that, as shown in Fig. 11, the device further includes:
Fourth recognition unit 6045, configured to identify the recognizable character string in the substring according to a default type.
Speech synthesis unit 6042, configured to synthesize the identified substring into speech.
For other details of the device of the present embodiment, reference may be made to the fifth embodiment; they are not repeated here.
According to the device of the present embodiment, some special character strings can be identified directly according to default rules, and special rules can be defined, which improves the recognition accuracy of character strings.
In addition, an embodiment of the present invention also provides a computer-readable storage medium storing computer-executable instructions. The computer-readable storage medium is, for example, a non-volatile memory such as an optical disc, a hard disk, or flash memory. The computer-executable instructions cause a computer or a similar computing device to perform the various operations of the character string identification method described above.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent change, or modification made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution, still falls within the scope of the technical solution of the present invention.
Claims (8)
1. A character string identification method, characterized in that the method comprises the following steps:
obtaining a character string, the character string being composed of multiple types of substrings, the multiple types of substrings including: an English type, a numeric type, a symbol type, a Chinese character type, and combinations thereof, the multiple types of substrings being matched to corresponding configuration files, the configuration file being used to mark the target types to which a pre-stored character string type corresponds;
segmenting the character string according to the substring types of the multiple types of substrings and combinations thereof, dividing the character string into at least one substring, and tagging each substring with a part of speech during segmentation, the part of speech indicating the type of each substring;
judging whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
if the substring is judged not to be a vocabulary word, performing identification processing on the at least one substring; and
synthesizing all identified substrings into continuous speech;
wherein identifying the substring specifically comprises:
establishing a character string matching model according to preceding and following character string information, identifying the meaning of the substring according to the matching model, and selecting the processing result of the matching model as the identification result;
synthesizing the identified substring into speech.
2. The character string identification method according to claim 1, characterized in that identifying the substring specifically comprises:
identifying the substring according to the content of the character strings preceding and following the substring;
synthesizing the identified substring into speech.
3. The character string identification method according to claim 1, characterized in that identifying the substring specifically comprises:
identifying the substring directly according to its meaning;
synthesizing the identified substring into speech.
4. The character string identification method according to claim 1, characterized in that identifying the substring specifically comprises:
identifying the recognizable character string in the substring according to a default type;
synthesizing the identified substring into speech.
5. A character string identification device, characterized in that the device comprises the following modules:
an acquisition module, configured to obtain a character string, the character string being composed of multiple types of substrings, the multiple types of substrings including: an English type, a numeric type, a symbol type, a Chinese character type, and combinations thereof, the multiple types of substrings being matched to corresponding configuration files, the configuration file being used to mark the target types to which a pre-stored character string type corresponds;
a word segmentation module, configured to segment the character string according to the substring types of the multiple types of substrings and combinations thereof, divide the character string into at least one substring, and tag each substring with a part of speech during segmentation, the part of speech indicating the type of each substring;
a judgment module, configured to judge whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
a processing module, configured to perform identification processing on the at least one substring if the substring is judged not to be a vocabulary word; and
a synthesis module, configured to synthesize all identified substrings into continuous speech;
wherein the processing module specifically comprises:
a second recognition unit, configured to establish a character string matching model according to preceding and following character string information, identify the meaning of the substring according to the matching model, and select the processing result of the matching model as the identification result;
a speech synthesis unit, configured to synthesize the identified substring into speech.
6. The character string identification device according to claim 5, characterized in that the processing module specifically comprises:
a first recognition unit, configured to identify the substring according to the content of the character strings preceding and following the substring;
a speech synthesis unit, configured to synthesize the identified substring into speech.
7. The character string identification device according to claim 5, characterized in that the processing module specifically comprises:
a third recognition unit, configured to identify the substring directly according to its meaning;
a speech synthesis unit, configured to synthesize the identified substring into speech.
8. The character string identification device according to claim 5, characterized in that the processing module specifically comprises:
a fourth recognition unit, configured to identify the recognizable character string in the substring according to a default type;
a speech synthesis unit, configured to synthesize the identified substring into speech.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410579684.5A CN104462058B (en) | 2014-10-24 | 2014-10-24 | Character string identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104462058A CN104462058A (en) | 2015-03-25 |
CN104462058B true CN104462058B (en) | 2018-10-02 |
Family
ID=52908128
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410579684.5A Active CN104462058B (en) | 2014-10-24 | 2014-10-24 | Character string identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104462058B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||