CN104462058B - Character string identification method and device - Google Patents


Info

Publication number
CN104462058B
Authority
CN
China
Prior art keywords
substring
character string
type
word
identification
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410579684.5A
Other languages
Chinese (zh)
Other versions
CN104462058A (en)
Inventor
戴强
刘骁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410579684.5A
Publication of CN104462058A
Application granted
Publication of CN104462058B

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to a character string identification method and device. In one embodiment, the method comprises the following steps: obtaining a character string composed of substrings of multiple types; segmenting the character string according to the substring types of those substrings and combinations thereof, so that the character string is divided into at least one substring; judging whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs; if the substring is judged not to be a vocabulary word, performing identification processing on the at least one substring; and synthesizing all identified substrings into continuous speech. The method and device of the embodiments of the present invention can accurately identify the meaning of a character string.

Description

Character string identification method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a character string identification method and device.
Background
With the development of computer technology, speech synthesis has emerged. Speech synthesis converts arbitrary text into standard, fluent speech in real time and reads it aloud. In terms of content, storage, transmission, convenience, and timeliness, this makes it easier for users to send and read messages. However, many character strings have several possible pronunciations, and different pronunciations carry different meanings; only the correct pronunciation conveys the intended meaning once the text is synthesized into speech. Accurately identifying the meaning of a character string is therefore particularly important in speech synthesis.
Summary of the invention
In view of this, the present invention provides a character string identification method and device that can accurately identify the meaning of a character string.
A character string identification method, comprising the following steps:
obtaining a character string, the character string being composed of substrings of multiple types;
segmenting the character string according to the substring types of the substrings of multiple types and combinations thereof, so that the character string is divided into at least one substring;
judging whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
if the substring is judged not to be a vocabulary word, performing identification processing on the at least one substring; and
synthesizing all identified substrings into continuous speech.
A character string identification device, comprising the following modules:
an acquisition module, configured to obtain a character string, the character string being composed of substrings of multiple types;
a word segmentation module, configured to segment the character string according to the substring types of the substrings of multiple types and combinations thereof, so that the character string is divided into at least one substring;
a judgment module, configured to judge whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
a processing module, configured to perform identification processing on the at least one substring if the substring is judged not to be a vocabulary word; and
a synthesis module, configured to synthesize all identified substrings into continuous speech.
According to the method and device of the above embodiments, the character string is segmented according to the types of its substrings and then identified word by word, which improves the accuracy of character string identification.
To make the above and other objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is a structural block diagram of an electronic device.
Fig. 2 is a flowchart of the character string identification method provided by the first embodiment.
Fig. 3 is a flowchart of the character string identification method provided by the second embodiment.
Fig. 4 is a flowchart of the character string identification method provided by the third embodiment.
Fig. 5 is a flowchart of the character string identification method provided by the fourth embodiment.
Fig. 6 is a flowchart of the character string identification method provided by the fifth embodiment.
Fig. 7 is a structural block diagram of the character string identification device provided by the sixth embodiment.
Fig. 8 is a structural block diagram of the character string identification device provided by the seventh embodiment.
Fig. 9 is a structural block diagram of the character string identification device provided by the eighth embodiment.
Fig. 10 is a structural block diagram of the character string identification device provided by the ninth embodiment.
Fig. 11 is a structural block diagram of the character string identification device provided by the tenth embodiment.
Detailed description
To further explain the technical means adopted by the present invention to achieve its intended purpose and their effects, the specific implementations, structures, features, and effects of the present invention are described in detail below with reference to the accompanying drawings and preferred embodiments.
The character string identification method and device in the embodiments of the present invention can be used to identify character strings in speech synthesis, and specifically can be used in an electronic device.
Fig. 1 is a structural block diagram of the above electronic device. As shown in Fig. 1, the electronic device 100 includes one or more processors 102 (only one is shown in the figure), a memory 104, an RF (Radio Frequency) module 106, a network module 108, an audio module 110, an input module 112, and a display module 114. Those of ordinary skill in the art will understand that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the electronic device 100. For example, the electronic device 100 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Specific examples of the electronic device 100 include, but are not limited to, handheld computers, mobile phones, media players, in-vehicle devices, personal digital assistants, and various combinations of the foregoing.
Those of ordinary skill in the art will understand that, relative to the processor 102, all other components are peripherals, and the processor 102 is coupled to these peripherals through multiple peripheral interfaces 124. The peripheral interface 124 can be implemented based on the following standards, but is not limited to them: Universal Asynchronous Receiver/Transmitter (UART), General Purpose Input Output (GPIO), Serial Peripheral Interface (SPI), and Inter-Integrated Circuit (I2C). In some examples, the peripheral interface 124 includes only a bus; in other examples, it may also include other elements, such as one or more controllers, for example a display controller for connecting a liquid crystal display panel or a storage controller 122 for connecting the memory. In addition, these controllers can also be separated from the peripheral interface 124 and integrated into the processor 102 or into the corresponding peripheral.
The memory 104 can be used to store software programs and modules, such as the program instructions/modules corresponding to the character string identification method and device in the embodiments of the present invention. By running the software programs and modules stored in the memory 104, the processor 102 performs various functional applications and data processing, that is, implements the character string identification method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which can be connected to the electronic device 100 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The RF module 106 receives and transmits electromagnetic waves and converts between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices. The RF module 106 may include various existing circuit elements for performing these functions, for example an antenna, an RF transceiver, a digital signal processor, an encryption/decryption chip, a subscriber identity module (SIM) card, memory, and so on. The RF module 106 can communicate with various networks such as the Internet, an intranet, or a wireless network, or communicate with other devices over a wireless network. The wireless network may include a cellular telephone network, a wireless local area network, or a metropolitan area network. The wireless network can use various communication standards, protocols, and technologies, including but not limited to Global System for Mobile Communication (GSM), Enhanced Data GSM Environment (EDGE), Wideband Code Division Multiple Access (W-CDMA), Code Division Multiple Access (CDMA), Time Division Multiple Access (TDMA), Wireless Fidelity (WiFi) (for example the Institute of Electrical and Electronics Engineers standards IEEE 802.11a, IEEE 802.11b, IEEE 802.11g, and/or IEEE 802.11n), Voice over Internet Protocol (VoIP), Worldwide Interoperability for Microwave Access (Wi-Max), other protocols for mail, instant messaging, and short messages, and any other suitable communication protocol, even including protocols that have not yet been developed.
The network module 108 receives and transmits network signals, which may be wireless or wired. In one example, the network signal is a WiFi signal. Because the operating frequency of WiFi also lies within the radio frequency range, the network module can then have a hardware structure similar to that of the RF module 106, that is, it may include an antenna, an RF transceiver, a digital signal processor, an encryption/decryption chip, and similar elements. In another example, the network signal is a wired network signal, in which case the network module 108 may include a processor, random access memory, a converter, a crystal oscillator, and similar elements.
The audio circuit 110, a loudspeaker, an audio jack, and a microphone together provide the audio interface between the user and the electronic device 100. Specifically, the audio circuit 110 receives audio data from the processor 102, converts it into an electrical signal, and transmits the electrical signal to the loudspeaker. The loudspeaker 101 converts the electrical signal into sound waves audible to the human ear. The audio circuit 110 also receives electrical signals from the microphone, converts them into audio data, and transmits the audio data to the processor 102 for further processing. Audio data can be obtained from the memory 104 or through the RF module 106 or the network module 108. In addition, audio data can be stored in the memory 104 or sent through the RF module 106 and the network module 108.
The input unit 112 can be used to receive input character information and to generate keyboard, mouse, joystick, optical, or trackball signal input related to user settings and function control. Specifically, the input unit 112 may include buttons and a touch surface. The buttons may include, for example, character keys for entering characters and control buttons for triggering control functions. Examples of control buttons include a "return to home screen" button, a power on/off button, a camera button, and so on. The touch surface collects the user's touch operations on or near it (for example, operations performed on or near the touch surface with a finger, a stylus, or any other suitable object or accessory) and drives the corresponding connected device according to a preset program. Optionally, the touch surface may include a touch detection device and a touch controller. The touch detection device detects the position of the user's touch and the signal produced by the touch operation, and passes the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 102, and receives and executes commands sent by the processor 102. The touch surface can be implemented using resistive, capacitive, infrared, surface acoustic wave, and other technologies. Besides the touch surface, the input unit 112 may also include other input devices, including but not limited to one or more of a physical keyboard, a trackball, a mouse, a joystick, and the like.
The display module 114 is used to display information entered by the user, information provided to the user, and the various graphical user interfaces of the electronic device 100, which can be composed of graphics, text, icons, video, and any combination thereof. In one example, the display module 114 includes a display panel, which may be, for example, a liquid crystal display (Liquid Crystal Display, LCD) panel, an organic light-emitting diode (Organic Light-Emitting Diode Display, OLED) display panel, an electrophoretic display (Electro-Phoretic Display, EPD) panel, and so on. Further, the touch surface may be arranged on the display panel so that the two form a single unit. In other embodiments, the display module 114 may also include other types of display devices, for example a projection display device. Compared with an ordinary display panel, a projection display device also needs to include some components for projection, such as a lens group.
First embodiment
Fig. 2 is a flowchart of the character string identification method provided by this embodiment. As shown in Fig. 2, the method of this embodiment includes the following steps:
Step S101: obtain a character string, the character string being composed of substrings of multiple types.
The character string may be one entered by the user in real time, or one that already exists in the current electronic device. In one example, the method of this embodiment is used in an instant messaging tool in which a first user terminal and a second user terminal send character strings to each other; the obtained character string may be a character string received in the current interface, or a character string in the history of the messaging tool. In another example, the method of this embodiment is used in translation software, and the character string may be a character string that the electronic device receives from user input.
It can be understood that there are many types of character strings, for example Arabic, Chinese, English, digits, symbols, and any combination of these. Each type of character string is also matched to a corresponding configuration file, which records the target types that a pre-stored character string type may correspond to. For example, the number-symbol-number type "Number2Punction2Number" can represent a decimal, a telephone number, a numerical value, and so on, for example "2.13" or "010-88888888". The corresponding configuration is: "Number2Punction2Number: Decimal, Telephone". Further, the configuration file can be modified and extended with new meanings for a character string type. For example, the character string "3,247" belongs to the number-symbol-number type "Number2Punction2Number", but it matches none of the target types set in the configuration file; it is a numerical value. The target type "Numerical" can then be added to the configuration file for that character string type.
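The configuration lookup and its extension can be sketched as follows. This is a minimal illustration only: the patent does not specify the file format, so the dictionary layout and the helper name `add_target_type` are assumptions introduced here.

```python
# Hypothetical in-memory form of the configuration file described above.
# Keys are substring type patterns; values are the candidate target types.
TYPE_CONFIG = {
    "Number2Punction2Number": ["Decimal", "Telephone"],
}


def add_target_type(config, pattern, target):
    """Register a new target type for a pattern, ignoring duplicates."""
    targets = config.setdefault(pattern, [])
    if target not in targets:
        targets.append(target)
    return config


# "3,247" matches the pattern but is neither a decimal nor a telephone
# number, so the numerical-value target type is added to the configuration.
add_target_type(TYPE_CONFIG, "Number2Punction2Number", "Numerical")
print(TYPE_CONFIG["Number2Punction2Number"])
```

Keeping the candidate target types in data rather than in code means a new meaning can be registered without touching the identification logic.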
Step S102: segment the character string according to the substring types of the substrings of multiple types and combinations thereof, so that the character string is divided into at least one substring.
In one embodiment, character strings are divided into four major classes: English (English), Chinese characters (Kanji), symbols (Punctuation), and digits (Number). These four classes can also be combined arbitrarily, for example: English2Number, the type of English followed by digits, with type length 2, such as "CA1419"; Number2Punctuation2Number, the type of digits, a symbol, and digits, with type length 3, such as "010-88888888"; Number2Kanji, the type of digits followed by Chinese characters, with type length 2, such as "2014" followed by the Chinese character for "year". Segmentation can be performed according to English (English), Chinese characters (Kanji), symbols (Punctuation), digits (Number), and combinations thereof.
In one example, the sentence "China Mobile (0941) announced its fiscal year 2005 operating results in Hong Kong on March 16" is segmented as "China / Mobile / ( / 0941 / ) / March / 16th / in / Hong Kong / announced / 2005 / fiscal year / operating / results". Further, each substring is also tagged with a part of speech during segmentation. For example, "China" is tagged "Kanji" and "March" is tagged "Number2Kanji" (a number followed by a Chinese character in the original text). The tag of a substring can then serve as reference information for its neighboring substrings during identification processing.
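The type tagging described above can be sketched as follows. This is a simplified illustration, not the patent's segmenter: it only splits a string into runs of same-type characters and does not perform dictionary-based word segmentation. The function names and the use of the CJK Unified Ideographs range to detect Chinese characters are assumptions made here.

```python
from itertools import groupby


def char_type(ch):
    """Map one character to one of the four base types named in the text."""
    if ch.isascii() and ch.isalpha():
        return "English"
    if ch.isdigit():
        return "Number"
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "Kanji"
    # Anything else, including whitespace, counts as Punctuation here.
    return "Punctuation"


def typed_runs(text):
    """Split text into maximal runs of characters that share a base type."""
    return [("".join(group), kind) for kind, group in groupby(text, key=char_type)]


def combined_type(text):
    """Join the run types with '2', e.g. 'English2Number' for 'CA1419'."""
    return "2".join(kind for _, kind in typed_runs(text))


print(combined_type("CA1419"))
print(combined_type("010-88888888"))
```

The combined label can then be used as the key for looking up candidate target types in the configuration file.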
Step S103: judge whether the at least one substring is a vocabulary word.
A vocabulary word is a word that has a unique meaning in the language to which the substring belongs, and it has only one pronunciation when output as speech. For example, if the substring is the Chinese word for "China", it has a unique meaning in Chinese, so the substring can be judged to be a vocabulary word. Likewise, the substring "China" has a unique meaning in English and can also be judged to be a vocabulary word.
In the embodiments of the present invention, based on the four classes of character strings above, it can be understood that if a substring is a Chinese or English word, its meaning can generally be identified directly without ambiguity. For example, "China" can be read out directly, in order, during speech synthesis. It is therefore judged whether the substring is English or Chinese: a Chinese or English word can be read out directly, with no further identification of its meaning. If the substring is not a Chinese or English word, its ambiguity must be resolved; for example, "2001" followed by the character for "year" can be read as a quantity, "two thousand and one years", or digit by digit as the year "two zero zero one".
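A minimal sketch of this judgment, assuming each substring already carries a type tag of the kind produced during segmentation. Treating every purely English or purely Chinese substring as a vocabulary word is a simplification adopted for illustration; the names `VOCABULARY_TYPES` and `is_vocabulary` are hypothetical.

```python
# Assumed simplification: substrings tagged purely English or purely Kanji
# are read out directly; every other tag goes to identification processing.
VOCABULARY_TYPES = {"English", "Kanji"}


def is_vocabulary(kind):
    """Return True when the substring type needs no further identification."""
    return kind in VOCABULARY_TYPES


tagged = [("China", "English"), ("2001", "Number"), ("CA1419", "English2Number")]
needs_identification = [text for text, kind in tagged if not is_vocabulary(kind)]
print(needs_identification)
```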
Step S104: if the substring is judged not to be a vocabulary word, perform identification processing on the at least one substring.
In one embodiment, the method of this embodiment is used for speech synthesis. Speech synthesis, also known as text-to-speech (Text to Speech) technology, converts arbitrary text into standard, fluent speech in real time and reads it aloud, which is equivalent to giving the machine an artificial mouth. Synthesizing speech relies on various rules, including semantic, lexical, and phonetic rules, and also requires a good understanding of the text content, which involves the problem of natural language understanding.
Character strings that combine the four major types above may be ambiguous: the same type of character string can represent several kinds of content, so the meaning of each substring in the current character string needs to be identified.
For example, "120" can denote the ambulance telephone number, read digit by digit as "one two zero", or a numerical value, read as "one hundred and twenty". It can be identified from the meaning of the preceding and following substrings. In one example, in "dial the 120 ambulance number", "120" can be judged to be a telephone number from the following character string "ambulance number".
The "Number2Punction2Number" type can represent a decimal, a telephone number, a numerical value, and so on. For example, in "2014 / China / Mobile / revenue / 3,247 / hundred million yuan / RMB", "3,247" can be judged to be a numerical value from the neighboring character string "hundred million yuan". "010-88888888" is also of the "Number2Punction2Number" type and represents a telephone number. As a further example, "2014" in the example above can mean either a quantity of 2,014 years or the year 2014. A matching model can then be built from the information in the preceding and following character strings; the input is processed by the model, and the model's result is selected as the final identification result. In one example, a conditional random field model (CRF model) can be used. A conditional random field is an undirected graphical model in which the vertices represent random variables and the edges between vertices represent dependencies between the random variables. In a conditional random field, the distribution of the random variable Y is a conditional probability, conditioned on the observed random variable X. In principle the graph layout of a conditional random field can be arbitrary, but the commonly used layout is a linear-chain structure. In the example above, "2014" can be judged to mean the year 2014, rather than 2,014 years, from the subsequent character strings "China / Mobile / revenue".
For example, a number followed by a percent sign has a definite meaning: it denotes a percentage. Such substrings are identified by matching a general rule.
For example, the character strings "jpg", "gif", and so on are image-type strings. A default rule can be set so that "BMP", "JPG", "GIF", and "PNG" are identified as image formats and read out directly, letter by letter and digit by digit, in order.
Step S105: synthesize all identified substrings into continuous speech.
The identified character string is converted into fluent spoken output that a listener can understand.
Further, in the method of this embodiment, the identified character string can also be synthesized into speech.
According to the method of this embodiment, the character string to be identified is segmented and identification processing is then performed on each substring, which improves the accuracy of identification.
Second embodiment
This embodiment provides a character string identification method that is similar to the first embodiment. The difference is that, as shown in Fig. 3, step S104 specifically further includes:
Step S201: identify the substring according to the content of the character strings before and after it.
Step S202: synthesize the identified substring into speech.
The method of this embodiment identifies a substring from the substrings before or after it. For some character strings, the neighboring substrings remove the ambiguity, and a result can be obtained.
For example, "120" can denote the ambulance telephone number, read digit by digit as "one two zero", or a numerical value, read as "one hundred and twenty". It can be identified from the meaning of the preceding and following substrings. In one example, in "dial the 120 ambulance number", "120" can be judged to be a telephone number from the following character string "ambulance number". As another example, in "2014 / China / Mobile / revenue / 3,247 / hundred million yuan / RMB", "3,247" can be judged to be a numerical value from the neighboring character string "hundred million yuan". Once an accurate result has been identified, the currently identified substring is synthesized into speech.
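Identification from neighboring substrings can be sketched as a simple cue-word check. The cue list, the window size, and the function names below are assumptions for illustration; the patent does not prescribe how the neighboring content is matched.

```python
# Hypothetical cue words; a real system would need a much larger list.
TELEPHONE_CUES = {"dial", "call", "phone", "telephone", "ambulance"}
DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def classify_number(tokens, index, window=2):
    """Label tokens[index] from the substrings within `window` positions."""
    start = max(0, index - window)
    neighbours = tokens[start:index] + tokens[index + 1:index + 1 + window]
    if any(tok.lower() in TELEPHONE_CUES for tok in neighbours):
        return "Telephone"
    return "Numerical"


def read_digit_by_digit(number):
    """Spell a digit string one digit at a time, as for a telephone number."""
    return " ".join(DIGIT_NAMES[int(d)] for d in number)


tokens = ["dial", "120", "ambulance", "number"]
if classify_number(tokens, 1) == "Telephone":
    print(read_digit_by_digit(tokens[1]))
```

When no cue word is found the sketch falls back to "Numerical"; a fuller implementation would more likely hand such cases to a statistical model.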
According to the method of this embodiment, when identification processing is performed on a substring, the meaning of the substring is identified from the information in the preceding and following substrings, which avoids interference from character strings with multiple meanings and achieves higher accuracy.
Third embodiment
This embodiment provides a character string identification method that is similar to the first embodiment. The difference is that, as shown in Fig. 4, step S104 specifically further includes:
Step S301: build a character string matching model and identify the meaning of the substring according to the matching model.
Step S302: synthesize the identified substring into speech.
Each type of substring is also matched to a corresponding configuration file, which records the target types that a pre-stored character string type may correspond to. For example, the number-symbol-number type "Number2Punction2Number" can represent a decimal, a telephone number, a numerical value, and so on. The corresponding configuration is: "Number2Punction2Number: Decimal, Telephone, Numerical". When identifying a character string, identification can be performed according to the configuration file matched to the type of the substring.
For example, in "2014 / China / Mobile / revenue / 3,247 / hundred million yuan / RMB", "3,247" can be judged to be a numerical value from the neighboring character string "hundred million yuan". "010-88888888" is also of the "Number2Punction2Number" type and represents a telephone number. As a further example, "2014" in the example above can mean either a quantity of 2,014 years or the year 2014. A matching model can then be built from the information in the preceding and following character strings; the input is processed by the model, and the model's result is selected as the final identification result. In one example, a conditional random field model (CRF model) can be used. A conditional random field is an undirected graphical model in which the vertices represent random variables and the edges between vertices represent dependencies between the random variables. In a conditional random field, the distribution of the random variable Y is a conditional probability, conditioned on the observed random variable X. In principle the graph layout of a conditional random field can be arbitrary, but the commonly used layout is a linear-chain structure. In the example above, "2014" can be judged to mean the year 2014, rather than 2,014 years, from the subsequent character strings "China / Mobile / revenue". It can be understood that the matching model can also be another statistical model, such as a hidden Markov model (HMM model), a conditional random field model (CRF model), or a maximum entropy model (ME model). Finally, the identified character string is synthesized into speech.
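For reference, the linear-chain layout mentioned above is commonly written in the following standard textbook form. The patent gives no formula, so the notation is introduced here: x is the observed sequence of substrings (the random variable X), y is the sequence of target-type labels (the random variable Y), f_k are feature functions over adjacent labels and the observations, lambda_k are their learned weights, and Z(x) normalizes over all candidate label sequences y'.

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Under this form, identifying "2014" amounts to choosing the label sequence y with the highest conditional probability given the surrounding substrings.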
According to the method of this embodiment, some character strings may remain ambiguous even given the neighboring information; by building a matching model and comparing the contextual character string information, the meaning of the current substring is identified, which further improves the accuracy of character string identification.
Fourth embodiment
This embodiment provides a character string identification method that is similar to the first embodiment. The difference is that, as shown in Fig. 5, step S104 specifically further includes:
Step S401: identify the substring directly according to its meaning.
Step S402: synthesize the identified substring into speech.
For example, a number followed by a percent sign has a definite meaning: it denotes a percentage. Such substrings are identified by matching a general rule.
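Such a general rule can be sketched with a regular expression. The pattern and the function name are assumptions for illustration; the patent only states that a number followed by a percent sign denotes a percentage.

```python
import re

# Assumed rule: an integer or decimal number immediately followed by "%".
PERCENT_RE = re.compile(r"(\d+(?:\.\d+)?)%")


def identify_percentage(substring):
    """Return ('Percentage', value) if the whole substring matches the rule."""
    match = PERCENT_RE.fullmatch(substring)
    if match is None:
        return None
    return ("Percentage", float(match.group(1)))


print(identify_percentage("35%"))
print(identify_percentage("35"))
```

Because the rule is unambiguous, no context lookup or model is needed for substrings of this kind.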
According to the method of this embodiment, character strings with a direct and definite meaning are identified directly, which saves processing resources while also achieving higher accuracy.
Fifth embodiment
This embodiment provides a character string identification method that is similar to the first embodiment. The difference is that, as shown in Fig. 6, step S104 specifically further includes:
Step S501: identify the recognizable character string in the substring according to a default type.
Step S502: synthesize the identified substring into speech.
Some character strings have a default meaning, so default identification rules can be set for them.
For example, the character strings "jpg", "gif", and so on are image-type strings. A default rule can be set so that "BMP", "JPG", "GIF", and "PNG" are identified as image formats and read out directly, letter by letter and digit by digit, in order. During speech synthesis, the speech for the letters and digits in the character string is then synthesized directly, in order.
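The default rule for image formats can be sketched as a lookup followed by spelling. The set contents come from the text; the function name and the case-insensitive comparison are assumptions made here.

```python
# Image-format names listed in the text; matching is case-insensitive here.
IMAGE_FORMATS = {"BMP", "JPG", "GIF", "PNG"}


def spell_out_if_image_format(substring):
    """Return the characters to read in order, or None if no rule applies."""
    if substring.upper() in IMAGE_FORMATS:
        return list(substring.upper())
    return None


print(spell_out_if_image_format("jpg"))
```

The returned list gives the units to synthesize one after another, which matches reading the string letter by letter.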
According to the method for the present embodiment, part special string can be directly identified according to the rule of acquiescence, it can Special rules is defined, the recognition accuracy of character string is improved.
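A default rule of this kind can be sketched as a lookup table. The table name `DEFAULT_TYPES`, the function `recognize_by_default` and the case-insensitive matching are assumptions for illustration, not details taken from the patent.

```python
# Hypothetical default-rule table: these abbreviations denote picture formats.
DEFAULT_TYPES = {
    "BMP": "picture format",
    "JPG": "picture format",
    "GIF": "picture format",
    "PNG": "picture format",
}

def recognize_by_default(substring):
    """Return (default type, characters to read in order), or None if no rule applies."""
    key = substring.upper()  # assumed: "jpg" and "JPG" are treated alike
    if key not in DEFAULT_TYPES:
        return None
    # The string is read out character by character, in order.
    return DEFAULT_TYPES[key], list(key)
```

Adding a special rule then amounts to adding an entry to the table.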
Sixth embodiment
This embodiment provides a character string identification device. As shown in Fig. 7, the device of this embodiment includes: an acquisition module 601, a word segmentation module 602, a judgment module 603, a processing module 604 and a synthesis module 605.
The acquisition module 601 is configured to obtain a character string, the character string being composed of substrings of multiple types.
It can be understood that character strings come in many types, for example Arabic numerals, Chinese, English, symbols, and any combination thereof. Each character string type is also matched to a corresponding configuration file, which marks, for each pre-stored character string type, the target types it may correspond to. For example, the type number-plus-symbol-plus-number, "Number2Punction2Number", can denote a decimal, a telephone number, a numerical value and so on, as in "2.13" or "010-88888888". The corresponding configuration is "Number2Punction2Number: Decimal, Telephone". Further, the configuration file can be modified and extended with new meanings for a character string type. For example, the character string "3,247" belongs to the type "Number2Punction2Number", but it matches none of the target types set in the configuration file; it is a numerical value. The target type "Numerical" can then be added to the configuration file for that character string type.
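The configuration mapping described above can be sketched in Python as a dictionary from character string type to candidate target types. The in-memory representation and the names `default_config` and `add_target_type` are assumptions; the patent gives only the textual form "Number2Punction2Number: Decimal, Telephone".

```python
def default_config():
    """Return a fresh configuration: character string type -> candidate target types."""
    return {"Number2Punction2Number": ["Decimal", "Telephone"]}

def add_target_type(config, string_type, target_type):
    """Extend the configuration with a new target type, avoiding duplicates."""
    targets = config.setdefault(string_type, [])
    if target_type not in targets:
        targets.append(target_type)
    return targets

config = default_config()
# "3,247" is neither a decimal nor a telephone number, so a new target type is added.
add_target_type(config, "Number2Punction2Number", "Numerical")
```

After this call the type "Number2Punction2Number" maps to Decimal, Telephone and Numerical.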
The word segmentation module 602 is configured to segment the character string according to the substring types of the multi-type substrings and combinations thereof, dividing the character string into at least one substring.
In one embodiment, character strings are divided into four major classes: English (English), Chinese characters (Kanji), symbols (Punctuation) and digits (Number). These four classes can also be combined arbitrarily. For example, English2Number denotes English followed by digits, with a type length of 2, such as "CA1419"; Number2Punctuation2Number denotes digits, a symbol, then digits, with a type length of 3, such as "010-88888888"; Number2Kanji denotes digits followed by Chinese characters, with a type length of 2, such as "2014年" (the year 2014). Segmentation can be performed according to English (English), Chinese characters (Kanji), symbols (Punctuation), digits (Number) and combinations thereof.
In one example, the sentence "China Mobile (0941) announced its fiscal year 2005 business results in Hong Kong on March 16" is segmented as "China / Mobile / ( / 0941 / ) / March / 16th / in / Hong Kong / announced / 2005 / fiscal year / business / results". Further, each substring is also tagged with a part of speech during segmentation. For example, "China" is tagged "Kanji" and "March" is tagged "Number2Kanji". The part-of-speech tag of a substring can then serve as reference information for the neighboring substrings during substring identification.
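The type tagging described above can be sketched as follows. This is a simplified illustration under stated assumptions: it only groups consecutive characters of the same class and joins the class names with "2", it does not perform Chinese word segmentation, and the function names and the Unicode range used for Chinese characters are illustrative choices.

```python
from itertools import groupby

def char_type(ch):
    """Classify one character into one of the four classes named in the text."""
    if ch.isascii() and ch.isdigit():
        return "Number"
    if ch.isascii() and ch.isalpha():
        return "English"
    if "\u4e00" <= ch <= "\u9fff":  # assumed range: CJK Unified Ideographs
        return "Kanji"
    return "Punctuation"  # everything else, including whitespace

def split_by_type(text):
    """Split text into (run, class) pairs of consecutive same-class characters."""
    return [("".join(group), tag) for tag, group in groupby(text, key=char_type)]

def composite_type(text):
    """Join the run classes with '2', e.g. 'English2Number' for 'CA1419'."""
    return "2".join(tag for _, tag in split_by_type(text))
```

For instance, `composite_type("010-88888888")` yields "Number2Punctuation2Number" and `composite_type("3月")` yields "Number2Kanji", matching the type names used in this embodiment.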
The judgment module 603 is configured to judge whether the at least one substring is a vocabulary word.
A vocabulary word is a word that has a unique meaning in the language to which the substring belongs.
In this embodiment of the invention, under the four-class classification described above, it can be understood that if the substring is a Chinese or English word, its meaning can be identified directly and there is normally no ambiguity. For example, "China" can be read out directly during speech synthesis. The module judges whether the substring is English or Chinese; if it is a Chinese or English word, it is read directly and no further identification of meaning is needed. If it is not a Chinese or English word, the ambiguity must be resolved: for example, "2001年" (the year 2001) can be read either as a quantity, "two thousand and one", or digit by digit, "two zero zero one".
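The judgment can be sketched as a check on the character classes of the substring. This is a simplification made for illustration: the function name `is_vocabulary_word` is hypothetical, and a practical system would presumably also consult a lexicon, since not every run of letters is a word with a unique meaning.

```python
def is_vocabulary_word(substring):
    """Return True if the substring consists only of English letters or only of Chinese characters."""
    if not substring:
        return False
    is_english = all(ch.isascii() and ch.isalpha() for ch in substring)
    is_chinese = all("\u4e00" <= ch <= "\u9fff" for ch in substring)
    return is_english or is_chinese
```

Substrings for which this returns False, such as "2001年" or "010-88888888", are the ones passed on for identification processing.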
The processing module 604 is configured to perform identification processing on the at least one substring if the substring is judged not to be a vocabulary word.
Substrings that may be ambiguous are identified so as to obtain an accurate result.
The synthesis module 605 is configured to synthesize all identified substrings into continuous speech, turning the identified character string into fluent spoken output that a listener can understand.
According to the device of this embodiment, the character string to be identified is segmented and each substring is identified separately, which improves the accuracy of identification.
Seventh embodiment
This embodiment provides a character string identification device. It is similar to the sixth embodiment, except that, as shown in Fig. 8, the device further includes:
A first recognition unit 6041, configured to identify the substring according to the content of the character strings preceding and following it;
A speech synthesis unit 6042, configured to synthesize the identified substring into speech.
For other details of the device of this embodiment, refer to the second embodiment; they are not repeated here.
According to the device of this embodiment, when a substring is identified, its meaning is determined from the information of the preceding and following substrings, which avoids interference from character strings with multiple meanings and achieves higher accuracy.
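Context-based identification can be illustrated with a deliberately small sketch. The rule used here, reading a number digit by digit when the following substring begins with "年" (year) and as a quantity otherwise, is an assumed example; the function name and the two mode labels are hypothetical and are not taken from the patent.

```python
def reading_mode(tokens, index):
    """Choose how to read the numeric token at tokens[index] from the token that follows it."""
    following = tokens[index + 1] if index + 1 < len(tokens) else ""
    if following.startswith("年"):
        return "digit_by_digit"  # a year such as 2001 is read digit by digit
    return "cardinal"  # otherwise read as a quantity

# Example: the same digits are read differently depending on the next substring.
year_mode = reading_mode(["2001", "年"], 0)
amount_mode = reading_mode(["2001", "元"], 0)
```

A fuller implementation would look at both the preceding and the following substrings and at their part-of-speech tags, as the embodiment describes.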
Eighth embodiment
This embodiment provides a character string identification device. It is similar to the seventh embodiment, except that, as shown in Fig. 9, the device further includes:
A second recognition unit 6043, configured to establish a character string matching model and identify the meaning of the substring according to the matching model.
A speech synthesis unit 6042, configured to synthesize the identified substring into speech after the substring in the character string has been identified.
For other details of the device of this embodiment, refer to the third embodiment; they are not repeated here.
According to the device of this embodiment, some character strings may remain ambiguous even after the preceding and following information has been considered. By establishing a matching model and comparing the character string information of the context, the meaning of the current substring is identified, which further improves the accuracy of character string identification.
Ninth embodiment
This embodiment provides a character string identification device. It is similar to the seventh embodiment, except that, as shown in Fig. 10, the device further includes:
A third recognition unit 6044, configured to identify the substring directly according to its meaning.
A speech synthesis unit 6042, configured to synthesize the identified substring into speech.
For other details of the device of this embodiment, refer to the fourth embodiment; they are not repeated here.
According to the device of this embodiment, character strings that have a direct and definite meaning are identified directly, which saves processing resources while still achieving high accuracy.
Tenth embodiment
This embodiment provides a character string identification device. It is similar to the seventh embodiment, except that, as shown in Fig. 11, the device further includes:
A fourth recognition unit 6045, configured to identify the recognizable character strings in the substring according to a default type.
A speech synthesis unit 6042, configured to synthesize the identified substring into speech.
For other details of the device of this embodiment, refer to the fifth embodiment; they are not repeated here.
According to the device of this embodiment, some special character strings can be identified directly by default rules, and special rules can be defined, which improves the recognition accuracy of character strings.
In addition, an embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions. The storage medium is, for example, a non-volatile memory such as an optical disc, a hard disk or a flash memory. The computer-executable instructions cause a computer or a similar computing device to perform the operations of the character string identification method described above.
The above are only preferred embodiments of the present invention and do not limit the invention in any form. Although the invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the invention, use the technical content disclosed above to make minor changes or modifications that amount to equivalent embodiments. Any simple modification, equivalent change or refinement made to the above embodiments according to the technical essence of the invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the invention.

Claims (8)

1. A character string identification method, characterized in that the method comprises the following steps:
Obtaining a character string, the character string being composed of substrings of multiple types, the multiple types of substrings including: an English type, a numeric type, a symbol type, a Chinese character type and combinations thereof, each of the multiple types of substrings being matched to a corresponding configuration file, the configuration file being used to mark the target types to which a pre-stored character string type corresponds;
Segmenting the character string according to the substring types of the multiple types of substrings and combinations thereof, dividing the character string into at least one substring, and tagging each substring with a part of speech during segmentation, the part of speech indicating the type of each substring;
Judging whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
If the substring is judged not to be a vocabulary word, performing identification processing on the at least one substring; and
Synthesizing all identified substrings into continuous speech;
Wherein identifying the substring specifically includes:
Establishing a character string matching model according to preceding and following character string information, identifying the meaning of the substring according to the matching model, and selecting the processing result of the matching model as the identification result;
Synthesizing the identified substring into speech.
2. The character string identification method of claim 1, characterized in that identifying the substring specifically includes:
Identifying the substring according to the content of the character strings preceding and following the substring;
Synthesizing the identified substring into speech.
3. The character string identification method of claim 1, characterized in that identifying the substring specifically includes:
Identifying the substring directly according to its meaning;
Synthesizing the identified substring into speech.
4. The character string identification method of claim 1, characterized in that identifying the substring specifically includes:
Identifying the recognizable character strings in the substring according to a default type;
Synthesizing the identified substring into speech.
5. A character string identification device, characterized in that the device comprises the following modules:
An acquisition module, configured to obtain a character string, the character string being composed of substrings of multiple types, the multiple types of substrings including: an English type, a numeric type, a symbol type, a Chinese character type and combinations thereof, each of the multiple types of substrings being matched to a corresponding configuration file, the configuration file being used to mark the target types to which a pre-stored character string type corresponds;
A word segmentation module, configured to segment the character string according to the substring types of the multiple types of substrings and combinations thereof, divide the character string into at least one substring, and tag each substring with a part of speech during segmentation, the part of speech indicating the type of each substring;
A judgment module, configured to judge whether the at least one substring is a vocabulary word, a vocabulary word being a word that has a unique meaning in the language to which the substring belongs;
A processing module, configured to perform identification processing on the at least one substring if the substring is judged not to be a vocabulary word; and
A synthesis module, configured to synthesize all identified substrings into continuous speech;
Wherein the processing module specifically includes:
A second recognition unit, configured to establish a character string matching model according to preceding and following character string information, identify the meaning of the substring according to the matching model, and select the processing result of the matching model as the identification result;
A speech synthesis unit, configured to synthesize the identified substring into speech.
6. The character string identification device of claim 5, characterized in that the processing module specifically includes:
A first recognition unit, configured to identify the substring according to the content of the character strings preceding and following the substring;
A speech synthesis unit, configured to synthesize the identified substring into speech.
7. The character string identification device of claim 5, characterized in that the processing module specifically includes:
A third recognition unit, configured to identify the substring directly according to its meaning;
A speech synthesis unit, configured to synthesize the identified substring into speech.
8. The character string identification device of claim 5, characterized in that the processing module specifically includes:
A fourth recognition unit, configured to identify the recognizable character strings in the substring according to a default type;
A speech synthesis unit, configured to synthesize the identified substring into speech.
CN201410579684.5A 2014-10-24 2014-10-24 Character string identification method and device Active CN104462058B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410579684.5A CN104462058B (en) 2014-10-24 2014-10-24 Character string identification method and device


Publications (2)

Publication Number Publication Date
CN104462058A CN104462058A (en) 2015-03-25
CN104462058B true CN104462058B (en) 2018-10-02

Family

ID=52908128

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410579684.5A Active CN104462058B (en) 2014-10-24 2014-10-24 Character string identification method and device

Country Status (1)

Country Link
CN (1) CN104462058B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156120B (en) * 2015-04-07 2020-02-28 阿里巴巴集团控股有限公司 Method and device for classifying character strings
CN104881503A (en) * 2015-06-24 2015-09-02 郑州悉知信息技术有限公司 Data processing method and device
CN105653517A (en) * 2015-11-05 2016-06-08 乐视致新电子科技(天津)有限公司 Recognition rate determining method and apparatus
CN107705784B (en) * 2017-09-28 2020-09-29 百度在线网络技术(北京)有限公司 Text regularization model training method and device, and text regularization method and device
CN109359274B (en) * 2018-09-14 2023-05-02 蚂蚁金服(杭州)网络技术有限公司 Method, device and equipment for identifying character strings generated in batch
CN109857898A (en) * 2019-02-20 2019-06-07 成都嗨翻屋科技有限公司 A kind of method and system of mass digital audio-frequency fingerprint storage and retrieval
CN110047464A (en) * 2019-03-29 2019-07-23 联想(北京)有限公司 Information processing method and device
CN110705274B (en) * 2019-09-06 2023-03-24 电子科技大学 Fusion type word meaning embedding method based on real-time learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1152749A (en) * 1996-01-30 1997-06-25 陈肇雄 Fully automatic system for separating Chinese words from sentences
CN101082908A (en) * 2007-06-26 2007-12-05 腾讯科技(深圳)有限公司 Method and system for dividing Chinese sentences
CN101114282A (en) * 2007-07-12 2008-01-30 华为技术有限公司 Participle processing method and equipment
CN102402502A (en) * 2011-11-24 2012-04-04 北京趣拿信息技术有限公司 Word segmentation processing method and device for search engine

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7860707B2 (en) * 2006-12-13 2010-12-28 Microsoft Corporation Compound word splitting for directory assistance services


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
《中文语音合成中的文本正则化研究》;贾玉祥等;《中文信息学报》;20080930;第22卷(第5期);第46页左栏第5段、右栏第2段以及表2,第3节第1段,第3.1节第3段 *
《中文语音合成系统中的中文标准化方法》;陈志刚等;《中文信息学报》;20030430;第17卷(第4期);全文 *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant