WO2017166626A1 - Normalization method, apparatus, and electronic device - Google Patents

Normalization method, apparatus, and electronic device

Info

Publication number
WO2017166626A1
Authority
WO
WIPO (PCT)
Prior art keywords
word
normalized
result
word dictionary
dictionary
Prior art date
Application number
PCT/CN2016/096673
Other languages
English (en)
French (fr)
Inventor
周蕾蕾
Original Assignee
乐视控股(北京)有限公司
乐视致新电子科技(天津)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司, 乐视致新电子科技(天津)有限公司 filed Critical 乐视控股(北京)有限公司
Publication of WO2017166626A1 publication Critical patent/WO2017166626A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/151Transformation
    • G06F40/157Transformation using dictionaries or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Definitions

  • Embodiments of the present invention relate to the field of natural language processing technologies, and in particular, to a normalization method, apparatus, and electronic device.
  • When a speech recognition result is parsed, its numeric parts may include Arabic numerals, Chinese numeral characters, decimals, fractions, and so on. The recognized form is difficult to control and is often not the result we want, so the recognition result needs to be normalized, both so that it displays well and so that subsequent semantic parsing is convenient. For example, the recognition result "二零零五年" (the year spoken digit by digit) is normalized to "2005年", and "十二点一刻" (a quarter past twelve) is normalized to "12:15".
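The patent gives no implementation for this rewriting; as a rough illustrative sketch (not the patent's grammar-based method), a digit-by-digit reading such as "二零零五" can be normalized with a simple character map. Time expressions like "十二点一刻" require the grammar machinery described later and are out of scope for this sketch.

```python
# Minimal sketch of digit normalization (illustrative only; the patent's
# actual method combines a normalization grammar with a mapping table).
# Maps spoken Chinese digit characters to Arabic digits.
DIGIT_MAP = {"零": "0", "一": "1", "二": "2", "三": "3", "四": "4",
             "五": "5", "六": "6", "七": "7", "八": "8", "九": "9"}

def normalize_digits(text: str) -> str:
    """Replace each spoken digit character with its Arabic form;
    all other characters pass through unchanged."""
    return "".join(DIGIT_MAP.get(ch, ch) for ch in text)

print(normalize_digits("二零零五年"))  # -> 2005年
```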
  • Embodiments of the invention provide a normalization method and apparatus to overcome the prior-art defect that the normalization result depends entirely on the normalization mapping table, and to implement a fast and flexible normalization process.
  • An embodiment of the present invention provides a normalization method, including:
  • acquiring an input sentence, and invoking pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence; according to the segmentation result, invoking preset normalization grammars corresponding to the different application scenarios and semantically matching them against the segmentation result; and, when the segmentation result is determined to contain a normalization target, querying a preset normalization mapping table to obtain the normalized result of the normalization target.
  • An embodiment of the present invention provides a normalization apparatus, including:
  • a parsing module configured to acquire an input sentence and parse it to obtain the application scenario corresponding to the input sentence;
  • a segmentation module configured to acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence;
  • a matching module configured to invoke, according to the segmentation result, preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result;
  • a query module configured to query a preset normalization mapping table when the segmentation result is determined to contain a normalization target, and to obtain the normalized result of the normalization target.
  • Embodiments of the invention further disclose an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence; according to the segmentation result, invoke preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result; and, when the segmentation result is determined to contain a normalization target, query a preset normalization mapping table to obtain the normalized result of the normalization target.
  • The present invention also discloses a non-volatile computer storage medium, wherein the storage medium stores computer-executable instructions that, when executed by an electronic device, enable the electronic device to: acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence; according to the segmentation result, invoke preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result; and, when the segmentation result is determined to contain a normalization target, query a preset normalization mapping table to obtain the normalized result of the normalization target.
  • Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the method described above.
  • With the normalization method, apparatus, and electronic device provided by the embodiments of the present invention, the application scenario corresponding to the input sentence is determined and the corresponding domain word dictionary is invoked according to that scenario to segment the input sentence, so that the input sentence is normalized according to the preset normalization grammar and the preset normalization mapping table. This removes the prior-art defect that the normalization result depends entirely on the normalization mapping table, achieving fast and flexible normalization.
  • FIG. 1 is a technical flowchart of Embodiment 1 of the present application.
  • FIG. 2a is a diagram showing an example of a normalized scenario according to Embodiment 1 of the present application.
  • FIG. 2b is a diagram showing an example of a normalized syntax tree according to Embodiment 1 of the present application.
  • FIG. 2c is a schematic normalization diagram of Embodiment 1 of the present application.
  • FIG. 3 is a technical flowchart of Embodiment 2 of the present application.
  • FIG. 4 is a diagram showing an example of address information in the second embodiment of the present application.
  • FIG. 5 is a diagram showing an example of a dictionary component of the second embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a device according to Embodiment 3 of the present application.
  • FIG. 7 is a schematic structural diagram of hardware of an electronic device according to an embodiment of the present invention.
  • Unless otherwise expressly specified and limited, the terms "connected" and "coupled" should be understood broadly: the connection may be fixed, detachable, or integral; mechanical or electrical; direct, or indirect through an intermediate medium; it may be internal communication between two components; and it may be wireless or wired.
  • FIG. 1 is a technical flowchart of Embodiment 1 of the present application. With reference to FIG. 1, a normalization method according to an embodiment of the present application can be implemented by the following steps:
  • Step S110: acquiring an input sentence, and invoking pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence;
  • Step S120: according to the segmentation result, invoking preset normalization grammars corresponding to the different application scenarios and semantically matching them against the segmentation result;
  • Step S130: when the segmentation result is determined to contain a normalization target, querying the preset normalization mapping table to obtain the normalized result of the normalization target.
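A minimal sketch of how steps S110 through S130 fit together. The TV-scenario dictionary, mapping table, and greedy longest-match segmenter below are all assumptions for illustration; the patent does not specify the segmentation algorithm.

```python
# Sketch of the three-step flow (S110-S130): segment with a domain word
# dictionary, then replace any token found in the normalization mapping
# table with its normalized form. All data here is hypothetical.
TV_DICT = {"我想", "看", "芒果台"}        # domain word dictionary (TV scenario)
MAPPING_TABLE = {"芒果台": "湖南台"}      # normalization mapping table

def segment(sentence, dictionary):
    """Greedy longest-match segmentation against a domain dictionary;
    unknown characters fall back to single-character tokens."""
    tokens, i = [], 0
    while i < len(sentence):
        for j in range(len(sentence), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                tokens.append(sentence[i:j])
                i = j
                break
    return tokens

def normalize(sentence, dictionary, table):
    """S110: segment; S120/S130: match tokens against the mapping table
    and substitute the normalized result."""
    tokens = segment(sentence, dictionary)
    return "".join(table.get(tok, tok) for tok in tokens)

print(normalize("我想看芒果台", TV_DICT, MAPPING_TABLE))  # -> 我想看湖南台
```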
  • In step S110, the input sentence may be the text obtained by recognizing the user's speech input, or text entered directly by the user.
  • The domain word dictionary corresponding to each application scenario is trained in advance. For a word or a number, each corresponding application scenario has a word dictionary dedicated to that application field, so the input sentence can be segmented according to the grammar rules of the field it belongs to and the normalization-target part can be extracted correctly. Since it is not known in advance which scenario an input sentence belongs to, the sentence is segmented with the domain word dictionaries of the different application scenarios, yielding candidate results for multiple scenarios.
  • For example, in the television field, "芒果台" (Mango TV) is a word that can stand for Hunan TV, so segmenting "我想看芒果台" ("I want to watch Mango TV") with the television-field dictionary may give "我想|看|芒果台", whereas the food-field dictionary, in which "芒果" (mango) is a fruit, may give "我想|看|芒果|台". In the prior art, the same segmentation dictionary is used for every user's speech recognition result, so the segmentation is uncontrollable: if the result is "我想|看|芒果|台", the subsequent semantic parsing and search may return arbitrary programs about the fruit "mango" rather than the Hunan TV channel the user wants.
  • In step S120, the segments obtained in the previous step are semantically matched, in order to find whether the sentence input by the user contains targets that need normalization, such as written forms of numbers and synonyms.
  • During semantic matching, the segmentation result produced by each scenario's domain dictionary is matched against the normalization grammar of that same scenario; this is what guarantees the correctness of the semantic matching.
  • The embodiment of the invention adopts a semantic parsing grammar based on BNF and extends it with key functions such as digit extraction. BNF (Backus-Naur Form) is a grammar that uses formal symbols to describe a given language.
  • In the grammar notation, double quotes enclose literal characters; words outside the double quotes (possibly underlined) belong to the grammar itself; angle brackets < > enclose a mandatory item, i.e., a non-terminal whose syntax must be explained further; and braces { } enclose items that may be repeated from zero to any number of times.
  • A name is written at the beginning of a normalization grammar, indicating the grammar's name; it can also indicate the grammar's type and application scenario.
  • The following part takes the digital application scenario as an example; the name of the corresponding grammar file is "age digital normalization". The contents of the mapping dictionary DigitNormalize.dic are as follows:
  • The &norm function, together with the mapping table MappingTable.dict, is mainly used to extract numbers appearing in any form, as well as variant synonyms, from the input sentence.
  • In step S130, if the semantic matching of the previous step finds that the user's input sentence contains a normalization target, such as a number appearing in any form or a variant synonym, a pre-established normalization mapping table is queried according to the normalization target, and the replacement for the target, i.e., the normalized result, is obtained from the mapping table; the input sentence is then updated, and the updated, normalized result is passed to the next processing step.
  • As for the normalization mapping table in this step, several application scenarios may share one normalization mapping table, or a separate mapping table may be set up for each application scenario; the embodiment of the invention is not limited in this respect.
  • Different application scenarios can be designed for different usage environments. For example, digital application scenarios may include: years, dates, times, currency, phone numbers, fractions, scores, decimals, episode numbers, ages, train numbers, and so on. It should be understood that the above scenarios are listed only as examples and do not limit the embodiments of the present invention.
  • The digital normalization mapping table DigitNormalize.dic is stored in the syntax tree in the form of a hash table, as a hash-type node. The hash table contains key-value pairs: the left column of DigitNormalize.dic is used as the key, and the right column as the value.
  • For example, if the recognized sentence is "播放九十年代的电影" ("play a movie from the 1990s"), then after segmentation each key found in the result is mapped to its value, and the normalized result is finally output, as shown in FIG. 5.
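Illustratively, loading such a two-column key/value file into a hash table might look like the following sketch; the exact file layout and the entries are assumptions, since the text only states that the left side of DigitNormalize.dic is the key and the right side is the value.

```python
# Sketch: load a DigitNormalize.dic-style file where each line holds a
# whitespace-separated key (left) and value (right). In Python a plain
# dict plays the role of the hash-table node. Entries are hypothetical.
def load_mapping(lines):
    """Parse two-column lines into a key -> value hash table."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:          # ignore malformed or empty lines
            key, value = parts
            table[key] = value
    return table

dic_lines = ["九十年代 90年代", "一九九零 1990"]
table = load_mapping(dic_lines)
print(table["一九九零"])  # -> 1990
```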
  • In the embodiment of the present invention, the application scenario corresponding to the input sentence is determined, and the corresponding domain word dictionary is invoked according to that scenario to segment the sentence, so that the input sentence is normalized according to the preset normalization grammar and the preset normalization mapping table. This removes the prior-art defect that the normalization result depends entirely on the normalization mapping table, achieving fast and flexible normalization.
  • Step S310: obtaining a general word dictionary according to a pre-trained language model;
  • Step S320: calculating the average probability value of all words in the general word dictionary;
  • Step S330: obtaining the probability value, in the general word dictionary, of each terminal word of the normalization grammar corresponding to each application scenario;
  • Step S340: in the domain word dictionary of the application scenario, assigning that probability value to the terminal word, to generate the domain word dictionary of the application scenario.
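The four steps above can be sketched as follows; the probability numbers and words are invented for illustration, and the fallback to the average value for unseen terminal words anticipates the updating rule described later in this embodiment.

```python
# Sketch of steps S310-S340: take the terminal words of a scenario's
# normalization grammar, look up each word's probability in the general
# word dictionary, and fall back to the average probability (meanF) for
# words the general dictionary does not contain. Data is hypothetical.
general_dict = {"一": 0.2, "二": 0.15, "年": 0.3}   # word -> probability

def build_domain_dict(terminal_words, general):
    mean_f = sum(general.values()) / len(general)   # S320: average probability
    domain = {}
    for word in terminal_words:
        f = general.get(word, 0.0)                  # S330: probability lookup
        domain[word] = f if f > 0 else mean_f       # S340 + zero-value fallback
    return domain

domain = build_domain_dict(["一", "年", "年代"], general_dict)
print(domain["一"])    # copied from the general dictionary
print(domain["年代"])  # mean probability, since it is unseen
```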
  • Specifically, a general word dictionary is first obtained through the language model. The dictionary format is as shown in FIG. 4 and FIG. 5; the dictionary consists of two parts: an address-information part and a dictionary component.
  • The address-information part stores, for each entry, the address of the phrases beginning with that entry; the entries cover 10 Arabic numerals, 26 uppercase English letters (digits and uppercase letters are in full-width format, each occupying two bytes), and 6768 commonly used Chinese characters. The dictionary component stores the phrases corresponding to each character's address region.
  • Take the full-width character "0" as an example. The address corresponding to "0" is "27216", so in the phrase area the phrases whose first character is "0" start at address "uniDict+27216". A phrase with first character "0" might be "05 mm"; when we need to find words beginning with "0" in the dictionary, we scan down from address "uniDict+27216" until we meet the boundary marker. In this way all phrases are partitioned into regions by their first character, which greatly improves lookup efficiency, and the phrase part does not need to store the first character, saving dictionary space.
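The first-character partitioning described above can be illustrated with an in-memory structure. This sketch uses a dict of lists instead of the patent's byte-offset layout, and half-width characters instead of full-width ones; the phrases are invented.

```python
# Sketch of the first-character index: phrases are grouped by their first
# character so a lookup only scans one region, and the first character
# itself is not stored in the phrase tail (saving space, as described).
from collections import defaultdict

def build_index(phrases):
    index = defaultdict(list)
    for p in phrases:
        # store the tail only; the first character is implied by the bucket
        index[p[0]].append(p[1:])
    return index

index = build_index(["05毫米", "0度", "一年", "一天"])
print(index["0"])  # all phrase tails whose full phrase begins with "0"
```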
  • wordlen: the length of the phrase;
  • reclen = sizeof(reclen) + sizeof(wordlen) + sizeof(buf) + sizeof(frequency);
  • In step S320, the average probability of all words in the general word dictionary is calculated and recorded as meanF.
  • In step S330, for each application scenario, the probability value of each of its terminal words in the general word dictionary is obtained. Taking year-number normalization as an example, the normalization grammar is traversed to obtain all terminal words, namely: 一, 二, 三, 四, 五, 六, 七, 八, 九, 十, 零, and 年 (one through ten, zero, and "year"). Each terminal word is looked up in the general word dictionary; if it is found there, its corresponding probability value is obtained and denoted fi, where i is a non-negative integer indexing the terminal words.
  • In step S340, a domain word dictionary specific to each application scenario is established, and each terminal word of the scenario's normalization grammar is assigned the probability value fi obtained above.
  • For example, if the probability of the terminal word "一" ("one") is 0.2 in the general dictionary, "一" is also assigned the value 0.2 in the domain dictionary.
  • If a terminal word does not appear in the general dictionary, the average probability value meanF calculated in step S320 is assigned to that terminal word in the scenario's domain word dictionary, to update the domain word dictionary of the application scenario.
  • The domain word dictionary thus obtained has the same format as the general word dictionary, but it contains only the terminal words of the current normalization grammar. For example, the segmentation dictionary for the television field contains only the terminal words of the television-field normalization grammar, and the segmentation dictionary for the music field contains only the terminal words of the music-field normalization grammar.
  • A domain word dictionary is generated accordingly for each scenario, so that when the input sentence is segmented, the correct segmentation can be obtained through the corresponding domain dictionary. For example, suppose the word "age" does not exist in the general word dictionary, that is, its probability there is 0; segmenting with the general dictionary would split the word apart, whereas the corresponding domain dictionary keeps it intact.
  • The general word dictionary is updated to generate domain word dictionaries for the different application scenarios, so that an input sentence is segmented with the domain dictionary corresponding to its scenario, further improving the correctness of both the segmentation and the normalized result. At the same time, when an application scenario needs to be added, only the corresponding normalization grammar and normalization mapping-table entries need to be added, and a mapping table can be shared by multiple scenarios, which is flexible to use and easy to maintain.
  • As shown in FIG. 6, a normalization apparatus includes a segmentation module 61, a matching module 62, a query module 63, and a preprocessing module 64.
  • The segmentation module 61 is configured to acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence;
  • the matching module 62 is configured to invoke, according to the segmentation result, preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result;
  • the query module 63 is configured to query a preset normalization mapping table when the segmentation result is determined to contain a normalization target, and to obtain the normalized result of the normalization target.
  • The normalization target includes written forms of numbers and/or synonyms.
  • The apparatus further includes a pre-processing module 64, configured to: classify the application scenarios of the normalization targets, and define a corresponding normalization grammar and a corresponding normalization mapping table for each of the application scenarios.
  • the pre-processing module 64 is further configured to: generate a domain word dictionary of each of the application scenarios according to the normalization syntax and a pre-acquired common word-cut dictionary.
  • The pre-processing module 64 is specifically configured to: obtain a general word dictionary according to a pre-trained language model; calculate the average probability value of all words in the general word dictionary; obtain the probability value, in the general word dictionary, of each terminal word of the normalization grammar corresponding to each application scenario; and, in the domain word dictionary of the application scenario, assign that probability value to the terminal word to generate the domain word dictionary of the application scenario.
  • The pre-processing module 64 is further configured to: if the probability value of a terminal word in the general word dictionary is 0, assign the average probability value to that terminal word in the domain word dictionary of the application scenario, to update the domain word dictionary of the application scenario.
  • the apparatus shown in FIG. 6 can perform the method of the embodiment shown in FIG. 1 to FIG. 5, and the implementation principle and technical effects refer to the embodiment shown in FIG. 1 to FIG. 5, and details are not described herein again.
  • An embodiment of the present invention further discloses an electronic device including at least one processor 810 and a memory 800 communicably connected to the at least one processor 810, wherein the memory 800 stores instructions executable by the at least one processor 810, and the instructions are executed by the at least one processor 810 to enable the at least one processor 810 to: acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence; according to the segmentation result, invoke preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result; and, when the segmentation result is determined to contain a normalization target, query a preset normalization mapping table to obtain the normalized result of the normalization target.
  • The electronic device also includes an input device 830 and an output device 840, which are electrically connected to the memory 800 and the processor 810; the connections are preferably made via a bus.
  • Preferably, the normalization target includes written forms of numbers and/or synonyms.
  • the method further includes: classifying the application scenario of the normalized target, and defining a corresponding normalized grammar according to each of the application scenarios And the corresponding normalized mapping table.
  • Preferably, the method further comprises: generating a domain word dictionary for each of the application scenarios according to the normalization grammar and a pre-acquired general word dictionary.
  • For the electronic device of this embodiment, preferably, generating a domain word dictionary for each of the application scenarios specifically comprises: obtaining a general word dictionary according to a pre-trained language model; calculating the average probability value of all words in the general word dictionary; obtaining the probability value, in the general word dictionary, of each terminal word of the normalization grammar corresponding to each application scenario; and, in the domain word dictionary of the application scenario, assigning that probability value to the terminal word to generate the domain word dictionary of the application scenario.
  • Preferably, the method further includes: if a terminal word has a probability value of 0 in the general word dictionary, assigning the average probability value to that terminal word in the domain word dictionary of the application scenario, to update the domain word dictionary of the application scenario.
  • Embodiments of the present invention also disclose a non-volatile computer storage medium, wherein the storage medium stores computer-executable instructions that, when executed by an electronic device, enable the electronic device to: acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence; according to the segmentation result, invoke preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result; and, when the segmentation result is determined to contain a normalization target, query a preset normalization mapping table to obtain the normalized result of the normalization target.
  • Preferably, the normalization target includes written forms of numbers and/or synonyms.
  • For the storage medium of this embodiment, preferably, before the input sentence is acquired, the method further includes: classifying the application scenarios of the normalization targets, and defining the corresponding normalization grammar and the corresponding normalization mapping table for each of the application scenarios.
  • Preferably, the method further comprises: generating a domain word dictionary for each of the application scenarios according to the normalization grammar and a pre-acquired general word dictionary.
  • For the storage medium of this embodiment, preferably, generating a domain word dictionary for each of the application scenarios specifically comprises: obtaining a general word dictionary according to a pre-trained language model; calculating the average probability value of all words in the general word dictionary; obtaining the probability value, in the general word dictionary, of each terminal word of the normalization grammar corresponding to each application scenario; and, in the domain word dictionary of the application scenario, assigning that probability value to the terminal word to generate the domain word dictionary of the application scenario.
  • Preferably, the method further includes: if a terminal word has a probability value of 0 in the general word dictionary, assigning the average probability value to that terminal word in the domain word dictionary of the application scenario, to update the domain word dictionary of the application scenario.
  • Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the methods described in the above embodiments.
  • Embodiments of the present invention can be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A normalization method, apparatus, and electronic device. An input sentence is acquired, and pre-generated domain word dictionaries corresponding to different application scenarios are invoked to segment the input sentence (110); according to the segmentation result, preset normalization grammars corresponding to the different application scenarios are invoked and semantically matched against the segmentation result (120); when the segmentation result is determined to contain a normalization target, a preset normalization mapping table is queried to obtain the normalized result of the normalization target (130). Fast and flexible normalization is thereby achieved.

Description

Normalization method, apparatus, and electronic device
Cross-reference
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on March 30, 2016, with application number 201610193023.8 and invention title "Normalization method and apparatus", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present invention relate to the field of natural language processing, and in particular to a normalization method, apparatus, and electronic device.
Background
When parsing a speech recognition result, the numeric parts of the result may include Arabic numerals, Chinese numeral characters, decimals, fractions, and so on. The recognized form is difficult to control and is often not the result we want, so the recognition result needs to be normalized, both so that it displays well and so that subsequent semantic parsing is convenient. For example, the recognition result "二零零五年" (the year spoken digit by digit) is normalized to "2005年", and "十二点一刻" (a quarter past twelve) is normalized to "12:15".
In addition, many variant synonyms express the same meaning, yet speech recognition often produces results that do not match the user's intent. For example, when a user searches for programs through a television with speech recognition and says "我想看芒果台" ("I want to watch Mango TV"), the television's speech recognition device may not have the TV-channel keyword "芒果台" (Mango TV) stored in advance, so the recognition result may be wrong and may return many programs related to the fruit "mango". It is therefore necessary to further normalize variant synonyms before recognition, for example normalizing "芒果台" (Mango TV) to "湖南台" (Hunan TV), so that whether the user says "芒果台" or "湖南台", the user's intent can be recognized accurately and the corresponding service provided.
Current mainstream normalization schemes simply map the target to be normalized, so the normalization result depends entirely on the contents of the normalization mapping table. This is very inflexible, requires manual maintenance, and the brute-force results are rigid and error-prone.
An improved normalization method is therefore urgently needed.
Summary of the invention
Embodiments of the present invention provide a normalization method and apparatus to overcome the prior-art defect that the normalization result depends entirely on the normalization mapping table, and to achieve fast and flexible normalization.
An embodiment of the present invention provides a normalization method, including:
acquiring an input sentence, and invoking pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence;
according to the segmentation result, invoking preset normalization grammars corresponding to the different application scenarios and semantically matching them against the segmentation result;
when the segmentation result is determined to contain a normalization target, querying a preset normalization mapping table to obtain the normalized result of the normalization target.
An embodiment of the present invention provides a normalization apparatus, including:
a parsing module configured to acquire an input sentence and parse it to obtain the application scenario corresponding to the input sentence;
a segmentation module configured to acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence;
a matching module configured to invoke, according to the segmentation result, preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result;
a query module configured to query a preset normalization mapping table when the segmentation result is determined to contain a normalization target, and to obtain the normalized result of the normalization target.
An embodiment of the present invention further discloses an electronic device, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can: acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence; according to the segmentation result, invoke preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result; and, when the segmentation result is determined to contain a normalization target, query a preset normalization mapping table to obtain the normalized result of the normalization target.
The present invention further discloses a non-volatile computer storage medium storing computer-executable instructions that, when executed by an electronic device, enable the electronic device to: acquire an input sentence and invoke pre-generated domain word dictionaries corresponding to different application scenarios to segment the input sentence; according to the segmentation result, invoke preset normalization grammars corresponding to the different application scenarios and semantically match them against the segmentation result; and, when the segmentation result is determined to contain a normalization target, query a preset normalization mapping table to obtain the normalized result of the normalization target.
An embodiment of the present invention further provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the above method.
With the normalization method, apparatus, and electronic device provided by the embodiments of the present invention, the application scenario corresponding to the input sentence is determined and the corresponding domain word dictionary is invoked according to that scenario to segment the input sentence, so that the input sentence is normalized according to the preset normalization grammar and the preset normalization mapping table. This removes the prior-art defect that the normalization result depends entirely on the normalization mapping table, achieving fast and flexible normalization.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a technical flowchart of Embodiment 1 of the present application;
FIG. 2a is an example diagram of a normalization scenario in Embodiment 1 of the present application;
FIG. 2b is an example diagram of a normalization syntax tree in Embodiment 1 of the present application;
FIG. 2c is an example normalization diagram of Embodiment 1 of the present application;
FIG. 3 is a technical flowchart of Embodiment 2 of the present application;
FIG. 4 is an example diagram of the address-information part in Embodiment 2 of the present application;
FIG. 5 is an example diagram of the dictionary component in Embodiment 2 of the present application;
FIG. 6 is a schematic structural diagram of the apparatus of Embodiment 3 of the present application;
FIG. 7 is a schematic diagram of the hardware structure of an electronic device in an embodiment of the present invention.
具体实施方式
下面将结合附图对本发明的技术方案进行清楚、完整地描述,显然,所描述的实施例是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
在本发明的描述中,需要说明的是,术语“中心”、“上”、“下”、“左”、“右”、“竖直”、“水平”、“内”、“外”等指示的方位或位置关系为基于附图所示的方位或位置关系,仅是为了便于描述本发明和简化描述,而不是指示或暗示所指的装置或元件必须具有特定的方位、以特定的方位构造和操作,因此不能理解为对本发明的限制。此外,术语“第一”、“第二”、“第三”仅用于描述目的,而不能理解为指示或暗示相对重要性。
在本发明的描述中,需要说明的是,除非另有明确的规定和限定,术语“安装”、“相连”、“连接”应做广义理解,例如,可以是固定连接,也可以是可拆卸连接,或一体地连接;可以是机械连接,也可以是电连接;可以是直接相连,也可以通过中间媒介间接相连,还可以是两个元件内部的连通,可以是无线连接,也可以是有线连接。对于本领域的普通技术人员而言,可以具体情况理解上述术语在本发明中的具体含义。
此外,下面所描述的本发明不同实施方式中所涉及的技术特征只要彼此之间未构成冲突就可以相互结合。
图1是本申请实施例一的技术流程图,结合图1,本申请实施例一种归一化方法,可由如下的步骤实现:
步骤S110:获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;
步骤S120:根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;
步骤S130:当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。
具体的,在步骤S110中,所述输入语句可以是用户语音输入经识别得到的文字结果,也可以是用户直接输入的文字。各应用场景相应的领域切词词典是预先训练得到的:对于一个词或者一个数字,其对应的每一种应用场景都有一个专属于该应用领域的切词词典,从而输入语句能够按照其所处领域的语法规则进行切词,进而正确提取归一化目标部分。本步骤中,对于一条输入语句而言,无法预先得知其中的归一化目标属于哪一应用场景,因此需要分别采用不同应用场景的领域切词词典对输入语句进行切词,从而得到多种应用场景下的切词结果。
例如,在电视应用领域,“芒果台”是一个能够代表湖南电视台的词,在用电视应用领域的领域切词词典对“我想看芒果台”进行切词的时候,可能得到的结果是“我想|看|芒果台”;而在食物领域,“芒果”就是一种水果,所以,在用食物领域的领域切词词典对“我想看芒果台”进行切词的时候,可能得到的结果是“我想|看|芒果|台”。
根据现有技术中的做法,不同应用领域的字词共同采用同一个切词词典对用户语音识别的结果进行切词,切词结果是不可控的。若切词结果为“我想|看|芒果|台”,那么相应的语义解析以及搜索结果可能并不是用户想要的湖南台,而可能是与“芒果”这一水果有关的任意节目。因此,本发明实施例将通用的切词词典按照用户输入语句的应用场景进行分类,从而得到每个应用场景分类对应的领域切词词典,切词结果可控且语义匹配的正确率更高。
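领域切词词典的作用可以用如下Python示意代码说明(词典内容与正向最大匹配的切词策略均为假设,仅用于演示不同领域词典导致不同切词结果,并非本实施例的实际实现):

```python
# 示意:按应用场景选择领域切词词典,并用正向最大匹配进行切词。
# 词典内容为假设示例;实际的领域切词词典由语言模型训练并按场景生成。
DOMAIN_DICTS = {
    "tv":   {"我想", "看", "芒果台", "湖南台"},  # 电视应用场景(假设)
    "food": {"我想", "看", "芒果", "台"},        # 食物应用场景(假设)
}

def segment(sentence, domain):
    """正向最大匹配:每次从当前位置取词典中能匹配到的最长词,匹配不到则切出单字。"""
    words = DOMAIN_DICTS[domain]
    max_len = max(len(w) for w in words)
    result, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + l]
            if l == 1 or piece in words:  # 单字兜底,保证循环前进
                result.append(piece)
                i += l
                break
    return result

print(segment("我想看芒果台", "tv"))    # ['我想', '看', '芒果台']
print(segment("我想看芒果台", "food"))  # ['我想', '看', '芒果', '台']
```

可以看到,同一输入在不同领域词典下得到不同的切词结果,与正文中“我想|看|芒果台”和“我想|看|芒果|台”的例子一致。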
具体的,在步骤S120中,对上一步骤所切分得到的结果进行语义匹配,其目的在于,寻找用户输入的语句中是否包含需要进行归一化的目标,例如数字的书写以及同义词。
本步骤中,在进行语义匹配时,针对所述不同应用场景的领域词典的切词结果,采用对应的归一化语法进行语义匹配。由此,才能够保证语义匹配的正确性。
例如,“我想|看|芒果台”是用电视应用场景的领域切词词典进行切词得到的结果,需要用电视应用场景的归一化语法进行语义匹配,从而能够得到正确的匹配结果,即“我想看湖南台”。而“我想|看|芒果|台”是用食物应用领域的领域切词词典进行切词得到的结果,若用食物场景的归一化语法进行语义匹配,则“看”和“台”在食物应用领域的归一化语法中是匹配不上的,因此,在这个应用场景下,用户的输入没有匹配结果。
本发明实施例采用基于BNF语法的语义解析语法,并在其基础上进行了扩展,增加了数字提取等关键函数。BNF(Backus-Naur Form),即巴科斯范式,是一种用形式化符号描述给定语言语法的表示方法。
现有的BNF语法有如下规则:
在双引号中的字("word")代表着这些字符本身。而double_quote用来代表双引号。
在双引号外的字(有可能有下划线)代表着语法部分。
<>:内包含的为必选项,是语法必须进一步解释的非终结节点;
[]:内包含的为可选项,表示其内容可以跳过;
|:表示在其左右两边任选一项,相当于"或"的意思;
():表示组合;
{}:内包含的为可重复0至无数次的项;
::=是“被定义为”的意思。
本发明实施例中采用的语法规则在BNF的基础之上进行了扩展,具体加了如下规则:
#:表示注释;
“:”:非终结节点与其解释之间的分隔符;
;:表示语法中语句的结束;
“”:表示引用外部词典文件;
&root(<name>):写在归一化语法的开始部分,表示该归一化语法的名字为name;
&norm(“MappingTable.dict”):是归一化方法最重要的函数,它用来提取输入文本的归一化目标部分,并查找映射表MappingTable.dict,从而以归一化的结果对归一化目标进行替换。
在本发明实施例中,&root(<name>)中,name写在归一化语法的开始部分,表示所述归一化语法名字,也能够表示所述归一化语法的种类以及应用场景。
以下部分以数字的年代应用场景为例,按照语法规则书写好如下语法文件,对应的语法文件的名字是“年代数字归一化”:
&root(<年代数字归一化>);
<年代数字归一化>:<一到九><零>年代;
<一到九>:&norm(“DigitNormalize.dic”);
<零>:&norm(“DigitNormalize.dic”);
其中DigitNormalize.dic内容如下:
一=1
二=2
三=3
四=4
五=5
六=6
七=7
八=8
九=9
十=0
零=0
映射成哈希表,得到:
key=一 value=1
key=二 value=2
key=三 value=3
key=四 value=4
key=五 value=5
key=六 value=6
key=七 value=7
key=八 value=8
key=九 value=9
key=十 value=0
key=零 value=0
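上述由词典文件到哈希表的映射过程,可以用如下Python示意代码理解(文件内容直接写成字符串,仅为示意,并非本实施例的实际实现):

```python
# 示意:将 DigitNormalize.dic 形式的映射表解析为哈希表(Python dict),
# 等号左边作为key,右边作为value,匹配成功时将key映射成value。
DIGIT_DIC = """一=1
二=2
三=3
四=4
五=5
六=6
七=7
八=8
九=9
十=0
零=0"""

def load_mapping(text):
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # 跳过空行与格式不合法的行
        key, value = line.split("=", 1)
        table[key] = value
    return table

mapping = load_mapping(DIGIT_DIC)
print(mapping["九"], mapping["十"])  # 9 0
```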
在本发明实施例中,&norm(“MappingTable.dict”)函数主要用于提取输入语句中以任何形式出现的数字以及异构异形的同义词。
具体的,在步骤S130中,若在上一步的语义匹配中,发现用户的所述输入语句中包含以任何形式出现的数字以及异构异形的同义词等归一化目标,则根据这些归一化目标查询预先建立的归一化映射表,并从所述归一化映射表中获取所述归一化目标的替代项,即归一化结果,从而对所述输入语句进行更新,将更新后的归一化结果送至下一步操作。
需要说明的是,本步骤中的所述归一化映射表,可以是多个所述应用场景共用一个归一化映射表,也可以是每个所述应用场景单独设置一个归一化映射表,本发明实施例并不限制于此。所述归一化映射表中,针对不同的使用环境,可以设计不同的应用场景,如图2a所示,数字的应用场景可以包括:年份、年代、月份、日、时间、货币、电话号码、比分、分数、小数、剧集、年龄、车次、星期等。当然,应当理解,上述数字的应用场景仅供举例使用,对本发明实施例并不构成限制。
为方便语义匹配,需要将所有归一化语法编译成语法树,最终输出一个归一化语法森林。上述“年代数字归一化”语法编译成的语法树如图2b所示:
其中数字归一化映射表DigitNormalize.dic以哈希表的形式存放在语法树中,作为哈希类型的节点。哈希表中含有键和值,也就是key和value,DigitNormalize.dic中等号左边的作为key,右边作为value,匹配成功时将识别结果中的key映射成value。
比如识别语句为“播放九十年代的电影”,切词结果为“播|放|九|十|年代的|电影”,将这一待匹配语句与所有数字归一化语法树进行匹配,提取数字部分,将key映射为value,最终输出归一化结果,具体如图2c所示。
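上述key到value的替换过程可用如下Python示意代码表达(切词结果直接给出,映射表沿用前文DigitNormalize.dic的内容,仅为示意):

```python
# 示意:对切词结果中出现在映射表里的词做key→value替换,其余词原样保留。
MAPPING = {"一": "1", "二": "2", "三": "3", "四": "4", "五": "5",
           "六": "6", "七": "7", "八": "8", "九": "9", "十": "0", "零": "0"}

def normalize_tokens(tokens):
    """将切词结果中匹配映射表的词替换为归一化结果,再拼接回完整语句。"""
    return "".join(MAPPING.get(t, t) for t in tokens)

# 对应切词结果“播|放|九|十|年代的|电影”
tokens = ["播", "放", "九", "十", "年代的", "电影"]
print(normalize_tokens(tokens))  # 播放90年代的电影
```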
本实施例中,通过判断输入语句对应的应用场景并根据所述应用场景调用相应的领域切词词典对所述输入语句进行切词,从而根据预设的归一化语法以及预设的归一化映射表对所述输入语句进行归一化处理,改变了现有技术中进行归一化处理时,归一化结果完全依赖归一化映射表式的缺陷,实现快速而灵活的归一化处理。
图3是本申请实施例二的技术流程图,结合图3,本申请实施例一种归一化方法中,生成每一应用场景的所述领域切词词典的过程可进一步由以下步骤实现:
步骤S310:根据预先训练得到的语言模型获取通用的切词词典;
步骤S320:计算所述通用的切词词典中所有词的平均概率值;
步骤S330:获取每一所述应用场景对应的所述归一化语法的终结词在所述通用的切词词典中的概率值;
步骤S340:在所述应用场景的领域切词词典中,以所述概率值为所述终结词赋值从而生成所述应用场景的领域切词词典。
具体的,步骤S310中,首先通过语言模型得到一个通用的切词词典,词典格式如图4以及图5所示,词典包含两部分,地址信息部分和词典组成部分。
其中,地址信息部分包含10个阿拉伯数字、26个大写英文字母(数字和大写字母都用全角格式,占用两个字节)和6768个常用汉字所对应的词组的地址信息,每个字相应的地址用4个字节保存,并且按汉字GB2312的编码顺序排列,所以地址部分占用大小为:(10+26+6768)*4=27216字节。因此,如果词典的首地址为uniDict,那么词组区域首地址:uniDict+27216。
其中,词典组成部分存储的是地址区域对应的汉字的词组,比如以全角字符“0”为例,在地址区域,“0”对应的地址是“27216”,所以在词组区域,“0”对应的词组区域的地址为“uniDict+27216”,可以看到,以“0”为首字的组词可以为:“05毫米”,当我们需要在字典里查找以“0”为首字的词时,从地址“uniDict+27216”开始向下查找即可,直到遇到边界guard标记。如此,所有组词按首字划分区域,可以大大提高字典的查找效率,并且词组部分不需要存储首字,从而节省了字典的空间。
词典中包含的每个参数含义如下:
wordlen:词组的长度;
buf:去掉首字的词组内容,sizeof(buf)=wordlen-2字节;
frequency:由一元模型概率转换得到的词频,sizeof(frequency)=2字节;
reclen:存储一个词占用的总空间,sizeof(reclen)=1字节,
reclen=sizeof(reclen)+sizeof(wordlen)+sizeof(buf)+sizeof(frequency);
guard:代表每个分区的结束,sizeof(guard)=1字节。
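上述词条记录的字节布局可以用如下Python示意代码打包验证(其中wordlen占用的字节数原文未明确,这里假设为1字节;编码采用GB2312,均为示意性假设):

```python
import struct

# 示意:按正文参数定义打包一条词条记录:
#   reclen(1字节) + wordlen(假设1字节) + buf(wordlen-2字节) + frequency(2字节)。
def pack_record(word, frequency):
    body = word[1:].encode("gb2312")       # buf:去掉首字后的词组内容
    wordlen = len(word.encode("gb2312"))   # 词组总长度(GB2312下每个汉字占2字节)
    # reclen = sizeof(reclen)+sizeof(wordlen)+sizeof(buf)+sizeof(frequency)
    reclen = 1 + 1 + len(body) + 2
    return struct.pack(f"<BB{len(body)}sH", reclen, wordlen, body, frequency)

rec = pack_record("年代", 120)
print(len(rec))  # 6 = 1 + 1 + 2(“代”的GB2312编码) + 2
```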
具体的,在步骤S320中,计算所述通用的切词词典中所有词的平均概率,记为meanF。
具体的,在步骤S330中,针对每一种应用场景,计算其对应的归一化语法中,每一个终结词在所述通用的切词词典中的概率值。
以上一实施例中的归一化语法“年代数字归一化”为例,遍历所述归一化语法,得到所有终结词,即:一、二、三、四、五、六、七、八、九、十、零、年代。对于每个终结词,都去所述通用的切词词典中查找,如果在所述通用的切词词典中找到,则获取其对应的概率值,记为fi,其中i为终结词的序号,是大于等于0的整数。
具体的,在步骤S340中,建立每个所述应用场景专属的领域切词词典,并将每一种应用场景对应的归一化语法中的终结词的概率值赋值为fi。
例如,在所述通用的切词词典中,终结词“一”的概率是0.2,那么在新建立的年代领域切词词典中,也以0.2给“一”赋值。
需要说明的是,若是某一归一化语法中的终结词在所述通用的切词词典中概率值为0,则以步骤S320中计算出的所述平均概率值meanF在所述应用场景的领域切词词典中为所述终结词赋值从而更新所述应用场景的领域切词词典。
由此,得到格式与通用的切词词典相同的领域切词词典,但它只包含当前归一化语法中的终结词。例如,电视领域的切词词典中只包含电视领域的归一化语法的终结词,音乐领域的切词词典中只包含音乐领域的归一化语法的终结词。
对于每个应用领域的归一化语法,都相应的生成领域切词词典。由此,在对输入语句进行切词的时候通过相应的所述领域切词词典可以得到正确想要的切词结果。比如假设“年代”一词在通用的切词词典中是不存在的,也就是说它的概率是0,按照通用的切词词典的切词方法会被切碎成“年|代”,但是如果用领域切词词典,由于其概率赋值为meanF,就不会被切碎。
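上述为终结词赋概率值的过程可用如下Python示意代码表达(通用词典内容与终结词列表均为假设数据,仅为示意):

```python
# 示意:为某应用场景归一化语法的终结词生成领域切词词典。
# 终结词在通用词典中能查到概率则沿用该概率fi;查不到或概率为0则赋平均概率meanF。
general_dict = {"一": 0.2, "二": 0.15, "年": 0.3, "代": 0.25}  # 假设的通用切词词典

mean_f = sum(general_dict.values()) / len(general_dict)  # 所有词的平均概率值meanF

terminals = ["一", "二", "年代"]  # “年代数字归一化”语法的部分终结词(示意)

# “年代”未被通用词典收录(概率视为0),因此赋值为meanF,避免被切碎成“年|代”
domain_dict = {t: (general_dict.get(t, 0) or mean_f) for t in terminals}
print(domain_dict)
```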
本实施例中,通过将通用的切词词典更新生成不同应用场景的领域切词词典,以使得在判定输入语句的应用场景之后采用相应的领域切词词典对所述输入语句进行切词,进一步提高了切词以及归一化结果的正确性;与此同时,需要增加应用场景时,增加相应的归一化语法和归一化映射表即可达到目的,并且多个场景可以共用映射表,使用灵活,方便维护。
图6是本申请实施例三的装置结构示意图,结合图6,本申请一种归一化装置,包括切词模块61、匹配模块62、查询模块63以及预处理模块64。
所述切词模块61,用于获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;
所述匹配模块62,用于根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;
所述查询模块63,用于当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。
其中,所述归一化目标包括数字书写和/或同义词。
其中,所述装置还包括预处理模块64,所述预处理模块64用于:对所述归一化目标的应用场景进行分类,并根据每一所述应用场景定义相应的归一化语法以及相应的归一化映射表。
其中,所述预处理模块64还用于:根据所述归一化语法以及预先获取的通用的切词词典生成每一所述应用场景的领域切词词典。
其中,所述预处理模块64具体用于:根据预先训练得到的语言模型获取通用的切词词典;计算所述通用的切词词典中所有词的平均概率值;获取每一所述应用场景对应的所述归一化语法的终结词在所述通用的切词词典中的概率值;在所述应用场景的领域切词词典中,以所述概率值为所述终结词赋值从而生成所述应用场景的领域切词词典。
其中,所述预处理模块64还用于,若所述终结词在所述通用的切词词典中概率值为0,则以所述平均概率值在所述应用场景的领域切词词典中为所述终结词赋值从而更新所述应用场景的领域切词词典。
图6所示装置可以执行图1~图5所示实施例的方法,实现原理和技术效果参考图1~图5所示实施例,不再赘述。
如图7所示,本发明实施例又公开了一种电子设备,包括至少一个处理器810;以及,与所述至少一个处理器810通信连接的存储器800;其中,所述存储器800存储有可被所述至少一个处理器810执行的指令,所述指令被所述至少一个处理器810执行,以使所述至少一个处理器810能够获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。所述电子设备还包括与所述存储器800和所述处理器电连接的输入装置830和输出装置840,所述电连接优选为通过总线连接。
本实施例的电子设备,优选地,所述归一化目标包括数字书写和/或同义词。
本实施例的电子设备,优选地,在获取输入语句之前,所述方法还包括:对所述归一化目标的应用场景进行分类,并根据每一所述应用场景定义相应的归一化语法以及相应的归一化映射表。
本实施例的电子设备,优选地,所述方法还包括:根据所述归一化语法以及预先获取的通用的切词词典生成每一所述应用场景的领域切词词典。
本实施例的电子设备,优选地,所述方法,生成每一所述应用场景的领域切词词典,具体包括:根据预先训练得到的语言模型获取通用的切词词典;计算所述通用的切词词典中所有词的平均概率值;获取每一所述应用场景对应的所述归一化语法的终结词在所述通用的切词词典中的概率值;在所述应用场景的领域切词词典中,以所述概率值为所述终结词赋值从而生成所述应用场景的领域切词词典。
本实施例的电子设备,优选地,所述方法还包括,若所述终结词在所述通用的切词词典中概率值为0,则以所述平均概率值在所述应用场景的领域切词词典中为所述终结词赋值从而更新所述应用场景的领域切词词典。
本发明实施例还公开了一种非易失性计算机存储介质,其中,所述存储介质存储有计算机可执行指令,所述计算机可执行指令当由电子设备执行时使得电子设备能够:获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。
本实施例的存储介质,优选地,所述归一化目标包括数字书写和/或同义词。
本实施例的存储介质,优选地,在获取输入语句之前,所述方法还包括:对所述归一化目标的应用场景进行分类,并根据每一所述应用场景定义 相应的归一化语法以及相应的归一化映射表。
本实施例的存储介质,优选地,所述方法还包括:根据所述归一化语法以及预先获取的通用的切词词典生成每一所述应用场景的领域切词词典。
本实施例的存储介质,优选地,所述方法,生成每一所述应用场景的领域切词词典,具体包括:根据预先训练得到的语言模型获取通用的切词词典;计算所述通用的切词词典中所有词的平均概率值;获取每一所述应用场景对应的所述归一化语法的终结词在所述通用的切词词典中的概率值;在所述应用场景的领域切词词典中,以所述概率值为所述终结词赋值从而生成所述应用场景的领域切词词典。
本实施例的存储介质,优选地,所述方法还包括,若所述终结词在所述通用的切词词典中概率值为0,则以所述平均概率值在所述应用场景的领域切词词典中为所述终结词赋值从而更新所述应用场景的领域切词词典。
本发明实施例还提供了一种计算机程序产品,所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,使所述计算机执行上述实施例所述的方法。
本领域内的技术人员应明白,本发明的实施例可提供为方法、系统、或计算机程序产品。因此,本发明可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本发明可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本发明是参照根据本发明实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
显然,上述实施例仅仅是为清楚地说明所作的举例,而并非对实施方式的限定。对于所属领域的普通技术人员来说,在上述说明的基础上还可以做出其它不同形式的变化或变动。这里无需也无法对所有的实施方式予以穷举。而由此所引伸出的显而易见的变化或变动仍处于本发明创造的保护范围之中。

Claims (15)

  1. 一种归一化方法,其特征在于,包括如下的步骤:
    获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;
    根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;
    当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。
  2. 根据权利要求1所述的方法,其特征在于,所述归一化目标包括数字书写和/或同义词。
  3. 根据权利要求2所述的方法,其特征在于,在获取输入语句之前,所述方法还包括:
    对所述归一化目标的应用场景进行分类,并根据每一所述应用场景定义相应的归一化语法以及相应的归一化映射表。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    根据所述归一化语法以及预先获取的通用的切词词典生成每一所述应用场景的领域切词词典。
  5. 根据权利要求4所述的方法,其特征在于,所述方法,生成每一所述应用场景的领域切词词典,具体包括:
    根据预先训练得到的语言模型获取通用的切词词典;
    计算所述通用的切词词典中所有词的平均概率值;
    获取每一所述应用场景对应的所述归一化语法的终结词在所述通用的切词词典中的概率值;
    在所述应用场景的领域切词词典中,以所述概率值为所述终结词赋值从而生成所述应用场景的领域切词词典。
  6. 根据权利要求5所述的方法,其特征在于,所述方法还包括,
    若所述终结词在所述通用的切词词典中概率值为0,则以所述平均概率值在所述应用场景的领域切词词典中为所述终结词赋值从而更新所述应用场景的领域切词词典。
  7. 一种归一化装置,其特征在于,包括如下的模块:
    切词模块,用于获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;
    匹配模块,用于根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;
    查询模块,用于当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。
  8. 根据权利要求7所述的装置,其特征在于,所述归一化目标包括数字书写和/或同义词。
  9. 根据权利要求8所述的装置,其特征在于,所述装置还包括预处理模块,所述预处理模块用于:
    对所述归一化目标的应用场景进行分类,并根据每一所述应用场景定义相应的归一化语法以及相应的归一化映射表。
  10. 根据权利要求9所述的装置,其特征在于,所述预处理模块还用于:
    根据所述归一化语法以及预先获取的通用的切词词典生成每一所述应用场景的领域切词词典。
  11. 根据权利要求10所述的装置,其特征在于,所述预处理模块具体用于:
    根据预先训练得到的语言模型获取通用的切词词典;
    计算所述通用的切词词典中所有词的平均概率值;
    获取每一所述应用场景对应的所述归一化语法的终结词在所述通用的切词词典中的概率值;
    在所述应用场景的领域切词词典中,以所述概率值为所述终结词赋值从而生成所述应用场景的领域切词词典。
  12. 根据权利要求11所述的装置,其特征在于,所述预处理模块还用于,
    若所述终结词在所述通用的切词词典中概率值为0,则以所述平均概率值在所述应用场景的领域切词词典中为所述终结词赋值从而更新所述应用场景的领域切词词典。
  13. 一种电子设备,其特征在于包括至少一个处理器;以及,与所述至少一个处理器通信连接的存储器;其中,所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够
    获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;
    根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;
    当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。
  14. 一种非易失性计算机存储介质,其特征在于:所述存储介质存储有计算机可执行指令,所述计算机可执行指令当由电子设备执行时使得电子设备能够:
    获取输入语句,调用预先生成的、不同应用场景对应的领域切词词典对所述输入语句进行切词;
    根据所述切词的结果,调用预先设置的、所述不同应用场景对应的归一化语法,与所述切词的结果进行语义匹配;
    当判定所述切词的结果中包含归一化目标,则查询预设的归一化映射表,获取所述归一化目标的归一化结果。
  15. [根据细则26改正29.09.2016] 
    一种计算机程序产品,所述计算机程序产品包括存储在非暂态计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,其特征在于,当所述程序指令被计算机执行时,使所述计算机执行权利要求1-6所述的方法。
PCT/CN2016/096673 2016-03-30 2016-08-25 归一化方法、装置和电子设备 WO2017166626A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610193023.8A CN105843797A (zh) 2016-03-30 2016-03-30 归一化方法及装置
CN201610193023.8 2016-03-30

Publications (1)

Publication Number Publication Date
WO2017166626A1 true WO2017166626A1 (zh) 2017-10-05

Family

ID=56584311

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/096673 WO2017166626A1 (zh) 2016-03-30 2016-08-25 归一化方法、装置和电子设备

Country Status (2)

Country Link
CN (1) CN105843797A (zh)
WO (1) WO2017166626A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859905A (zh) * 2019-04-03 2020-10-30 北京嘀嘀无限科技发展有限公司 一种数据确定方法、装置、电子设备和计算机存储介质
CN112820295A (zh) * 2020-12-29 2021-05-18 华人运通(上海)云计算科技有限公司 语音处理装置和系统以及云端服务器和车辆
CN115826991A (zh) * 2023-02-14 2023-03-21 江西曼荼罗软件有限公司 软件脚本生成方法、系统、计算机及可读存储介质

Families Citing this family (4)

Publication number Priority date Publication date Assignee Title
CN105843797A (zh) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 归一化方法及装置
CN107590124B (zh) * 2017-09-06 2020-12-04 耀灵人工智能(浙江)有限公司 按场景对同义词替换并根据按场景归类的标准词组比对的方法
CN109841210B (zh) * 2017-11-27 2024-02-20 西安中兴新软件有限责任公司 一种智能操控实现方法及装置、计算机可读存储介质
CN108961396A (zh) * 2018-07-03 2018-12-07 百度在线网络技术(北京)有限公司 三维场景的生成方法、装置及终端设备

Citations (7)

Publication number Priority date Publication date Assignee Title
CN101097573A (zh) * 2006-06-28 2008-01-02 腾讯科技(深圳)有限公司 一种自动问答系统及方法
US7930181B1 (en) * 2002-09-18 2011-04-19 At&T Intellectual Property Ii, L.P. Low latency real-time speech transcription
CN102646100A (zh) * 2011-02-21 2012-08-22 腾讯科技(深圳)有限公司 领域词获取方法及系统
CN103730129A (zh) * 2013-11-18 2014-04-16 长江大学 一种用于数据库信息查询的语音查询系统
CN104699809A (zh) * 2015-03-20 2015-06-10 广东睿江科技有限公司 一种优选词库的控制方法及装置
CN104965853A (zh) * 2015-05-11 2015-10-07 腾讯科技(深圳)有限公司 聚合类应用的推荐、多方推荐源聚合的方法、系统和装置
CN105843797A (zh) * 2016-03-30 2016-08-10 乐视控股(北京)有限公司 归一化方法及装置

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN101193069A (zh) * 2006-12-13 2008-06-04 腾讯科技(深圳)有限公司 信息查询系统、即时通信机器人服务器及信息查询方法
CN102955772B (zh) * 2011-08-17 2015-11-25 北京百度网讯科技有限公司 一种基于语义的相似度计算方法和装置
CN102955833B (zh) * 2011-08-31 2015-11-25 深圳市华傲数据技术有限公司 一种通讯地址识别、标准化的方法
CN105224622A (zh) * 2015-09-22 2016-01-06 中国搜索信息科技股份有限公司 面向互联网的地名地址提取与标准化方法



Also Published As

Publication number Publication date
CN105843797A (zh) 2016-08-10


Legal Events

Date Code Title Description
NENP: Non-entry into the national phase; Ref country code: DE
121: Ep: the epo has been informed by wipo that ep was designated in this application; Ref document number: 16896396; Country of ref document: EP; Kind code of ref document: A1
122: Ep: pct application non-entry in european phase; Ref document number: 16896396; Country of ref document: EP; Kind code of ref document: A1