WO2017166649A1 - Voice signal processing method and apparatus - Google Patents

Voice signal processing method and apparatus

Info

Publication number
WO2017166649A1
WO2017166649A1 (PCT/CN2016/096984)
Authority
WO
WIPO (PCT)
Prior art keywords
parsing result
parsing
voice signal
fixed sentence
text data
Prior art date
Application number
PCT/CN2016/096984
Other languages
English (en)
French (fr)
Inventor
王育军
Original Assignee
乐视控股(北京)有限公司
乐视致新电子科技(天津)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 乐视控股(北京)有限公司 and 乐视致新电子科技(天津)有限公司
Publication of WO2017166649A1 publication Critical patent/WO2017166649A1/zh

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/08 - Speech classification or search
    • G10L 15/18 - Speech classification or search using natural language modelling
    • G10L 15/183 - Speech classification or search using natural language modelling using context dependencies, e.g. language models
    • G10L 15/187 - Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/28 - Constructional details of speech recognition systems
    • G10L 15/30 - Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the present invention relates to the field of voice recognition technology, and in particular, to a voice signal processing method and apparatus.
  • With the development of speech recognition technology, more and more applications are based on speech recognition, such as voice dialing, voice navigation, voice playback control, and voice information retrieval. In speech recognition-based applications, it is necessary to perform semantic analysis on the speech signal, extract the user intent expressed by the speech signal, and convert it into a structured data format that the machine can understand.
  • The prior art mainly performs semantic analysis on the speech signal by matching a speech-recognized string against preset semantic parsing templates. This method requires a sufficiently large number of semantic parsing templates, but in practice the number of templates is limited while the expressions of speech signals are varied, so precise matches often cannot be found, and the semantics of the speech signal cannot be parsed accurately.
  • The invention provides a speech signal processing method and device, which perform semantic analysis on a speech signal and improve the accuracy of semantic parsing.
  • the embodiment of the invention provides a voice signal processing method, including:
  • An embodiment of the present invention provides another voice signal processing method, including:
  • receiving an intermediate parsing result returned by the server, where the intermediate parsing result is obtained by the server converting the entity word in an initial parsing result into a pinyin stream, and includes the fixed sentence pattern in the initial parsing result and the pinyin stream into which the entity word is converted;
  • correcting the pinyin stream in the intermediate parsing result by using a local information base to obtain a final parsing result.
  • the embodiment of the invention provides a voice signal processing device, which is implemented at a server, and the device includes:
  • a receiving module configured to receive a voice signal sent by the client
  • a voice recognition module configured to perform voice recognition on the voice signal to obtain text data
  • a semantic parsing module configured to perform fixed sentence semantic parsing on the text data to obtain an initial parsing result including a fixed sentence form and an entity word;
  • a conversion module configured to convert an entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result
  • a sending module configured to send the intermediate parsing result to the client, so that the client uses the local information base to correct the pinyin stream in the intermediate parsing result to obtain a final parsing result.
  • An embodiment of the present invention provides another voice signal processing apparatus, which is implemented at a client, where the apparatus includes:
  • a sending module configured to send a voice signal to the server, so that the server performs semantic analysis on the voice signal;
  • a receiving module configured to receive an intermediate parsing result returned by the server, where the intermediate parsing result is obtained by the server converting the entity word in the initial parsing result into a pinyin stream, and includes the fixed sentence pattern in the initial parsing result and the pinyin stream into which the entity word is converted;
  • a correction module configured to correct the pinyin stream in the intermediate parsing result by using a local information base to obtain a final parsing result.
  • The embodiment of the present invention further provides a non-transitory computer readable storage medium, which stores computer executable instructions for executing the above voice signal processing method.
  • An embodiment of the present invention further provides an electronic device, including: one or more processors; and a memory; wherein the memory stores instructions executable by the one or more processors, the instructions being configured to perform the above-described voice signal processing method.
  • Embodiments of the present invention also provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above-described speech signal processing method.
  • In the voice signal processing method and device, a fixed sentence semantic analysis method is used on the server side to perform semantic analysis on the text data corresponding to the voice signal, obtaining an initial parsing result that includes a fixed sentence pattern and an entity word; the entity word in the initial parsing result is then converted into a pinyin stream to obtain an intermediate parsing result, and the intermediate parsing result is sent to the client. On the client side, the local information base is used to correct the pinyin stream in the received intermediate parsing result to obtain the final parsing result.
  • The embodiment of the invention combines server-side parsing with client-side correction, makes full use of the client-side local information base for the semantic parsing of some entity words, and corrects results that the server cannot accurately parse, which improves the accuracy of semantic parsing while helping to reduce the number of semantic parsing templates stored on the server side.
  • FIG. 1 is a schematic flowchart of a voice signal processing method according to an embodiment of the present invention
  • FIG. 2 is a schematic flowchart of a voice signal processing method according to another embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a voice signal processing apparatus according to another embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a voice signal processing apparatus according to another embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • In the prior art, a preset semantic parsing template is mainly matched against a speech-recognized character string to perform semantic analysis on the speech signal. This method requires a sufficiently large number of semantic parsing templates, but in practice the number of templates is limited while the expressions of speech signals are varied, so precise matches often cannot be found, and the semantics of the speech signal cannot be parsed accurately.
  • Taking voice dialing as an example, there may be a semantic parsing template corresponding to the voice signal "Please call Zhang San", but no semantic parsing template for the voice signal "Please call Li Si", so the semantics of the voice signal "Please call Li Si" cannot be parsed.
  • The main principle is: on the server side, a fixed sentence semantic analysis method is used to semantically parse the text data corresponding to a voice signal, obtaining an initial parsing result that includes a fixed sentence pattern and an entity word. Because the uncertainty of the entity word is high, the entity word in the initial parsing result is converted into a pinyin stream to obtain an intermediate parsing result, and the intermediate parsing result is sent to the client. On the client side, the pinyin stream in the received intermediate parsing result is corrected using the local information base to obtain the final parsing result.
  • The embodiment of the invention combines server-side parsing with client-side correction, makes full use of the client-side local information base for the semantic parsing of some entity words, and corrects results that the server cannot accurately parse, which improves the accuracy of semantic parsing while helping to reduce the number of semantic parsing templates stored on the server side.
  • FIG. 1 is a schematic flowchart diagram of a voice signal processing method according to an embodiment of the present invention. As shown in Figure 1, the method includes:
  • the embodiment provides a voice signal processing method, which can be executed by a voice signal processing device for semantically analyzing a voice signal to improve the accuracy of semantic analysis.
  • the method provided in this embodiment is applicable to various application scenarios that require semantic analysis of voice signals, such as voice dialing, voice navigation, voice playback control, voice information retrieval, and the like.
  • In this embodiment, the voice signal processing device can be implemented at the server end in each application scenario.
  • The client collects the voice signal of the user, for example by recording the user's voice, and then transmits the voice signal to the server; specifically, the voice signal processing device at the server receives the voice signal transmitted by the client.
  • Before the client sends the voice signal, the voice signal may be subjected to analog-to-digital conversion, encoding, compression, and the like.
  • Correspondingly, the voice signal processing device may perform decompression, decoding, and similar processing on the voice signal, and then perform semantic analysis on the processed voice signal.
  • The speech signal processing device can perform speech recognition on the speech signal to obtain text data. For example, if the voice signal input by the user is "I want to call Zhang San", the voice signal can be recognized as the corresponding text data.
  • the speech signal processing apparatus may perform fixed sentence semantic parsing on the text data to obtain an initial parsing result including the fixed sentence form and the entity word.
  • the fixed sentence semantic analysis in this embodiment is different from the general semantic analysis in the prior art.
  • the general semantic parsing refers to a scheme of matching the text data by using a preset universal semantic parsing template to obtain semantics corresponding to the text data.
  • the fixed sentence semantic analysis in this embodiment refers to a scheme in which a predetermined fixed sentence parsing template is matched with text data to obtain semantics corresponding to the text data.
  • The fixed sentence parsing template includes a fixed expression portion and a pending expression portion.
  • The fixed expression part is relatively fixed and generally does not change across different requests in the same application scenario, while the pending expression part is not fixed and often changes across different requests in the same application scenario.
  • Specifically, the voice signal processing device may match the text data corresponding to the voice signal against the preset fixed sentence parsing templates to obtain the fixed sentence parsing template that the text data matches; for convenience of description, the matched fixed sentence parsing template is called the target fixed sentence parsing template.
  • the target fixed sentence parsing template also includes a fixed expression portion and a pending expression portion.
  • The speech signal processing device takes the content in the text data that corresponds to the fixed expression part of the target fixed sentence parsing template as the fixed sentence pattern in the initial parsing result, and takes the content in the text data that corresponds to the pending expression part of the target fixed sentence parsing template as the entity word in the initial parsing result.
  • For example, if the target fixed sentence parsing template matched by the text data "Please search for the lyrics of the song Childhood" is "Please search for the lyrics of the song xxx", the fixed expression portion "Please search for the lyrics of the song" is taken as the fixed sentence pattern in the initial parsing result, and "Childhood" is taken as the entity word in the initial parsing result.
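The template matching described above can be sketched as regular-expression matching, where the fixed expression part is literal text and the pending expression part is a capture group. The template strings and the "{entity}" slot marker below are illustrative assumptions, not templates taken from the source:

```python
import re

# Hypothetical fixed sentence parsing templates: the literal text is the fixed
# expression part, the named capture group is the pending expression part.
TEMPLATES = [
    re.compile(r"^Please search for the lyrics of the song (?P<entity>.+)$"),
    re.compile(r"^Please call (?P<entity>.+)$"),
]

def fixed_sentence_parse(text):
    """Return (fixed sentence pattern, entity word), or None if no template matches."""
    for template in TEMPLATES:
        match = template.match(text)
        if match:
            entity = match.group("entity")
            # Mark the entity slot so the corrected word can later be substituted back.
            return text.replace(entity, "{entity}"), entity
    return None

initial = fixed_sentence_parse("Please call Zhang Umbrella")
# → ("Please call {entity}", "Zhang Umbrella")
```

A production system would hold many such templates per application scenario; the key property is that only the capture-group content (the entity word) is uncertain.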
  • Optionally, before the fixed sentence semantic parsing, the text data may be subjected to general semantic parsing. The voice signal processing device may first match the text data against preset universal semantic parsing templates; if no universal semantic parsing template is matched, the device continues to perform fixed sentence semantic parsing on the text data to obtain an initial parsing result that includes the fixed sentence pattern and the entity word. If a universal semantic parsing template is matched, the parsing result of the text data is obtained according to the matched template and returned to the client, so that the client performs the corresponding operation according to the parsing result.
  • After the initial parsing result is obtained, it is not directly returned to the client as in the prior art. Considering that speech recognition may be erroneous, for example "Zhang San" in the speech signal may be recognized as "Zhang Umbrella", the speech signal processing device converts the entity word in the initial parsing result into a pinyin stream in order to improve the recognition result of the entity word, for example converting "Zhang Umbrella" into "zhang san", thereby obtaining an intermediate parsing result. For example, if the initial parsing result is "Please call Zhang Umbrella", the intermediate parsing result after the pinyin conversion is "Please call zhang san".
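A minimal sketch of the entity-word-to-pinyin conversion step. A production system would use a full pinyin dictionary (the pypinyin library is one option); the character table below is an illustrative assumption covering only the characters in this example:

```python
# Illustrative character-to-pinyin table (assumption, not a full dictionary).
PINYIN = {"张": "zhang", "伞": "san", "三": "san", "李": "li", "四": "si"}

def to_pinyin_stream(entity_word):
    """Convert each character of an entity word to its pinyin syllable."""
    return " ".join(PINYIN[ch] for ch in entity_word)

# The misrecognized "张伞" (Zhang Umbrella) and the intended "张三" (Zhang San)
# map to the same pinyin stream — this homophony is what makes the later
# client-side correction possible.
stream = to_pinyin_stream("张伞")  # → "zhang san"
```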
  • the speech signal processing device transmits the intermediate analysis result to the client.
  • On the client side, the intermediate parsing result sent by the voice signal processing device is received, and the pinyin stream in the intermediate parsing result is corrected using the local information base to obtain the final parsing result.
  • Specifically, the client can match the pinyin stream against the local information base, for example using a minimum edit distance matching algorithm, to obtain an entity word corresponding to the pinyin stream, and then replace the pinyin stream with that entity word to obtain the final parsing result.
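The minimum edit distance matching mentioned above can be sketched as follows. The address-book contents and the convention that the local information base maps entity words to their pinyin are illustrative assumptions:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via a rolling one-dimensional DP table."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution / match
    return dp[len(b)]

def correct_pinyin(pinyin_stream, local_info_base):
    """Return the entity word whose pinyin is closest to the received stream."""
    return min(local_info_base,
               key=lambda word: edit_distance(local_info_base[word], pinyin_stream))

# Hypothetical address book for a voice dialing scenario,
# mapping each contact name to its pinyin.
contacts = {"张三": "zhang san", "李四": "li si"}
corrected = correct_pinyin("zhang san", contacts)  # → "张三"
```

Matching at the pinyin level is also tolerant of small recognition errors: a slightly off stream such as "zhang sen" still lands on the nearest contact.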
  • The local information base of the client is actually a database related to the application scenario to which the client belongs. For example, if the client belongs to a voice dialing scenario, the local information base may be an address book; if the client belongs to a voice playback control scenario, the local information base may be a local music library.
  • Converting the entity word with strong uncertainty into a pinyin stream and sending it to the client helps the client accurately determine the entity word corresponding to the pinyin stream according to the information base related to the local application scenario, improving the accuracy of the final parsing result.
  • In addition, since the entity word with strong uncertainty is determined by the client according to the specific application scenario, the server only needs to store the fixed sentence parsing templates, instead of storing a universal semantic parsing template corresponding to each entity word as in the prior art, which helps reduce the number of parsing templates.
  • FIG. 2 is a schematic flowchart diagram of a method for processing a voice signal according to another embodiment of the present invention. As shown in Figure 2, the method includes:
  • This embodiment provides a voice signal processing method, which can be executed by a voice signal processing device, for semantically analyzing the speech signal to improve the accuracy of semantic analysis.
  • the method provided in this embodiment is applicable to various application scenarios that require semantic analysis of voice signals, such as voice dialing, voice navigation, voice playback control, voice information retrieval, and the like.
  • the voice signal processing device can be implemented by a client in each application scenario.
  • The voice signal processing device collects the voice signal of the user, for example by recording the user's voice, and then transmits the voice signal to the server, so that the server performs semantic analysis on the voice signal.
  • After sending the voice signal, the voice signal processing device waits to receive the intermediate parsing result returned by the server, and after receiving it, uses the local information base to correct the pinyin stream in the intermediate parsing result to obtain the final parsing result.
  • Specifically, the voice signal processing device matches the pinyin stream against the local information base to obtain an entity word corresponding to the pinyin stream, and combines the fixed sentence pattern in the intermediate parsing result with the entity word corresponding to the pinyin stream to obtain the final parsing result.
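The final combining step can be sketched as a simple substitution. The "{entity}" slot marker in the fixed sentence pattern is an illustrative convention, not specified by the source:

```python
def combine(fixed_sentence, entity_word):
    """Substitute the corrected entity word back into the fixed sentence pattern.

    Assumes the entity slot in the fixed sentence is marked with "{entity}",
    an illustrative convention for this sketch.
    """
    return fixed_sentence.replace("{entity}", entity_word)

# E.g. the fixed sentence "Please call {entity}" plus the corrected
# entity word "张三" yields the final parsing result.
final_result = combine("Please call {entity}", "张三")  # → "Please call 张三"
```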
  • the voice signal processing apparatus may use a minimum edit distance matching algorithm to match the pinyin stream in the local information base to obtain an entity word corresponding to the pinyin stream.
  • The local information base of the voice signal processing device is actually an information base related to the application scenario. For example, if the application scenario is voice dialing, the local information base may be an address book; if the application scenario is voice playback control, the local information base may be a local music library, a local video library, and so on.
  • the voice signal processing apparatus may perform corresponding operations according to the final analysis result, for example, performing dialing control according to the final analysis result, or performing playback control according to the final analysis result, or performing search according to the final analysis result.
  • The speech signal processing device can also directly reject the corresponding request of the user, for example, refuse to dial, refuse to play songs, or refuse to search for lyrics.
  • Alternatively, the voice signal processing device may handle this by interacting with the user, for example outputting prompt information for the user to decide whether to continue, and then performing the corresponding operation according to the user's instructions.
  • In this embodiment, the voice signal processing device cooperates with the server and, according to the information base related to the application scenario, accurately identifies the entity word corresponding to the pinyin stream for entity words with high uncertainty, which helps improve the accuracy of the final parsing result while helping to reduce the number of parsing templates.
  • FIG. 3 is a schematic structural diagram of a voice signal processing apparatus according to still another embodiment of the present invention.
  • the device is implemented in the server.
  • the apparatus includes: a receiving module 31, a voice recognition module 32, a semantic parsing module 33, a converting module 34, and a transmitting module 35.
  • the receiving module 31 is configured to receive a voice signal sent by the client.
  • the voice recognition module 32 is configured to perform voice recognition on the voice signal to obtain text data.
  • the semantic parsing module 33 is configured to perform fixed sentence semantic parsing on the text data to obtain an initial parsing result including the fixed sentence form and the entity word.
  • the conversion module 34 is configured to convert the entity words in the initial parsing result into a pinyin stream to obtain an intermediate parsing result.
  • the sending module 35 is configured to send the intermediate parsing result to the client, so that the client uses the local information base to correct the pinyin stream in the intermediate parsing result to obtain a final parsing result.
  • the semantic parsing module 33 is specifically configured to:
  • match the preset fixed sentence parsing templates against the text data to obtain the target fixed sentence parsing template that the text data matches, where the target fixed sentence parsing template includes a fixed expression part and a pending expression part;
  • take the content in the text data corresponding to the fixed expression part as the fixed sentence pattern in the initial parsing result, and the content in the text data corresponding to the pending expression part as the entity word in the initial parsing result.
  • the fixed sentence parsing template includes a fixed expression portion and a pending expression portion.
  • The fixed expression part is relatively fixed and generally does not change across different requests in the same application scenario, while the pending expression part is not fixed and often changes across different requests in the same application scenario.
  • Optionally, the semantic parsing module 33 is specifically configured to: match the text data against preset universal semantic parsing templates, and when no universal semantic parsing template is matched, trigger the operation of performing fixed sentence semantic parsing on the text data to obtain the initial parsing result that includes the fixed sentence pattern and the entity word.
  • The voice signal processing device provided in this embodiment is implemented on the server side. It uses a fixed sentence semantic analysis method to perform semantic analysis on the text data corresponding to the voice signal, obtains an initial parsing result that includes a fixed sentence pattern and an entity word, converts the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result, and sends the intermediate parsing result to the client, so that the client can use the local information base to correct the pinyin stream in the intermediate parsing result and obtain the final parsing result. This makes full use of the client local information base for the semantic parsing of some entity words and corrects results that the server cannot accurately parse, which improves the accuracy of semantic parsing and helps reduce the number of semantic parsing templates stored on the server.
  • FIG. 4 is a schematic structural diagram of a voice signal processing apparatus according to still another embodiment of the present invention.
  • the device is implemented at the client end. As shown in FIG. 4, the device includes a sending module 41, a receiving module 42, and a correcting module 43.
  • the sending module 41 is configured to send a voice signal to the server for semantic analysis of the voice signal by the server.
  • the receiving module 42 is configured to receive an intermediate parsing result returned by the server, where the intermediate parsing result is obtained by the server converting the entity word in the initial parsing result into a pinyin stream, and the intermediate parsing result includes a fixed sentence pattern in the initial parsing result and The pinyin stream into which the entity word is converted.
  • the correction module 43 is configured to correct the pinyin stream in the intermediate analysis result by using the local information base to obtain a final analysis result.
  • Optionally, the correction module 43 is specifically configured to: match the pinyin stream against the local information base to obtain an entity word corresponding to the pinyin stream; and combine the fixed sentence pattern in the intermediate parsing result with the entity word corresponding to the pinyin stream to obtain the final parsing result.
  • the correction module 43 may specifically use a minimum edit distance matching algorithm to match the pinyin stream in the local information base to obtain an entity word corresponding to the pinyin stream.
  • The local information base of the voice signal processing device is actually an information base related to the application scenario. For example, if the application scenario is voice dialing, the local information base may be an address book; if the application scenario is voice playback control, the local information base may be a local music library, a local video library, and so on.
  • The voice signal processing device provided in this embodiment is implemented at the client end and cooperates with the server to accurately identify the entity word corresponding to the pinyin stream for entity words with high uncertainty, according to the information base related to the local application scenario. This helps improve the accuracy of the final parsing result and reduce the number of parsing templates.
  • The embodiment of the present application further provides a non-transitory computer readable storage medium, which stores computer executable instructions that can execute the voice signal processing method in any of the foregoing method embodiments.
  • FIG. 5 is a schematic structural diagram of hardware of an electronic device for performing a voice signal processing method according to an embodiment of the present disclosure. As shown in FIG. 5, the device includes:
  • one or more processors 510 and a memory 520; one processor 510 is taken as an example in FIG. 5.
  • the apparatus for performing the voice signal processing method may further include: an input device 530 and an output device 540.
  • The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means; a bus connection is taken as an example in FIG. 5.
  • The memory 520, as a non-transitory computer readable storage medium, can be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as the program instructions/modules corresponding to the voice signal processing method in the embodiments of the present application (for example, the receiving module 31, the voice recognition module 32, the semantic parsing module 33, the conversion module 34, and the sending module 35 shown in FIG. 3, or the sending module 41, the receiving module 42, and the correction module 43 shown in FIG. 4).
  • the processor 510 executes various functional applications and data processing of the electronic device by running non-transitory software programs, instructions, and modules stored in the memory 520, that is, implementing the voice signal processing method of the above method embodiment.
  • The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application required for at least one function, and the storage data area may store data created according to the use of the voice signal processing device, and the like.
  • memory 520 can include high speed random access memory, and can also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device.
  • memory 520 can optionally include memory remotely disposed relative to processor 510, which can be coupled to the voice signal processing device over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • Input device 530 can receive input digital or character information and generate key signal inputs related to user settings and function control of the voice signal processing device.
  • the output device 540 can include a display device such as a display screen.
  • the one or more modules are stored in the memory 520, and when executed by the one or more processors 510, perform a speech signal processing method in any of the above method embodiments.
  • the electronic device of the embodiment of the invention exists in various forms, including but not limited to:
  • Mobile communication devices: these devices are characterized by mobile communication functions and are mainly aimed at providing voice and data communication. Such terminals include smart phones (such as the iPhone), multimedia phones, functional phones, and low-end phones.
  • Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDAs, MIDs, and UMPC devices, such as the iPad.
  • Portable entertainment devices: these devices can display and play multimedia content. Such devices include audio and video players (such as the iPod), handheld game consoles, e-book readers, smart toys, and portable car navigation devices.
  • Servers: a server consists of a processor, a hard disk, memory, a system bus, and so on. A server is similar in architecture to a general-purpose computer, but because it needs to provide highly reliable services, it has higher requirements in terms of processing power, stability, reliability, security, scalability, and manageability.
  • the program when executed, may include the flow of an embodiment of the methods as described above.
  • the storage medium may be a magnetic disk, an optical disk, a read only memory (ROM), or a random access memory (RAM).
  • The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.


Abstract

A speech signal processing method and apparatus. The speech signal processing method comprises: receiving a speech signal sent by a client (101); performing speech recognition on the speech signal to obtain text data (102); performing fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word (103); converting the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result (104); and sending the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result (105).

Description

Speech signal processing method and apparatus
This application claims priority to Chinese Patent Application No. 201610193074.0, entitled "Speech signal processing method and apparatus", filed with the Chinese Patent Office on March 30, 2016, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to the field of speech recognition technology, and in particular to a speech signal processing method and apparatus.
Background
With the development of speech recognition technology, more and more applications are based on speech recognition, such as voice dialing, voice navigation, voice playback control, and voice information retrieval. All such applications need to perform semantic parsing on a speech signal, extract the user intent expressed by the speech signal, and convert it into a structured data format that a machine can understand.
The prior art mainly parses the semantics of a speech signal by matching preset semantic parsing templates against the character string produced by speech recognition. This approach requires a sufficiently large number of semantic parsing templates, but in practice the number of templates is limited while speech signals can be expressed in many different ways, so exact matches often fail, making it impossible to parse the semantics of the speech signal accurately.
Summary of the Invention
The present invention provides a speech signal processing method and apparatus for performing semantic parsing on a speech signal and improving the accuracy of semantic parsing.
An embodiment of the present invention provides a speech signal processing method, comprising:
receiving a speech signal sent by a client;
performing speech recognition on the speech signal to obtain text data;
performing fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word;
converting the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result;
sending the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
An embodiment of the present invention provides another speech signal processing method, comprising:
sending a speech signal to a server, so that the server performs semantic parsing on the speech signal;
receiving an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word;
correcting the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
An embodiment of the present invention provides a speech signal processing apparatus, implemented on a server, the apparatus comprising:
a receiving module, configured to receive a speech signal sent by a client;
a speech recognition module, configured to perform speech recognition on the speech signal to obtain text data;
a semantic parsing module, configured to perform fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word;
a conversion module, configured to convert the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result;
a sending module, configured to send the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
An embodiment of the present invention provides another speech signal processing apparatus, implemented on a client, the apparatus comprising:
a sending module, configured to send a speech signal to a server, so that the server performs semantic parsing on the speech signal;
a receiving module, configured to receive an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word;
a correction module, configured to correct the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being configured to perform the above speech signal processing method.
An embodiment of the present invention further provides an electronic device, comprising: one or more processors; and a memory; wherein the memory stores instructions executable by the one or more processors, the instructions being configured to perform the above speech signal processing method.
An embodiment of the present invention further provides a computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the above speech signal processing method.
In the speech signal processing method and apparatus provided by the embodiments of the present invention, on the server side, fixed-sentence-pattern semantic parsing is performed on the text data corresponding to the speech signal to obtain an initial parsing result comprising a fixed sentence pattern and an entity word; the entity word in the initial parsing result is converted into a pinyin stream to obtain an intermediate parsing result; and the intermediate parsing result is sent to the client. On the client side, the pinyin stream in the received intermediate parsing result is corrected using a local information library to obtain a final parsing result. The embodiments of the present invention combine server-side parsing with client-side correction, making full use of the client's local information library in the semantic parsing of certain entity words and correcting results that the server cannot parse accurately, which improves the accuracy of semantic parsing and also helps reduce the number of semantic parsing templates stored on the server.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a speech signal processing method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a speech signal processing method according to another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a speech signal processing apparatus according to yet another embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a speech signal processing apparatus according to yet another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In the prior art, the semantics of a speech signal are mainly parsed by matching preset semantic parsing templates against the character string produced by speech recognition. This approach requires a sufficiently large number of templates, but in practice the number of templates is limited while speech signals can be expressed in many different ways, so exact matches often fail and the semantics cannot be parsed accurately. Taking voice dialing as an example, there may be a semantic parsing template corresponding to the speech signal "请给张三拨打电话" ("please dial Zhang San"), but there may be no template for the speech signal "请给李四打电话" ("please call Li Si"), so the semantics of "请给李四打电话" may not be parsed accurately.
To address the above problem, an embodiment of the present invention provides a solution whose main principle is as follows: on the server side, fixed-sentence-pattern semantic parsing is performed on the text data corresponding to the speech signal to obtain an initial parsing result comprising a fixed sentence pattern and an entity word; because entity words carry high uncertainty, the entity word in the initial parsing result is converted into a pinyin stream to obtain an intermediate parsing result, which is sent to the client; on the client side, the pinyin stream in the received intermediate parsing result is corrected using a local information library to obtain the final parsing result.
The embodiments of the present invention combine server-side parsing with client-side correction, making full use of the client's local information library in the semantic parsing of certain entity words and correcting results that the server cannot parse accurately, which improves the accuracy of semantic parsing and also helps reduce the number of semantic parsing templates stored on the server.
The technical solution of the present invention is described in detail below through specific embodiments.
FIG. 1 is a schematic flowchart of a speech signal processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes:
101. Receive a speech signal sent by a client.
102. Perform speech recognition on the speech signal to obtain text data.
103. Perform fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word.
104. Convert the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result.
105. Send the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
This embodiment provides a speech signal processing method, which may be executed by a speech signal processing apparatus, for performing semantic parsing on a speech signal and improving the accuracy of semantic parsing.
The method provided by this embodiment is applicable to various application scenarios that require semantic parsing of speech signals, such as voice dialing, voice navigation, voice playback control, and voice information retrieval. The speech signal processing apparatus may be implemented on the server side in each scenario.
Specifically, in each application scenario, the client collects the user's speech signal, for example by recording the user's voice, and then sends the speech signal to the server, specifically to the speech signal processing apparatus on the server. The speech signal processing apparatus receives the speech signal sent by the client.
Optionally, before sending the speech signal, the client may perform analog-to-digital conversion, encoding, compression, and similar processing on it. Correspondingly, after receiving the speech signal, the speech signal processing apparatus may decompress and decode it, and perform semantic parsing on the processed signal.
After obtaining the speech signal, the speech signal processing apparatus may perform speech recognition on it to obtain text data. For example, if the speech signal input by the user is "我要给张三打电话" ("I want to call Zhang San"), the signal can be recognized as the corresponding text data. For specific speech recognition schemes, reference may be made to the prior art, which is not detailed here.
After obtaining the text data corresponding to the speech signal, the speech signal processing apparatus may perform fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word. The fixed-sentence-pattern semantic parsing in this embodiment differs from general semantic parsing in the prior art: general semantic parsing matches preset general semantic parsing templates against the text data to obtain its semantics, whereas the fixed-sentence-pattern semantic parsing of this embodiment matches preset fixed-sentence-pattern parsing templates against the text data to obtain its semantics.
In this embodiment, a fixed-sentence-pattern parsing template includes a fixed expression part and a to-be-determined expression part. The fixed expression part is relatively stable and generally does not change across different requests in the same application scenario, while the to-be-determined expression part is not fixed and often varies across different requests in the same scenario.
For example, "请给xxx打电话" ("please call xxx") is a fixed-sentence-pattern parsing template, in which "请给…打电话" is the fixed expression part and "xxx" is the to-be-determined expression part. In this template the to-be-determined part mainly refers to a person's name; in different dialing requests, the name of the person to be called is usually different.
As another example, "请播放歌曲xxx" ("please play the song xxx") is another fixed-sentence-pattern parsing template, in which "请播放歌曲…" is the fixed expression part and "xxx" is the to-be-determined expression part. In this template the to-be-determined part mainly refers to a song title; in different playback requests, the requested song is usually different.
As yet another example, "请搜索歌曲xxx的歌词" ("please search for the lyrics of the song xxx") is yet another fixed-sentence-pattern parsing template, in which "请搜索歌曲…的歌词" is the fixed expression part and "xxx" is the to-be-determined expression part. In this template the to-be-determined part mainly refers to a song title; in different search requests, the lyrics being searched for usually belong to different songs.
Based on the above, the speech signal processing apparatus may match preset fixed-sentence-pattern parsing templates against the text data corresponding to the speech signal to obtain the fixed-sentence-pattern parsing template that the text data matches; for ease of description, this matched template is called the target fixed-sentence-pattern parsing template. The target template also includes a fixed expression part and a to-be-determined expression part. The apparatus then takes the content of the text data corresponding to the fixed expression part of the target template as the fixed sentence pattern in the initial parsing result, and the content of the text data corresponding to the to-be-determined expression part as the entity word in the initial parsing result.
For example, if the text data corresponding to the speech signal is "请给张三打电话", the matched target fixed-sentence-pattern parsing template is "请给xxx打电话"; the fixed expression part "请给…打电话" becomes the fixed sentence pattern in the initial parsing result, and "张三" (Zhang San) becomes the entity word.
As another example, if the text data is "请播放歌曲小燕子", the matched target template is "请播放歌曲xxx"; the fixed expression part "请播放歌曲…" becomes the fixed sentence pattern, and "小燕子" becomes the entity word.
As yet another example, if the text data is "请搜索歌曲童年的歌词", the matched target template is "请搜索歌曲xxx的歌词"; the fixed expression part "请搜索歌曲…的歌词" becomes the fixed sentence pattern, and "童年" becomes the entity word.
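The template-matching step described above can be sketched as follows. This is a minimal illustration, not the patent's actual implementation: the regex-based templates, intent labels, and function names are assumptions introduced here for clarity.

```python
import re

# Illustrative fixed-sentence-pattern parsing templates: each has a fixed
# expression part and one to-be-determined slot (the entity word).
TEMPLATES = [
    ("call",   re.compile(r"^请给(?P<entity>.+)打电话$")),
    ("play",   re.compile(r"^请播放歌曲(?P<entity>.+)$")),
    ("lyrics", re.compile(r"^请搜索歌曲(?P<entity>.+)的歌词$")),
]

def parse_fixed_pattern(text):
    """Match text against the templates; return (intent, fixed pattern, entity word)."""
    for intent, pattern in TEMPLATES:
        m = pattern.match(text)
        if m:
            entity = m.group("entity")
            fixed = text.replace(entity, "…", 1)  # keep only the fixed expression part
            return intent, fixed, entity
    return None  # no fixed-sentence-pattern template matched

print(parse_fixed_pattern("请给张三打电话"))  # ('call', '请给…打电话', '张三')
```

With this sketch, "请播放歌曲小燕子" yields the fixed pattern "请播放歌曲…" and the entity word "小燕子", matching the examples above.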
In an optional implementation, before performing fixed-sentence-pattern semantic parsing on the text data corresponding to the speech signal, general semantic parsing may be performed first. Specifically, the speech signal processing apparatus may first match preset general semantic parsing templates against the text data; if no general semantic parsing template matches, the apparatus proceeds with fixed-sentence-pattern semantic parsing to obtain the initial parsing result comprising the fixed sentence pattern and the entity word.
Further, if a general semantic parsing template matches, the parsing result of the text data is obtained from the matched general template and returned to the client, so that the client performs the corresponding operation according to that result.
In this embodiment, after the initial parsing result corresponding to the text data is obtained, it is not returned directly to the client as in the prior art. Considering the uncertainty of the entity word in the initial parsing result, the speech recognition result may be wrong; for example, "张三" (Zhang San) in the speech signal may be recognized as "张伞" (the wrong characters with the same pronunciation). To improve the recognition of the entity word, the speech signal processing apparatus converts the entity word in the initial parsing result into a pinyin stream, e.g. "张伞" into "zhang san", thereby obtaining the intermediate parsing result. For example, if the initial parsing result is "请给张伞打电话", the intermediate parsing result after pinyin conversion is "请给zhang san打电话".
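The pinyin-conversion step can be sketched as below. The tiny character-to-pinyin table is a stand-in assumption for illustration only; a real system would use a full pinyin lexicon or conversion library.

```python
# Toy character-to-pinyin table (illustrative; a production system would use
# a complete lexicon covering tones, polyphones, etc.).
PINYIN = {"张": "zhang", "伞": "san", "三": "san",
          "小": "xiao", "燕": "yan", "子": "zi"}

def to_pinyin_stream(entity):
    """Convert an entity word into a space-separated pinyin stream."""
    return " ".join(PINYIN[ch] for ch in entity)

def to_intermediate(fixed_pattern, entity):
    """Build the intermediate parsing result: fixed pattern + pinyin stream."""
    return fixed_pattern.replace("…", to_pinyin_stream(entity), 1)

print(to_intermediate("请给…打电话", "张伞"))  # 请给zhang san打电话
```

Note how the mis-recognized "张伞" and the intended "张三" produce the same pinyin stream "zhang san", which is exactly what allows the client-side correction described below.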
After obtaining the intermediate parsing result, the speech signal processing apparatus sends it to the client. The client receives the intermediate parsing result and corrects the pinyin stream in it using its local information library to obtain the final parsing result. Specifically, the client may match the pinyin stream against the local information library, for example using a minimum-edit-distance matching algorithm, to obtain the entity word corresponding to the pinyin stream, and then replace the pinyin stream with that entity word to obtain the final parsing result.
It is worth noting that the client's local information library is in fact an information library related to the client's application scenario: if the client belongs to a voice dialing scenario, the local library may be the contact list; if it belongs to a voice playback control scenario, the local library may be the local music library.
By converting highly uncertain entity words into pinyin streams and sending them to the client, this embodiment allows the client to accurately determine the entity word corresponding to the pinyin stream from its local, scenario-related information library, improving the accuracy of the final parsing result. Moreover, since highly uncertain entity words are determined by the client according to the specific application scenario, the server only needs to store fixed-sentence-pattern parsing templates instead of a general semantic parsing template for every entity word as in the prior art, which helps reduce the number of parsing templates.
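The minimum-edit-distance matching mentioned above can be sketched as follows. This is an illustrative sketch; the contact list and function names are assumptions, and only the classic Levenshtein distance is shown.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via one-row dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # prev holds the diagonal cell (old dp[j-1]) for substitution cost
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[len(b)]

def correct_pinyin(pinyin_stream, local_library):
    """Pick the local entry whose pinyin is closest to the recognized stream."""
    best = min(local_library, key=lambda entry: edit_distance(pinyin_stream, entry[1]))
    return best[0]

contacts = [("张三", "zhang san"), ("李四", "li si")]  # illustrative contact list
print(correct_pinyin("zhang san", contacts))  # 张三
```

Because matching happens in pinyin space against a small, scenario-specific library, even a mis-recognized entity word ("张伞") maps back to the correct local entry ("张三").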
FIG. 2 is a schematic flowchart of a speech signal processing method according to another embodiment of the present invention. As shown in FIG. 2, the method includes:
201. Send a speech signal to a server, so that the server performs semantic parsing on the speech signal.
202. Receive an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word.
203. Correct the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
This embodiment provides a speech signal processing method, which may be executed by a speech signal processing apparatus, for performing semantic parsing on a speech signal and improving the accuracy of semantic parsing.
The method provided by this embodiment is applicable to various application scenarios that require semantic parsing of speech signals, such as voice dialing, voice navigation, voice playback control, and voice information retrieval. The speech signal processing apparatus may be implemented on the client side in each scenario.
Specifically, in each application scenario, the speech signal processing apparatus collects the user's speech signal, for example by recording the user's voice, and then sends the speech signal to the server, so that the server performs semantic parsing on it.
For the process by which the server performs semantic parsing on the speech signal, reference may be made to the description of the embodiment shown in FIG. 1, which is not repeated here.
After sending the speech signal to the server, the speech signal processing apparatus waits to receive the intermediate parsing result returned by the server; upon receiving it, the apparatus corrects the pinyin stream in the intermediate parsing result using the local information library to obtain the final parsing result.
Specifically, the speech signal processing apparatus matches the pinyin stream against the local information library to obtain the entity word corresponding to the pinyin stream, and combines the fixed sentence pattern in the intermediate parsing result with that entity word to obtain the final parsing result.
For example, the speech signal processing apparatus may use a minimum-edit-distance matching algorithm to match the pinyin stream against the local information library to obtain the entity word corresponding to the pinyin stream.
It is worth noting that the local information library of the speech signal processing apparatus is in fact an information library related to the application scenario: in a voice dialing scenario, the local library may be the contact list; in a voice playback control scenario, the local library may be the local music library, a local video library, and the like.
In addition, after obtaining the final parsing result, the speech signal processing apparatus may perform the corresponding operation according to it, such as dialing control, playback control, or searching.
It should be noted that if the speech signal processing apparatus fails to match an entity word for the pinyin stream in the local information library, it may directly reject the user's request, e.g. refuse to dial, refuse to play a song, or refuse to search lyrics. Alternatively, in that case the apparatus may handle the situation interactively, e.g. output a prompt so that the user can decide whether to continue the corresponding operation, and then act according to the user's instruction.
In this embodiment, the speech signal processing apparatus cooperates with the server and, based on the local, scenario-related information library, can accurately recognize the pinyin streams corresponding to highly uncertain entity words, which helps improve the accuracy of the final parsing result while also reducing the number of parsing templates.
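The server-side and client-side halves described in the two embodiments above fit together as in the following end-to-end sketch. The template, pinyin table, and substring-based correction are toy stand-ins (assumptions) for the real recognition and matching components.

```python
import re

TEMPLATE = re.compile(r"^请给(?P<entity>.+)打电话$")  # illustrative dialing template
PINYIN = {"张": "zhang", "伞": "san", "三": "san"}      # toy pinyin table

def server_parse(text):
    """Server side: fixed-pattern parsing, then entity-to-pinyin conversion."""
    m = TEMPLATE.match(text)
    entity = m.group("entity")
    stream = " ".join(PINYIN[ch] for ch in entity)
    return text.replace(entity, stream, 1)  # intermediate parsing result

def client_correct(intermediate, contacts):
    """Client side: replace the pinyin stream with the matching local entity word."""
    for name, pinyin in contacts:
        if pinyin in intermediate:
            return intermediate.replace(pinyin, name, 1)  # final parsing result
    return intermediate  # no local match: caller may reject or prompt the user

inter = server_parse("请给张伞打电话")            # "张三" was mis-recognized as "张伞"
final = client_correct(inter, [("张三", "zhang san")])
print(final)  # 请给张三打电话
```

Note that the no-match branch returns the intermediate result unchanged, mirroring the embodiment's option of rejecting the request or asking the user how to proceed.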
FIG. 3 is a schematic structural diagram of a speech signal processing apparatus according to yet another embodiment of the present invention. The apparatus is implemented on a server. As shown in FIG. 3, the apparatus includes: a receiving module 31, a speech recognition module 32, a semantic parsing module 33, a conversion module 34, and a sending module 35.
The receiving module 31 is configured to receive a speech signal sent by a client.
The speech recognition module 32 is configured to perform speech recognition on the speech signal to obtain text data.
The semantic parsing module 33 is configured to perform fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word.
The conversion module 34 is configured to convert the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result.
The sending module 35 is configured to send the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
In an optional implementation, the semantic parsing module 33 is specifically configured to:
match preset fixed-sentence-pattern parsing templates against the text data to obtain a target fixed-sentence-pattern parsing template matched by the text data, the target template comprising a fixed expression part and a to-be-determined expression part; and
take the content of the text data corresponding to the fixed expression part as the fixed sentence pattern in the initial parsing result, and take the content of the text data corresponding to the to-be-determined expression part as the entity word in the initial parsing result.
In this embodiment, a fixed-sentence-pattern parsing template includes a fixed expression part and a to-be-determined expression part. The fixed expression part is relatively stable and generally does not change across different requests in the same application scenario, while the to-be-determined expression part is not fixed and often varies across different requests in the same scenario.
In another optional implementation, the semantic parsing module 33 is specifically configured to: match preset general semantic parsing templates against the text data and, when no general semantic parsing template is matched, trigger execution of the operation of performing fixed-sentence-pattern semantic parsing on the text data to obtain the initial parsing result comprising the fixed sentence pattern and the entity word.
The speech signal processing apparatus provided by this embodiment, implemented on a server, performs fixed-sentence-pattern semantic parsing on the text data corresponding to a speech signal to obtain an initial parsing result comprising a fixed sentence pattern and an entity word, converts the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result, and sends the intermediate parsing result to the client, so that the client can correct the pinyin stream using its local information library to obtain the final parsing result. This makes full use of the client's local information library in the semantic parsing of certain entity words, corrects results that the server cannot parse accurately, improves the accuracy of semantic parsing, and helps reduce the number of semantic parsing templates stored on the server.
FIG. 4 is a schematic structural diagram of a speech signal processing apparatus according to yet another embodiment of the present invention. The apparatus is implemented on a client. As shown in FIG. 4, the apparatus includes: a sending module 41, a receiving module 42, and a correction module 43.
The sending module 41 is configured to send a speech signal to a server, so that the server performs semantic parsing on the speech signal.
The receiving module 42 is configured to receive an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word.
The correction module 43 is configured to correct the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
In an optional implementation, the correction module 43 is specifically configured to:
match the pinyin stream against the local information library to obtain the entity word corresponding to the pinyin stream; and
combine the fixed sentence pattern with the entity word corresponding to the pinyin stream to obtain the final parsing result.
For example, the correction module 43 may specifically use a minimum-edit-distance matching algorithm to match the pinyin stream against the local information library to obtain the entity word corresponding to the pinyin stream.
It is worth noting that the local information library of the speech signal processing apparatus is in fact an information library related to the application scenario: in a voice dialing scenario, the local library may be the contact list; in a voice playback control scenario, the local library may be the local music library, a local video library, and the like.
The speech signal processing apparatus provided by this embodiment, implemented on a client, cooperates with the server and, based on the local, scenario-related information library, can accurately recognize the pinyin streams corresponding to highly uncertain entity words, which helps improve the accuracy of the final parsing result while also reducing the number of parsing templates.
An embodiment of the present application further provides a non-transitory computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being capable of executing the speech signal processing method in any of the above method embodiments.
FIG. 5 is a schematic diagram of the hardware structure of an electronic device for executing the speech signal processing method according to an embodiment of the present application. As shown in FIG. 5, the device includes:
one or more processors 510 and a memory 520; one processor 510 is taken as an example in FIG. 5.
The device for executing the speech signal processing method may further include an input apparatus 530 and an output apparatus 540.
The processor 510, the memory 520, the input apparatus 530, and the output apparatus 540 may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 5.
As a non-transitory computer-readable storage medium, the memory 520 may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the speech signal processing method in the embodiments of the present application (for example, the receiving module 31, speech recognition module 32, semantic parsing module 33, conversion module 34, and sending module 35 shown in FIG. 3, or the sending module 41, receiving module 42, and correction module 43 shown in FIG. 4). By running the non-transitory software programs, instructions, and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the electronic device, i.e., implements the speech signal processing method of the above method embodiments.
The memory 520 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created according to the use of the speech signal processing apparatus, and the like. In addition, the memory 520 may include a high-speed random access memory, and may further include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 520 may optionally include memories remotely located relative to the processor 510, and these remote memories may be connected to the speech signal processing apparatus through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input apparatus 530 may receive input digital or character information and generate key signal inputs related to the user settings and function control of the speech signal processing apparatus. The output apparatus 540 may include a display device such as a display screen.
The one or more modules are stored in the memory 520 and, when executed by the one or more processors 510, perform the speech signal processing method in any of the above method embodiments.
The above product can execute the method provided by the embodiments of the present application and has the corresponding functional modules and beneficial effects. For technical details not described in detail in this embodiment, reference may be made to the method provided by the embodiments of the present application.
The electronic device of the embodiments of the present invention exists in various forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication capability, with voice and data communication as the primary goal. Such terminals include smartphones (e.g. iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, e.g. the iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. They include audio and video players (e.g. iPod), handheld game consoles, e-book readers, smart toys, and portable in-car navigation devices.
(4) Servers: devices providing computing services. A server consists of a processor, a hard disk, memory, a system bus, and so on. Its architecture is similar to that of a general-purpose computer, but because highly reliable services must be provided, the requirements on processing power, stability, reliability, security, scalability, and manageability are higher.
(5) Other electronic apparatuses with data interaction functions.
Finally, it should be noted that a person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing related hardware; the program may be stored in a non-transitory computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement them without creative effort.
From the description of the above implementations, a person skilled in the art can clearly understand that each implementation can be realized by means of software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes a number of instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in each embodiment or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention rather than to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of the technical features therein may be equivalently replaced; such modifications or replacements do not make the essence of the corresponding technical solutions depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (15)

  1. A speech signal processing method, applied to a server, comprising:
    receiving a speech signal sent by a client;
    performing speech recognition on the speech signal to obtain text data;
    performing fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word;
    converting the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result;
    sending the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  2. The method according to claim 1, wherein the performing fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word comprises:
    matching preset fixed-sentence-pattern parsing templates against the text data to obtain a target fixed-sentence-pattern parsing template matched by the text data, the target fixed-sentence-pattern parsing template comprising a fixed expression part and a to-be-determined expression part;
    taking the content of the text data corresponding to the fixed expression part as the fixed sentence pattern in the initial parsing result, and taking the content of the text data corresponding to the to-be-determined expression part as the entity word in the initial parsing result.
  3. The method according to claim 1 or 2, wherein before the performing fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word, the method comprises:
    matching preset general semantic parsing templates against the text data, and when no general semantic parsing template is matched, triggering execution of the operation of performing fixed-sentence-pattern semantic parsing on the text data to obtain the initial parsing result comprising the fixed sentence pattern and the entity word.
  4. A speech signal processing method, applied to a client, comprising:
    sending a speech signal to a server, so that the server performs semantic parsing on the speech signal;
    receiving an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word;
    correcting the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  5. The method according to claim 4, wherein the correcting the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result comprises:
    matching the pinyin stream against the local information library to obtain the entity word corresponding to the pinyin stream;
    combining the fixed sentence pattern with the entity word corresponding to the pinyin stream to obtain the final parsing result.
  6. A speech signal processing apparatus, implemented on a server, the apparatus comprising:
    a receiving module, configured to receive a speech signal sent by a client;
    a speech recognition module, configured to perform speech recognition on the speech signal to obtain text data;
    a semantic parsing module, configured to perform fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word;
    a conversion module, configured to convert the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result;
    a sending module, configured to send the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  7. The apparatus according to claim 6, wherein the semantic parsing module is specifically configured to:
    match preset fixed-sentence-pattern parsing templates against the text data to obtain a target fixed-sentence-pattern parsing template matched by the text data, the target fixed-sentence-pattern parsing template comprising a fixed expression part and a to-be-determined expression part;
    take the content of the text data corresponding to the fixed expression part as the fixed sentence pattern in the initial parsing result, and take the content of the text data corresponding to the to-be-determined expression part as the entity word in the initial parsing result.
  8. The apparatus according to claim 6 or 7, wherein the semantic parsing module is specifically configured to:
    match preset general semantic parsing templates against the text data, and when no general semantic parsing template is matched, trigger execution of the operation of performing fixed-sentence-pattern semantic parsing on the text data to obtain the initial parsing result comprising the fixed sentence pattern and the entity word.
  9. A speech signal processing apparatus, implemented on a client, the apparatus comprising:
    a sending module, configured to send a speech signal to a server, so that the server performs semantic parsing on the speech signal;
    a receiving module, configured to receive an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word;
    a correction module, configured to correct the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  10. The apparatus according to claim 9, wherein the correction module is specifically configured to:
    match the pinyin stream against the local information library to obtain the entity word corresponding to the pinyin stream;
    combine the fixed sentence pattern with the entity word corresponding to the pinyin stream to obtain the final parsing result.
  11. A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a computer to:
    receive a speech signal sent by a client;
    perform speech recognition on the speech signal to obtain text data;
    perform fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word;
    convert the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result;
    send the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  12. A non-transitory computer-readable storage medium storing computer instructions, the computer instructions being configured to cause a computer to:
    send a speech signal to a server, so that the server performs semantic parsing on the speech signal;
    receive an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word;
    correct the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  13. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    receive a speech signal sent by a client;
    perform speech recognition on the speech signal to obtain text data;
    perform fixed-sentence-pattern semantic parsing on the text data to obtain an initial parsing result comprising a fixed sentence pattern and an entity word;
    convert the entity word in the initial parsing result into a pinyin stream to obtain an intermediate parsing result;
    send the intermediate parsing result to the client, so that the client corrects the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  14. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to:
    send a speech signal to a server, so that the server performs semantic parsing on the speech signal;
    receive an intermediate parsing result returned by the server, the intermediate parsing result being obtained by the server converting an entity word in an initial parsing result into a pinyin stream, the intermediate parsing result comprising the fixed sentence pattern in the initial parsing result and the pinyin stream converted from the entity word;
    correct the pinyin stream in the intermediate parsing result using a local information library to obtain a final parsing result.
  15. A computer program product, comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to execute the method according to any one of claims 1 to 5.
PCT/CN2016/096984 2016-03-30 2016-08-26 Speech signal processing method and apparatus (语音信号处理方法及装置) WO2017166649A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610193074.0 2016-03-30
CN201610193074.0A CN105895090A (zh) 2016-03-30 2016-03-30 语音信号处理方法及装置

Publications (1)

Publication Number Publication Date
WO2017166649A1 true WO2017166649A1 (zh) 2017-10-05

Family

ID=57014826

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/096984 WO2017166649A1 (zh) 2016-03-30 2016-08-26 语音信号处理方法及装置

Country Status (2)

Country Link
CN (1) CN105895090A (zh)
WO (1) WO2017166649A1 (zh)


Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105895090A (zh) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 语音信号处理方法及装置
CN106412678A (zh) * 2016-09-14 2017-02-15 安徽声讯信息技术有限公司 一种视频新闻实时转写存储方法及系统
CN107016070B (zh) * 2017-03-22 2020-06-02 北京光年无限科技有限公司 一种用于智能机器人的人机对话方法及装置
CN107273364A (zh) * 2017-05-15 2017-10-20 百度在线网络技术(北京)有限公司 一种语音翻译方法和装置
CN108010525A (zh) * 2017-12-07 2018-05-08 横琴七弦琴知识产权服务有限公司 一种语音控制智能抽屉系统
CN108009303B (zh) * 2017-12-30 2021-09-14 北京百度网讯科技有限公司 基于语音识别的搜索方法、装置、电子设备和存储介质
CN108228191B (zh) * 2018-02-06 2022-01-25 威盛电子股份有限公司 语法编译系统以及语法编译方法
CN109147784B (zh) * 2018-09-10 2021-06-08 百度在线网络技术(北京)有限公司 语音交互方法、设备以及存储介质
CN109256125B (zh) * 2018-09-29 2022-10-14 阿波罗智联(北京)科技有限公司 语音的离线识别方法、装置与存储介质
CN111292751B (zh) * 2018-11-21 2023-02-28 北京嘀嘀无限科技发展有限公司 语义解析方法及装置、语音交互方法及装置、电子设备
CN109977405A (zh) * 2019-03-26 2019-07-05 北京博瑞彤芸文化传播股份有限公司 一种智能语义匹配方法
CN110008471A (zh) * 2019-03-26 2019-07-12 北京博瑞彤芸文化传播股份有限公司 一种基于拼音转换的智能语义匹配方法
CN110164435B (zh) * 2019-04-26 2024-06-25 平安科技(深圳)有限公司 语音识别方法、装置、设备及计算机可读存储介质
CN110767219B (zh) * 2019-09-17 2021-12-28 中国第一汽车股份有限公司 语义更新方法、装置、服务器和存储介质
CN113127610B (zh) * 2019-12-31 2024-04-19 北京猎户星空科技有限公司 一种数据处理方法、装置、设备及介质
CN111554295B (zh) * 2020-04-24 2021-06-22 科大讯飞(苏州)科技有限公司 文本纠错方法、相关设备及可读存储介质
CN113076397A (zh) * 2021-03-29 2021-07-06 Oppo广东移动通信有限公司 意图识别方法、装置、电子设备及存储介质
CN115294976A (zh) * 2022-06-23 2022-11-04 中国第一汽车股份有限公司 一种基于车载语音场景的纠错交互方法、系统及其车辆

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1674091A (zh) * 2005-04-18 2005-09-28 南京师范大学 地理信息的语音识别方法及其在导航系统中的应用
CN102682763A (zh) * 2011-03-10 2012-09-19 北京三星通信技术研究有限公司 修正语音输入文本中命名实体词汇的方法、装置及终端
CN103377652A (zh) * 2012-04-25 2013-10-30 上海智臻网络科技有限公司 一种用于进行语音识别的方法、装置和设备
CN103594085A (zh) * 2012-08-16 2014-02-19 百度在线网络技术(北京)有限公司 一种提供语音识别结果的方法及系统
CN103680505A (zh) * 2013-09-03 2014-03-26 安徽科大讯飞信息科技股份有限公司 语音识别方法及系统
CN104485106A (zh) * 2014-12-08 2015-04-01 畅捷通信息技术股份有限公司 语音识别方法、语音识别系统和语音识别设备
CN105206274A (zh) * 2015-10-30 2015-12-30 北京奇艺世纪科技有限公司 一种语音识别的后处理方法及装置和语音识别系统
CN105895090A (zh) * 2016-03-30 2016-08-24 乐视控股(北京)有限公司 语音信号处理方法及装置


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091408A (zh) * 2020-08-04 2022-02-25 科沃斯商用机器人有限公司 文本纠正、模型训练方法、纠正模型、设备及机器人
CN115346531A (zh) * 2022-08-02 2022-11-15 启迪万众网络科技(北京)有限公司 一种语音媒体处理用语音转文字识别系统
CN115662430A (zh) * 2022-10-28 2023-01-31 阿波罗智联(北京)科技有限公司 输入数据解析方法、装置、电子设备和存储介质
CN115662430B (zh) * 2022-10-28 2024-03-29 阿波罗智联(北京)科技有限公司 输入数据解析方法、装置、电子设备和存储介质

Also Published As

Publication number Publication date
CN105895090A (zh) 2016-08-24

Similar Documents

Publication Publication Date Title
WO2017166649A1 (zh) 语音信号处理方法及装置
US12002452B2 (en) Background audio identification for speech disambiguation
US20230082343A1 (en) Systems and Methods for Identifying a Set of Characters in a Media File
CN110800046B (zh) 语音识别及翻译方法以及翻译装置
US20190196779A1 (en) Intelligent personal assistant interface system
WO2017113973A1 (zh) 一种音频识别方法和装置
WO2017201935A1 (zh) 视频播放方法及装置
WO2017166650A1 (zh) 语音识别方法及装置
WO2017166651A1 (zh) 语音识别模型训练方法、说话人类型识别方法及装置
CN110164435A (zh) 语音识别方法、装置、设备及计算机可读存储介质
JP2020004376A (ja) 第三者アプリケーションのインタラクション方法、及びシステム
CN111090727B (zh) 语言转换处理方法、装置及方言语音交互系统
US20150310861A1 (en) Processing natural language user inputs using context data
CN104598502A (zh) 获取播放视频中背景音乐信息的方法、装置及系统
US9224385B1 (en) Unified recognition of speech and music
US12046230B2 (en) Methods for natural language model training in natural language understanding (NLU) systems
US11393455B2 (en) Methods for natural language model training in natural language understanding (NLU) systems
WO2014059863A1 (zh) 字幕查询方法、电子设备及存储介质
WO2018133656A1 (zh) 将语音输入转换成文本输入的方法、装置和语音输入设备
JP2011232619A (ja) 音声認識装置および音声認識方法
WO2022116821A1 (zh) 基于多语言机器翻译模型的翻译方法、装置、设备和介质
US11574127B2 (en) Methods for natural language model training in natural language understanding (NLU) systems
US11392771B2 (en) Methods for natural language model training in natural language understanding (NLU) systems
JP2019185737A (ja) 検索方法及びそれを用いた電子機器
CN109190116B (zh) 语义解析方法、系统、电子设备及存储介质

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16896419

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 16896419

Country of ref document: EP

Kind code of ref document: A1