WO2021073138A1 - Audio output method and system - Google Patents

Audio output method and system

Info

Publication number
WO2021073138A1
Authority
WO
WIPO (PCT)
Prior art keywords
keyword
audio
voice
standard
text information
Prior art date
Application number
PCT/CN2020/097000
Other languages
French (fr)
Chinese (zh)
Inventor
蔡继发
宋飞豹
倪合强
姚寿柏
Original Assignee
苏宁易购集团股份有限公司
苏宁云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏宁易购集团股份有限公司 and 苏宁云计算有限公司
Priority to CA3158353A1
Publication of WO2021073138A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/04 Segmentation; Word boundary detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G06F3/162 Interface to dedicated audio devices, e.g. audio drivers, interface to CODECs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/16 Vocoder architecture
    • G10L19/18 Vocoders using multiple modes
    • G10L19/20 Vocoders using multiple modes using sound class specific coding, hybrid encoders or object based coding

Definitions

  • Audio is a necessary means of promoting effective interaction between people, and between people and sound-producing devices.
  • Most existing knowledge and information services, interactive entertainment, and emotional expression use audio as the medium of information transmission.
  • Audio is characterized by efficient communication and ease of acceptance.
  • Audio needs to carry more and more information.
  • This audio must then be communicated in some way.
  • It is still somewhat difficult for some speakers to convey what they say clearly and accurately.
  • Some people have limited English proficiency, non-standard pronunciation, and many repeated words, so their spoken English may fall short of what they intend to express.
  • The first text information is filtered, and keywords are then extracted.
  • The keywords are marked by number of occurrences; when the same keyword is marked more times than a preset threshold and is not yet in the standard lexicon, the keyword is imported into the standard lexicon.
  • Filtering the first text information includes at least recognizing modal particles and screening out repeated words.
  • A preferred storage area is provided in the standard lexicon to store the highest-priority candidate associated content, which corresponds to keywords whose occurrence count exceeds the preset threshold.
  • Prioritization methods include at least Bayesian decision-making.
  • An embodiment of the present invention also provides an audio output system, including:
  • a collection and input module, used to obtain continuous voice information collected by the client, segment the voice information, decode the segmented voice information, convert it into first text information, and store it;
  • the collection and input module includes a voice cutting unit that converts the voice information into speech-coded data through pulse code modulation according to a sequence of timestamps, performs similarity detection on two consecutive fixed-length pieces of the speech-coded data, marks the optimal endpoints of the repeated splicing region, and cuts out and filters the repeated data;
  • the search association module includes an intelligent search unit and a prioritization unit;
  • the intelligent search unit's keyword search covers synonyms and near-synonyms, homophones, and low-proportion typos of the keywords;
  • the prioritization performed by the prioritization unit includes at least a Bayesian decision method;
  • reinforcement learning is applied to high-frequency keywords so that the standard lexicon continuously stores the corresponding high-priority candidate associated content; the matched text containing the keywords forms second text information, which is ranked by relevance through the Bayesian decision method and converted into corresponding standard audio in priority order, so that, across different application scenarios, the final output is standard audio similar to the voice information the user originally sent;
  • embodiments of the present invention avoid inaccurate or incorrect pronunciation when users communicate through a terminal, and, through machine learning and keyword extraction, can intelligently infer the speech the audio sender wants to express; machine learning of communication patterns improves the user's interactive experience;
  • the audio output method of the embodiments is simple and fast to implement, highly adaptable, and applicable to a wide range of scenarios.
  • FIG. 2 is a schematic diagram of the lexicon-search process in the audio output method provided by the embodiment of the present invention.
  • S1: Acquire continuous voice information collected by the client, segment the voice information, decode the segmented voice information, convert it into first text information, and store it.
  • S2: Filter the first text information, extract keywords, and mark the keywords by number of occurrences; when the same keyword is marked more times than a preset threshold and is not in the standard lexicon, import it into the standard lexicon.
  • The length of each piece of collected audio data is fixed.
  • Each collected piece of audio is 15 seconds long, but the single sampling interval is fixed at 10 seconds, so consecutive pieces share a 5-second audio overlap.
  • The voice information corresponding to the audio therefore needs to be segmented and cut.
  • Segmenting the voice information includes the following steps: converting the voice information into speech-coded data through pulse code modulation according to the sequence of timestamps, detecting the similarity of two consecutive fixed-length pieces of the speech-coded data, marking the optimal endpoints of the repeated splicing region, and cutting out and filtering the repeated data.
  • The following is a further description of step S3, which specifically includes these steps:
  • S301: Search the lexicon for the keywords among the filtered word segments; for segments not present, determine from their part of speech and whether their frequency is greater than three whether to add them to the lexicon.
  • S302: Use deep reinforcement learning to match similar keywords for the word segments, including matching of synonyms and near-synonyms, homophone typos, and correction of low-proportion typos.
  • S303: According to the Bayesian decision mechanism, count the content corresponding to high-frequency keywords and their times of occurrence, and formulate the priority search order.
  • Filtering the first text information includes at least recognizing modal particles and screening out repeated words.
  • The decoding buffer also stores the user's habitual expressions, such as expressions repeated three or more times consecutively in the audio.
  • Relevance across the entire lexicon is ranked from high to low as: decoding buffer, keyword search, and synonym / near-synonym search.
  • The recognized text is processed by word segmentation: existing modal particles such as "um", "ah", "oh", and "yah"; single repeated words such as "yes yes", "ha ha ha", and "la la la"; echoed replies such as "yes yes yes", "good good good", and "right right right"; and colloquialisms such as "that that that" and "six six six" are all filtered out and not entered into the keyword search.
  • In keyword search decoding, synonyms and near-synonyms of the current audio content are defined to match the keywords in the lexicon, in which case the priority is reduced by one level; keyword search decoding also performs intelligent fuzzy matching on the recognized content, for example matching "being bitten by a snake once" to "once bitten by a snake", and "four, four, five" to "rounding", etc.
  • The output of the standard audio corresponds to the application settings of different clients.
  • For a single-sided client, the output is at least three selectable standard audios ranked by relevance; for a double-sided client, the output is the standard audio with the highest relevance.
  • When the corresponding client is the same device, it is a single-sided client; because the input and output then belong to the same audio sender, the client's output can be selected.
  • In that case the standard audio output through the client serves by default as the audio sender's input audio.
  • An embodiment of the present invention also provides an audio output system, including:
  • the collection and input module 1, used to obtain continuous voice information collected by the client terminal 5, segment it, decode the segmented voice information, convert it into first text information, and store it;
  • the keyword processing module 2, used to filter the first text information, extract keywords, and mark them by number of occurrences; when the same keyword is marked more times than a preset threshold and is not in the standard lexicon, the keyword is imported into the standard lexicon;
  • the search association module 3, used to match the extracted keywords in the standard lexicon through machine learning, compose the matched text containing the keywords into second text information, and rank the second text information by relevance;
  • the keyword processing module 2 includes a priority storage unit that sets a preferred storage area in the standard lexicon to store the highest-priority candidate associated content;
  • the candidate associated content corresponds to keywords whose occurrence count exceeds a preset threshold;
  • when searching for keywords, multiple matching items in the lexicon are retrieved according to the keywords, and several items with high relevance are kept as the final output options; in addition, keyword search decoding sets a decoding buffer in the priority storage unit, records decoding results in which keywords occur more often, and raises the relevance value of the selected alternatives;
  • the division into the functional modules above is only an example of how the audio output system provided in the above embodiment outputs audio;
  • in practice, the above functions can be allocated to different functional modules as needed;
  • that is, the internal structure of the audio output system can be divided into different functional modules to complete all or part of the functions described above;
  • the audio output system of the foregoing embodiment and the audio output method embodiment belong to the same concept; for the specific implementation process, refer to the method embodiment, which will not be repeated here.
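As a rough illustration of the modal-particle and repeated-word filtering described above, the sketch below drops filler tokens and collapses consecutive duplicates before keyword search. The filler list and the collapse rule are assumptions for illustration, not the patent's actual implementation.

```python
# Hypothetical filler list; the patent cites fillers such as "um", "ah", "oh", "yah".
FILLERS = {"um", "umm", "ah", "ahh", "oh", "ohh", "yah", "ha", "la"}

def filter_tokens(tokens):
    """Drop filler tokens and keep only one copy of consecutive repeats,
    so "yes yes yes" survives as a single "yes"."""
    out = []
    for tok in tokens:
        if tok.lower() in FILLERS:
            continue  # modal particle: not entered into the keyword search
        if out and tok.lower() == out[-1].lower():
            continue  # consecutive repeat of the previous word
        out.append(tok)
    return out
```

A stricter reading of the text ("these words are filtered and not entered as keyword searches") would drop repeated runs entirely rather than keep one copy; either variant fits the description.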

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An audio output method and system. The method comprises the following steps: obtaining continuous voice information collected by a client, segmenting the voice information, decoding the segmented voice information, converting it into first text information, and storing the first text information (S1); filtering the first text information, extracting keywords, marking each keyword by its number of occurrences, and, when the number of occurrences of the same keyword exceeds a preset threshold and the keyword is not in a standard lexicon, importing the keyword into the standard lexicon (S2); using machine learning to search for the extracted keywords and perform matching, composing the matched text containing the keywords into second text information, and ranking the second text information by relevance (S3); and converting the second text information into corresponding standard audio in priority order and outputting the standard audio (S4). The present invention can recognize audio information, output processed audio information rapidly and accurately, and improve the effect of language expression.

Description

Audio output method and system

Technical Field

The present invention relates to the field of speech recognition, and in particular to an audio output method and system.

Background Art

Audio is a necessary means of promoting effective interaction between people, and between people and sound-producing devices. Most existing knowledge and information services, interactive entertainment, and emotional expression use audio as the medium of information transmission, all of which reflects audio's efficient communication and ease of acceptance. With the development of the information age, audio must carry more and more information, and this audio must then be communicated in some way. However, for some people it is still somewhat difficult to convey what they say clearly and accurately. For example, some people have limited English proficiency, non-standard pronunciation, and many repeated words, so their spoken English may fall short of what they intend to express. For others, problems such as wrong words or speaking too fast may make their speech hard for listeners to understand, or even cause misunderstanding, resulting in omitted voice information, lost key points, or repetition of useless information.

The existing audio prompting approach promotes interaction through text tools such as teleprompters, but it lacks interactive fluency, fails to achieve the rhythmic effect of spoken language, and does not produce the expected interactive feedback; to some extent, the transmission of effective information may even be lost. How to provide a more efficient audio output method and improve the interactive experience between users has therefore become an urgent problem in voice output.
Summary of the Invention

To solve the problems of the prior art, embodiments of the present invention provide an audio output method and system that can automatically and intelligently recognize emitted audio information and quickly and accurately output processed audio prompt information, improving the rhythmic effect of language expression.

To solve the above technical problem, the present invention adopts the following technical solutions.

In a first aspect, an embodiment of the present invention provides an audio output method, including the following steps:

acquiring continuous voice information collected by a client, segmenting the voice information, decoding the segmented voice information, converting it into first text information, and storing it;

filtering the first text information and then extracting keywords, marking the keywords by number of occurrences, and, when the same keyword is marked more times than a preset threshold and is not in a standard lexicon, importing the keyword into the standard lexicon;

using machine learning to search the standard lexicon for the extracted keywords and perform matching, composing the matched text containing the keywords into second text information, and ranking the second text information by relevance;

converting the second text information into corresponding standard audio in priority order and outputting it.
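Read as a data flow, the four steps above can be sketched as a minimal pipeline skeleton. The ASR and TTS stages are stubbed out, and the threshold value, lexicon shape, and relevance ranking are illustrative assumptions rather than the patent's concrete implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AudioOutputPipeline:
    lexicon: dict = field(default_factory=dict)   # keyword -> candidate texts
    counts: dict = field(default_factory=dict)    # keyword occurrence counts
    threshold: int = 3                            # assumed preset threshold

    def s1_transcribe(self, segments):
        # S1: decode segmented speech into first text information (stubbed ASR)
        return " ".join(segments)

    def s2_extract_keywords(self, text):
        # S2: filter, count keywords, import high-frequency new words
        keywords = []
        for word in text.split():
            self.counts[word] = self.counts.get(word, 0) + 1
            if self.counts[word] > self.threshold and word not in self.lexicon:
                self.lexicon[word] = [word]       # import into standard lexicon
            keywords.append(word)
        return keywords

    def s3_match_and_rank(self, keywords):
        # S3: match keywords against the lexicon, rank candidates by relevance
        candidates = [c for k in keywords for c in self.lexicon.get(k, [])]
        return sorted(set(candidates), key=candidates.count, reverse=True)

    def s4_output(self, ranked):
        # S4: convert the top-ranked second text into standard audio (stubbed TTS)
        return ranked[0] if ranked else ""
```

In a real system, `s1_transcribe` would wrap a speech recognizer and `s4_output` a text-to-speech engine; only the keyword bookkeeping in S2/S3 is spelled out here.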
Further, segmenting the voice information includes the following steps: converting the voice information into speech-coded data through pulse code modulation according to a sequence of timestamps, detecting the similarity of two consecutive fixed-length pieces of the speech-coded data, marking the optimal endpoints of the repeated splicing region, and cutting out and filtering the repeated data.

Further, filtering the first text information includes at least recognizing modal particles and screening out repeated words.

Further, a preferred storage area is provided in the standard lexicon to store the highest-priority candidate associated content, which corresponds to keywords whose occurrence count exceeds a preset threshold.

Further, before matching, the keywords are processed by a fuzzy algorithm; the keyword search covers synonyms and near-synonyms, homophones, and low-proportion typos of the keywords; the ranking method includes at least a Bayesian decision method.
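The patent names a Bayesian decision method without specifying it; one plausible reading is a naive-Bayes-style ranking of candidate texts, sketched below. The prior and likelihood tables would in practice be learned from usage counts; all names and values here are illustrative assumptions.

```python
import math

def rank_candidates(keywords, candidates, prior, likelihood):
    """Rank candidate texts by log P(candidate) + sum_k log P(keyword | candidate).
    Unseen (keyword, candidate) pairs get a small smoothing probability."""
    scored = []
    for cand in candidates:
        score = math.log(prior.get(cand, 1e-6))
        for k in keywords:
            score += math.log(likelihood.get((k, cand), 1e-6))
        scored.append((score, cand))
    # highest posterior score first
    return [c for _, c in sorted(scored, reverse=True)]
```

With no keywords the ranking falls back to the priors alone, which matches the intuition that frequently selected candidates should win by default.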
Further, the output of the standard audio corresponds to the application setting of the client: for a single-sided client, the output is at least three selectable standard audios ranked by relevance; for a double-sided client, the output is the standard audio with the highest relevance.
In another aspect, an embodiment of the present invention also provides an audio output system, including:

a collection and input module, used to acquire continuous voice information collected by the client, segment it, decode the segmented voice information, convert it into first text information, and store it;

a keyword processing module, used to filter the first text information, extract keywords, and mark them by number of occurrences; when the same keyword is marked more times than a preset threshold and is not in the standard lexicon, the keyword is imported into the standard lexicon;

a search association module, used to search the standard lexicon for the extracted keywords through machine learning and perform matching, compose the matched text containing the keywords into second text information, and rank the second text information by relevance;

an audio output module, used to convert the second text information into corresponding standard audio in priority order and output it.

Further, the collection and input module includes a voice cutting unit that converts the voice information into speech-coded data through pulse code modulation according to a sequence of timestamps, performs similarity detection on two consecutive fixed-length pieces of the speech-coded data, marks the optimal endpoints of the repeated splicing region, and cuts out and filters the repeated data.

Further, the keyword processing module includes a priority storage unit that sets a preferred storage area in the standard lexicon for storing the highest-priority candidate associated content, which corresponds to keywords whose occurrence count exceeds a preset threshold.

Further, the search association module includes an intelligent search unit and a prioritization unit; the intelligent search unit's keyword search covers synonyms and near-synonyms, homophones, and low-proportion typos of the keywords, and the prioritization performed by the prioritization unit includes at least a Bayesian decision method.
The technical solutions provided by the embodiments of the present invention bring the following beneficial effects.

The audio output method and system disclosed in the embodiments first collect the user's voice information, segment it according to a sequence of timestamps, and convert it into first text information. The first text information is then filtered and keywords are extracted, and the keywords are used to search the standard lexicon for highly relevant candidate associated content. At the same time, machine learning applies reinforcement learning to high-frequency keywords so that the standard lexicon continuously stores the corresponding high-priority candidate content. Through the Bayesian decision method, the matched text containing the keywords forms second text information, which is ranked by relevance and then converted into corresponding standard audio in priority order, so that, adapting to different application scenarios, the final output is standard audio similar to the voice information the user originally sent. The embodiments avoid inaccurate or incorrect pronunciation when users communicate through a terminal, and, through machine learning and keyword extraction, can intelligently infer the speech the audio sender wants to express; machine learning of communication patterns improves the user's interactive experience. Moreover, the audio output method of the embodiments is simple and fast to implement, highly adaptable, and applicable to a wide range of scenarios.
Description of the Drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative work.

FIG. 1 is a schematic flowchart of an audio output method provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the lexicon-search process in the audio output method provided by an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an audio output system provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions, and advantages of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative work fall within the protection scope of the present invention.
Embodiment 1

As shown in FIG. 1, an embodiment of the present invention provides an audio output method, including the following steps:

S1: acquire continuous voice information collected by the client, segment it, decode the segmented voice information, convert it into first text information, and store it;

S2: filter the first text information, extract keywords, and mark the keywords by number of occurrences; when the same keyword is marked more times than a preset threshold and is not in the standard lexicon, import it into the standard lexicon;

S3: use machine learning to search the standard lexicon for the extracted keywords and perform matching, compose the matched text containing the keywords into second text information, and rank the second text information by relevance;

S4: convert the second text information into corresponding standard audio in priority order and output it.
Specifically, in this embodiment the client first collects the user's voice information, which is segmented according to a sequence of timestamps and converted into first text information. The first text information is filtered and keywords are extracted, and the keywords are used to search the standard lexicon for highly relevant candidate associated content. At the same time, machine learning applies reinforcement learning to high-frequency keywords so that the standard lexicon continuously stores the corresponding high-priority candidate content. Through the Bayesian decision method, the matched text containing the keywords forms second text information, which is ranked by relevance and then converted into corresponding standard audio in priority order, so that, adapting to different application scenarios, the final output is standard audio similar to the voice information the user originally sent. This embodiment avoids inaccurate or incorrect pronunciation when users communicate through a terminal, and machine learning of communication patterns improves the user's interactive experience. Moreover, the audio output method of this embodiment is simple and fast to implement and applicable to a wide range of scenarios.
When collecting audio data transmitted from the client, the audio data has a fixed length: for example, each collected segment is 15 seconds long, but the sampling interval is fixed at 10 seconds, so each collected segment shares a 5-second audio overlap region with the previous one. To eliminate the audio in the overlap region, the voice information corresponding to the audio must be segmented. Preferably, the segmentation of the voice information includes the following steps: the voice information is converted, in time-stamp order, into speech-coded data by pulse-code modulation; similarity detection is performed on two consecutive fixed-length segments of the speech-coded data; the repeated splicing region of the speech-coded data is marked with the optimal endpoint; and the repeated data of the speech-coded data is cut out and discarded.
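A minimal sketch of the overlap removal described above, assuming the speech-coded data is available as sequences of PCM samples and using simple element-wise agreement as the similarity measure (the search window and the comparison itself are illustrative assumptions, not the patented procedure):

```python
def trim_overlap(prev_chunk, next_chunk, expected_overlap, search=50):
    """Remove from next_chunk the region that duplicates the tail of
    prev_chunk.  The best split point (the "optimal endpoint") is
    searched around the nominal overlap length by maximising
    element-wise agreement between the two segments."""
    best_len, best_score = expected_overlap, -1.0
    lo = max(1, expected_overlap - search)
    hi = min(len(prev_chunk), len(next_chunk), expected_overlap + search)
    for n in range(lo, hi + 1):
        tail, head = prev_chunk[-n:], next_chunk[:n]
        score = sum(a == b for a, b in zip(tail, head)) / n
        if score > best_score:
            best_score, best_len = score, n
    return next_chunk[best_len:]
```

With 15-second chunks sampled every 10 seconds, `expected_overlap` would be the number of samples in 5 seconds; the split position that maximises agreement plays the role of the "optimal endpoint" mark.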
Specifically, the standard lexicon includes the output information corresponding to all keywords. Preferably, a preferred storage area is provided in the standard lexicon; it stores the candidate associated content of highest priority, which corresponds to keywords whose number of occurrences exceeds a preset threshold. During keyword search, multiple matching items are retrieved from the lexicon according to the keyword, and the matching items with the highest relevance serve as the final output candidates. The keyword processing and search of steps S2 and S3 may be referred to, for short, as keyword search decoding. In this keyword search decoding, a decoding buffer is set up: decoding results whose keywords occur more often are recorded, and the relevance value of the selected candidates is raised accordingly.
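The decoding buffer and preferred storage area can be pictured with a small sketch; the promotion threshold and the linear boost factor are illustrative assumptions, not values taken from the patent:

```python
from collections import Counter

class DecodeBuffer:
    """Sketch of the decoding buffer described above: keyword hit
    counts are tracked, keywords that exceed the preset threshold are
    promoted into the preferred storage area, and candidates tied to
    frequently decoded keywords get their relevance score raised."""

    def __init__(self, promote_threshold=3, boost=0.1):
        self.hits = Counter()
        self.preferred = set()          # the "preferred storage area"
        self.promote_threshold = promote_threshold
        self.boost = boost

    def record(self, keyword):
        self.hits[keyword] += 1
        if self.hits[keyword] > self.promote_threshold:
            self.preferred.add(keyword)

    def score(self, keyword, base_relevance):
        # relevance grows with how often the keyword has been decoded
        return base_relevance + self.boost * self.hits[keyword]
```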
Figure 2 further illustrates step S3, which specifically includes the following steps:
S301: Search the lexicon for the filtered word segments as keywords; for segments not present in the lexicon, judge whether to add them to the lexicon based on part of speech and on whether the frequency is greater than three;
S302: Use deep reinforcement learning to perform approximate keyword matching on the word segments, including matching of synonyms, homophone typos, and correction of low-proportion typos;
S303: Based on a Bayesian decision mechanism, count the content corresponding to high-frequency keywords and their times of occurrence, and formulate a priority search order;
S304: According to the priority rules, output the candidate content in descending order of priority.
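Step S301's add-to-lexicon rule (frequency greater than three, plus a part-of-speech check) might be sketched as follows; the `pos_of` lookup and the content-word classes are hypothetical stand-ins for a real part-of-speech tagger:

```python
def maybe_add_to_lexicon(lexicon, seg_counts, pos_of, content_pos=("n", "v")):
    """Sketch of step S301: a segment missing from the lexicon is added
    only when its observed frequency is greater than three and its part
    of speech is a content-word class.  `pos_of` maps segment -> POS tag
    and stands in for a real tagger."""
    added = []
    for seg, count in seg_counts.items():
        if seg not in lexicon and count > 3 and pos_of.get(seg) in content_pos:
            lexicon.add(seg)
            added.append(seg)
    return added
```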
Preferably, the filtering of the first text information includes at least recognition of modal particles and screening of repeated words. In the keyword search decoding, the decoding buffer also holds the user's habitual expressions, for example expressions repeated several times (three or more) in succession in the audio. In addition, in the keyword search decoding, relevance across the whole lexicon is ranked from high to low as: decoding buffer, keyword search, near-synonym/synonym search. Non-standard (three consecutive) repeated audio and invalid audio intervals are uniformly adjusted and not uploaded. For example, the recognized text is passed through word segmentation, and modal particles such as "嗯嗯嗯", "啊啊啊", "哦哦哦" and "呀呀呀", single repeated words such as "行行行", "对对对", "哈哈哈" and "啦啦啦", answer-like phrases such as "是的是的是的", "好的好的好的" and "对的对的对的", and colloquial fillers such as "那个那个那个", "六六六" and "哦耶哦耶哦耶" are filtered out and not used as keyword-search input.
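The modal-particle and repeated-word filter could be sketched like this; the filler list is illustrative, and the repetition test simply drops any token that is one unit repeated three or more times:

```python
import re

# Illustrative filler list; the patent's examples include many more.
FILLERS = {"嗯", "啊", "哦", "呀", "那个"}

def filter_tokens(tokens):
    """Drop modal particles and tokens that are a single unit repeated
    three or more times (e.g. 对对对, 是的是的是的) so that they are
    never used as keyword-search input."""
    kept = []
    for tok in tokens:
        if tok in FILLERS:
            continue
        # (.+?)\1{2,} matches one unit followed by >= 2 copies of itself
        if re.fullmatch(r"(.+?)\1{2,}", tok):
            continue
        kept.append(tok)
    return kept
```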
Specifically, before the keywords are matched they are also processed by a fuzzy algorithm; searching for a keyword includes searching its synonyms and near-synonyms, homophones, and low-proportion typos. The priority-ranking method includes at least a Bayesian decision method. In the keyword search decoding, synonyms and near-synonyms of the current audio content are defined to match keywords in the lexicon, in which case the priority is lowered by one level; intelligent fuzzy matching is also applied to the recognized content, for example "一旦被蛇咬" is recognized as "一朝被蛇咬", and "四四五入" is recognized as "四舍五入".
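As a rough illustration of this fuzzy step, the sketch below uses Python's `difflib` as a stand-in fuzzy matcher (the patent does not name a specific algorithm) together with a tiny illustrative synonym table; the cutoff value is an assumption:

```python
import difflib

SYNONYMS = {"退钱": "退款"}     # illustrative synonym table

def fuzzy_match(word, lexicon, cutoff=0.6):
    """Match a recognised word against the standard lexicon: exact hit
    first, then a synonym hop (one priority level lower, as in the text
    above), then a closest-string fallback for low-proportion typos."""
    if word in lexicon:
        return word, "exact"
    if SYNONYMS.get(word) in lexicon:
        return SYNONYMS[word], "synonym"
    close = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return (close[0], "fuzzy") if close else (None, "miss")
```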
Preferably, the output of the standard audio corresponds to the application scenario of the client. For a single-sided client, the output is at least three selectable standard audios ranked by relevance; for a double-sided client, the output is the single standard audio of highest relevance. When the corresponding client is one and the same device, it is a single-sided client, and the output can be chosen on the client, because the input and the output then belong to the same audio sender. Specifically, for the output candidate texts, the candidates are priority-ranked under the constraints of the Bayesian decision mechanism, with priority defined from high to low as: frequently occurring and previously selected candidate content (first item), current keyword-search candidate content (first item), follow-up-keyword candidate content (first item), frequently occurring and previously selected candidate content (second item), current keyword-search candidate content (second item), follow-up-keyword candidate content (second item), and so on. Duplicate candidate texts are then filtered out, and the three candidates of highest overall relevance are selected. Audio is generated for these three candidate texts, each for example no longer than 5 s (otherwise only the first 5 s are taken); the audio is converted to MP3 format and, after further processing, transmitted to the client. If the audio at a given time stamp reaches a certain similarity threshold with a candidate audio, all candidate audio data carrying the same time stamp as that candidate is cleared. When there are two different clients, that is, when the audio input end is the audio sender and the audio output end is the audio receiver, only the one standard audio of highest relevance is emitted, and the standard audio output through the client then stands in by default for the input audio of the audio sender.
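The interleaved priority order and top-three selection described above can be sketched as follows, where the three input lists hold the frequently-selected, current-keyword, and follow-up-keyword candidates respectively:

```python
def rank_candidates(history_hits, current_hits, followup_hits, top_n=3):
    """Interleave the three candidate lists in the stated priority
    order (first item of each list, then second item of each, ...),
    drop duplicates, and keep the top three for audio synthesis."""
    merged, seen = [], set()
    longest = max(len(history_hits), len(current_hits), len(followup_hits))
    for rank in range(longest):
        for lst in (history_hits, current_hits, followup_hits):
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged[:top_n]
```

In the double-sided-client case only the first element of the returned list would be synthesised and sent.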
Embodiment 2:
As shown in Figure 3, an embodiment of the present invention further provides an audio output system, including:
a collection and input module 1 for acquiring continuous voice information collected by the client 5, segmenting the voice information, decoding the segmented voice information, and then converting the segmented voice information into first text information and storing it;
a keyword processing module 2 for filtering the first text information and then extracting keywords, marking each keyword according to its number of occurrences; when the number of times the same keyword has been marked exceeds a preset threshold and the keyword is not in the standard lexicon, the keyword is imported into the standard lexicon;
a search association module 3 for matching the extracted keywords after searching for them in the standard lexicon by machine learning, composing the matched text containing the keywords into second text information, and ranking the second text information by relevance;
an audio output module 4 for converting the second text information into the corresponding standard audio in order of priority and outputting it.
Preferably, the collection and input module 1 includes a voice cutting unit, which converts the voice information, in time-stamp order, into speech-coded data by pulse-code modulation, performs similarity detection on two consecutive fixed-length segments of the speech-coded data, marks the repeated splicing region of the speech-coded data with the optimal endpoint, and cuts out the repeated data of the speech-coded data. For example, when collecting audio data transmitted from the client, the audio data has a fixed length: each collected segment is, say, 15 seconds long while the sampling interval is fixed at 10 seconds, so each collected segment shares a 5-second audio overlap region with the previous one; to eliminate the audio in the overlap region, the voice information corresponding to the audio must be segmented.
Preferably, the keyword processing module 2 includes a priority storage unit for setting up a preferred storage area in the standard lexicon; the preferred storage area stores the candidate associated content of highest priority, which corresponds to keywords whose number of occurrences exceeds a preset threshold. During keyword search, multiple matching items are retrieved from the lexicon according to the keyword, and the matching items with the highest relevance serve as the final output candidates. In addition, for keyword search decoding, a decoding buffer is set up in the priority storage unit: decoding results whose keywords occur more often are recorded, and the relevance value of the selected candidates is raised accordingly.
Preferably, the filtering of the first text information performed by the keyword processing module 2 includes at least recognition of modal particles and screening of repeated words. In keyword search decoding, the keyword processing module 2 uniformly adjusts non-standard (three consecutive) repeated audio and invalid audio intervals and does not upload them. For example, the recognized text is passed through word segmentation, and modal particles such as "嗯嗯嗯", "啊啊啊", "哦哦哦" and "呀呀呀", single repeated words such as "行行行", "对对对", "哈哈哈" and "啦啦啦", answer-like phrases such as "是的是的是的", "好的好的好的" and "对的对的对的", and colloquial fillers such as "那个那个那个", "六六六" and "哦耶哦耶哦耶" are filtered out and not used as keyword-search input.
Preferably, the search association module 3 includes an intelligent search unit and a prioritizing unit. The intelligent search unit matches the keywords, including searching their synonyms and near-synonyms, homophones, and low-proportion typos; the priority ranking performed by the prioritizing unit includes at least a Bayesian decision method. Specifically, before the search association module 3 matches the keywords, it also processes them with a fuzzy algorithm. In keyword search decoding, synonyms and near-synonyms of the current audio content are defined to match keywords in the lexicon, in which case the priority is lowered by one level; intelligent fuzzy matching is also applied to the recognized content, for example "一旦被蛇咬" is recognized as "一朝被蛇咬", and "四四五入" is recognized as "四舍五入".
Figure 3 is a schematic structural diagram of the audio output system, in which the client 5 is a single device. For other applications, the client 5 may instead be set up as two different devices, one serving as the audio input end and the other as the audio output end, each held by a different user; the two users can then communicate directly through the two clients, which avoids situations of inaccurate or wrong pronunciation when communicating through a terminal and, through machine learning of the communication style, improves the interactive experience. For a single-sided client, the output is at least three selectable standard audios ranked by relevance; for a double-sided client, the output is the single standard audio of highest relevance. When the corresponding client is one and the same device, it is a single-sided client, and the output can be chosen on the client, because the input and the output then belong to the same audio sender. Specifically, for the output candidate texts, the candidates are priority-ranked under the constraints of the Bayesian decision mechanism, with priority defined from high to low as: frequently occurring and previously selected candidate content (first item), current keyword-search candidate content (first item), follow-up-keyword candidate content (first item), frequently occurring and previously selected candidate content (second item), current keyword-search candidate content (second item), follow-up-keyword candidate content (second item), and so on. Duplicate candidate texts are then filtered out, and the three candidates of highest overall relevance are selected. Audio is generated for these three candidate texts, each for example no longer than 5 s (otherwise only the first 5 s are taken); the audio is converted to MP3 format and, after further processing, transmitted to the client 5. If the audio at a given time stamp reaches a certain similarity threshold with a candidate audio, all candidate audio data carrying the same time stamp as that candidate is cleared. When there are two different clients, that is, when the audio input end is the audio sender and the audio output end is the audio receiver, only the one standard audio of highest relevance is emitted, and the standard audio output through the client then stands in by default for the input audio of the audio sender.
All of the optional technical solutions above may be combined in any manner to form optional embodiments of the present invention, and they are not described one by one here.
It should be noted that when the audio output system provided in the above embodiment outputs audio, the division into the functional modules above is given only as an example; in practical applications, the functions above may be assigned to different functional modules as needed, that is, the internal structure of the audio output system may be divided into different functional modules to complete all or part of the functions described above. In addition, the audio output system provided in the above embodiment and the embodiment of the audio output method share the same concept; for the specific implementation process, refer to the method embodiment, which is not repeated here.
A person of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. An audio output method, characterized in that it comprises the following steps:
    acquiring continuous voice information collected by a client, segmenting the voice information, decoding the segmented voice information, and then converting the segmented voice information into first text information and storing it;
    filtering the first text information and then extracting keywords, marking each keyword according to its number of occurrences; when the number of times the same keyword has been marked exceeds a preset threshold and the keyword is not in a standard lexicon, importing the keyword into the standard lexicon;
    using a machine-learning method, matching the extracted keywords after searching for them in the standard lexicon, composing the matched text containing the keywords into second text information, and ranking the second text information by relevance;
    converting the second text information into corresponding standard audio in order of priority and outputting it.
  2. The audio output method according to claim 1, characterized in that the segmentation of the voice information comprises the following steps: converting the voice information, in time-stamp order, into speech-coded data by pulse-code modulation; performing similarity detection on two consecutive fixed-length segments of the speech-coded data; marking the repeated splicing region of the speech-coded data with the optimal endpoint; and cutting out the repeated data of the speech-coded data.
  3. The audio output method according to claim 1, characterized in that the filtering of the first text information comprises at least recognition of modal particles and screening of repeated words.
  4. The audio output method according to claim 1, characterized in that a preferred storage area is provided in the standard lexicon, the preferred storage area being used to store the candidate associated content of highest priority, the candidate associated content corresponding to keywords whose number of occurrences exceeds a preset threshold.
  5. The audio output method according to claim 1, characterized in that before the keywords are matched, they are further processed by a fuzzy algorithm, and searching for a keyword comprises searching its synonyms and near-synonyms, homophones, and low-proportion typos; the priority-ranking method comprises at least a Bayesian decision method.
  6. The audio output method according to claim 1, characterized in that the output of the standard audio corresponds to the application scenario of different clients: for a single-sided client, the output is at least three selectable standard audios ranked by relevance; for a double-sided client, the output is the standard audio of highest relevance.
  7. An audio output system, characterized in that it comprises:
    a collection and input module for acquiring continuous voice information collected by a client, segmenting the voice information, decoding the segmented voice information, and then converting the segmented voice information into first text information and storing it;
    a keyword processing module for filtering the first text information and then extracting keywords, marking each keyword according to its number of occurrences; when the number of times the same keyword has been marked exceeds a preset threshold and the keyword is not in a standard lexicon, importing the keyword into the standard lexicon;
    a search association module for matching the extracted keywords after searching for them in the standard lexicon by a machine-learning method, composing the matched text containing the keywords into second text information, and ranking the second text information by relevance;
    an audio output module for converting the second text information into corresponding standard audio in order of priority and outputting it.
  8. The audio output system according to claim 7, characterized in that the collection and input module comprises a voice cutting unit for converting the voice information, in time-stamp order, into speech-coded data by pulse-code modulation, performing similarity detection on two consecutive fixed-length segments of the speech-coded data, marking the repeated splicing region of the speech-coded data with the optimal endpoint, and cutting out the repeated data of the speech-coded data.
  9. The audio output system according to claim 7, characterized in that the keyword processing module comprises a priority storage unit for setting up a preferred storage area in the standard lexicon, the preferred storage area being used to store the candidate associated content of highest priority, the candidate associated content corresponding to keywords whose number of occurrences exceeds a preset threshold.
  10. The audio output system according to claim 7, characterized in that the search association module comprises an intelligent search unit and a prioritizing unit; the search for keywords by the intelligent search unit comprises searching their synonyms and near-synonyms, homophones, and low-proportion typos, and the priority ranking performed by the prioritizing unit comprises at least a Bayesian decision method.
PCT/CN2020/097000 2019-10-16 2020-06-19 Audio output method and system WO2021073138A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CA3158353A CA3158353A1 (en) 2019-10-16 2020-06-19 Audio-outputting method and system thereof

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910984646.0A CN110880316A (en) 2019-10-16 2019-10-16 Audio output method and system
CN201910984646.0 2019-10-16

Publications (1)

Publication Number Publication Date
WO2021073138A1 true WO2021073138A1 (en) 2021-04-22

Family

ID=69727955

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/097000 WO2021073138A1 (en) 2019-10-16 2020-06-19 Audio output method and system

Country Status (3)

Country Link
CN (1) CN110880316A (en)
CA (1) CA3158353A1 (en)
WO (1) WO2021073138A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363531A (en) * 2022-01-14 2022-04-15 中国平安人寿保险股份有限公司 H5-based case comment video generation method, device, equipment and medium

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110880316A (en) * 2019-10-16 2020-03-13 苏宁云计算有限公司 Audio output method and system
CN111540361B (en) * 2020-03-26 2023-08-18 北京搜狗科技发展有限公司 Voice processing method, device and medium
CN111524515A (en) * 2020-04-30 2020-08-11 海信电子科技(武汉)有限公司 Voice interaction method and device, electronic equipment and readable storage medium
CN112420034B (en) * 2020-09-14 2023-06-02 当趣网络科技(杭州)有限公司 Speech recognition method, system, electronic device and storage medium
CN113114416B (en) * 2021-03-30 2022-08-12 北海瀚迪智能科技有限公司 Information transmission system, method and computer storage medium
CN113326279A (en) * 2021-05-27 2021-08-31 阿波罗智联(北京)科技有限公司 Voice search method and device, electronic equipment and computer readable medium
CN113516986A (en) * 2021-07-23 2021-10-19 上海传英信息技术有限公司 Voice processing method, terminal and storage medium
CN114124860A (en) * 2021-11-26 2022-03-01 中国联合网络通信集团有限公司 Session management method, device, equipment and storage medium
CN115442273B (en) * 2022-09-14 2023-04-07 润芯微科技(江苏)有限公司 Voice recognition-based audio transmission integrity monitoring method and device
CN116860703B (en) * 2023-07-13 2024-04-16 杭州再启信息科技有限公司 Data processing system, method and storage medium based on artificial intelligence

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106685A1 (en) * 2005-11-09 2007-05-10 Podzinger Corp. Method and apparatus for updating speech recognition databases and reindexing audio and video content using the same
CN105069034A (en) * 2015-07-22 2015-11-18 无锡天脉聚源传媒科技有限公司 Recommendation information generation method and apparatus
CN107305768A (en) * 2016-04-20 2017-10-31 上海交通大学 Easy wrongly written character calibration method in interactive voice
CN107977443A (en) * 2017-12-10 2018-05-01 吴静 A kind of intelligent tutoring method and its system based on speech analysis
CN108664513A (en) * 2017-03-31 2018-10-16 北京京东尚科信息技术有限公司 Method, apparatus and equipment for pushing keyword
CN110880316A (en) * 2019-10-16 2020-03-13 苏宁云计算有限公司 Audio output method and system

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4423327B2 (en) * 2005-02-08 2010-03-03 日本電信電話株式会社 Information communication terminal, information communication system, information communication method, information communication program, and recording medium recording the same
CN101075435B (en) * 2007-04-19 2011-05-18 深圳先进技术研究院 Intelligent chat system and implementation method thereof
CN107342075A (en) * 2016-07-22 2017-11-10 江苏泰格软件有限公司 System and method for executing APS system commands by voice control
CN107220228B (en) * 2017-06-13 2019-08-16 深圳市鹰硕技术有限公司 Teaching recording and broadcasting data correction device
CN107341251A (en) * 2017-07-10 2017-11-10 江西博瑞彤芸科技有限公司 Method for extracting and processing medical prescriptions and keywords
CN108053823A (en) * 2017-11-28 2018-05-18 广西职业技术学院 Speech recognition system and method
CN108647346B (en) * 2018-05-15 2021-10-29 苏州东巍网络科技有限公司 Voice interaction method and system for elderly users of wearable electronic devices
CN109145276A (en) * 2018-08-14 2019-01-04 杭州智语网络科技有限公司 Pinyin-based text correction method for speech-to-text output

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363531A (en) * 2022-01-14 2022-04-15 中国平安人寿保险股份有限公司 H5-based case comment video generation method, device, equipment and medium
CN114363531B (en) * 2022-01-14 2023-08-01 中国平安人寿保险股份有限公司 H5-based text description video generation method, device, equipment and medium

Also Published As

Publication number Publication date
CN110880316A (en) 2020-03-13
CA3158353A1 (en) 2021-04-22

Similar Documents

Publication Publication Date Title
WO2021073138A1 (en) Audio output method and system
KR102315732B1 (en) Speech recognition method, device, apparatus, and storage medium
EP3835973A1 (en) Keyword extraction method, keyword extraction device and computer-readable storage medium
EP2252995B1 (en) Method and apparatus for voice searching for stored content using uniterm discovery
WO2018157789A1 (en) Speech recognition method, computer, storage medium, and electronic apparatus
EP3179475A1 (en) Voice wakeup method, apparatus and system
CN113327609B (en) Method and apparatus for speech recognition
WO2012027095A1 (en) Speech recognition language model
WO2008124368A1 (en) Method and apparatus for distributed voice searching
KR20090111825A (en) Method and apparatus for language independent voice indexing and searching
CN109033060B (en) Information alignment method, device, equipment and readable storage medium
CN111832308B (en) Speech recognition text consistency processing method and device
TW201606750A (en) Speech recognition using a foreign word grammar
CN106383814B (en) English social media short text word segmentation method
CN106713111B (en) Processing method for adding friends, terminal and server
CN111279333B (en) Language-based search of digital content in a network
CN104199825A (en) Information inquiry method and system
WO2024045475A1 (en) Speech recognition method and apparatus, and device and medium
WO2012004955A1 (en) Text correction method and recognition method
CN111209367A (en) Information searching method, information searching device, electronic equipment and storage medium
CN109933773A (en) Multi-semantic sentence analysis system and method
CN107424612A (en) Processing method, device and machine-readable medium
CN110413770B (en) Method and device for classifying group messages into group topics
CN111046168A (en) Method, apparatus, electronic device, and medium for generating patent summary information
CN116450799A (en) Intelligent dialogue method and equipment applied to traffic management service

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20877756

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 3158353

Country of ref document: CA

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20877756

Country of ref document: EP

Kind code of ref document: A1
