WO2022037526A1 - Speech recognition method and apparatus, electronic device, and storage medium - Google Patents

Speech recognition method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2022037526A1
WO2022037526A1 · PCT/CN2021/112754 · CN2021112754W
Authority
WO
WIPO (PCT)
Prior art keywords
industry
vector
speech recognition
text information
word
Prior art date
Application number
PCT/CN2021/112754
Other languages
English (en)
French (fr)
Inventor
徐文铭 (Xu Wenming)
郑翔 (Zheng Xiang)
杨晶生 (Yang Jingsheng)
Original Assignee
北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司
Publication of WO2022037526A1 publication Critical patent/WO2022037526A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/14 Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 Hidden Markov Models [HMMs]
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Definitions

  • the embodiments of the present disclosure relate to the field of computer technologies, for example, to a speech recognition method, apparatus, electronic device, and storage medium.
  • the server can use Automatic Speech Recognition (ASR) technology to transcribe the audio data into text and send the transcribed text to the corresponding client, so that the client terminal displays the subtitles corresponding to the audio data.
  • the thesaurus used by the ASR model in the related art is usually a general-purpose thesaurus. When speech recognition is performed based on the general thesaurus in a communication scenario containing specialized terminology of a specific field, the recognition accuracy is low and the user experience is poor.
  • Embodiments of the present disclosure provide a speech recognition method, apparatus, electronic device, and storage medium, which improve the recognition accuracy of audio data and improve user experience.
  • an embodiment of the present disclosure provides a speech recognition method, including:
  • in response to the current condition satisfying the vocabulary selection condition, obtaining the subtitle text information within the communication range and extracting keywords of the subtitle text information; determining a representation vector of the subtitle text information according to the word vectors of the keywords, and selecting, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information; and, based on the industry-specific thesaurus corresponding to the target industry representation vector, performing speech recognition on the pulled audio data within the communication range.
  • an embodiment of the present disclosure further provides a speech recognition device, including:
  • a keyword extraction module configured to, in response to the current condition satisfying the vocabulary selection condition, obtain the subtitle text information within the communication range and extract keywords of the subtitle text information;
  • an industry representation vector selection module configured to determine the representation vector of the subtitle text information according to the word vectors of the keywords, and to select, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information;
  • a speech recognition module configured to perform speech recognition on the pulled audio data within the communication range based on the industry-specific thesaurus corresponding to the target industry representation vector.
  • an embodiment of the present disclosure further provides an electronic device, the electronic device comprising:
  • storage means arranged to store at least one program, and at least one processor;
  • when the at least one program is executed by the at least one processor, the at least one processor implements the speech recognition method according to the first aspect of the present disclosure.
  • an embodiment of the present disclosure further provides a storage medium containing computer-executable instructions, the computer-executable instructions, when executed by a computer processor, being used to execute the speech recognition method according to the first aspect of the present disclosure.
  • FIG. 2 is a schematic flowchart of a speech recognition method provided in Embodiment 2 of the present disclosure
  • FIG. 3 is a schematic structural diagram of a speech recognition apparatus according to Embodiment 3 of the present disclosure.
  • FIG. 4 is a schematic structural diagram of an electronic device according to Embodiment 4 of the present disclosure.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term "based on" is "based at least in part on".
  • the term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a schematic flowchart of a speech recognition method according to Embodiment 1 of the present disclosure, and the embodiment of the present disclosure is applicable to a situation of performing speech recognition in a communication scenario that includes technical terms in a specific field.
  • the method may be performed by a speech recognition apparatus, which may be implemented in the form of software and/or hardware, and the apparatus may be configured in an electronic device, for example, in a backend server of a communication application.
  • the speech recognition method provided by this embodiment includes:
  • the communication range may be considered as a range composed of clients participating in the communication.
  • a plurality of participating clients of the multimedia conference may form a communication range.
  • Different communication scopes may have different industry fields involved in the communication content; for the same communication scope, as the communication process continues, the industry fields involved in the communication content may change.
  • the subtitle text information is obtained for each communication range, which helps, when the target industry-specific thesaurus is subsequently selected for each communication range, to dynamically update the target industry-specific thesaurus that currently best matches the communication range, thereby improving the accuracy of speech recognition.
  • the clients participating in the communication can upload audio data to the streaming media server of the communication application in real time during the communication process; correspondingly, the backend server of the communication application can pull, from the streaming media server, the audio data uploaded by the clients within each communication range and transcribe it into subtitle text information based on the ASR technology, so as to obtain the subtitle text information of each communication range.
  • the back-end server of the communication application may also, in response to a subtitle-opening request sent by a client within a communication range, pull from the streaming media server the audio data within the communication range to which the client belongs, so as to obtain the subtitle text information within that communication range.
  • when the back-end server does not receive a subtitle-opening request from any client within a communication range, it does not need to provide speech recognition services for the clients within that communication range, thereby saving, to a certain extent, the resource consumption of the back-end server.
  • the back-end server may perform information mining on the subtitle text information within each communication range to extract keywords in the subtitle text information.
  • a keyword can be considered as a professional term of an industry field contained in the subtitle text information, and can be used to determine the industry field involved in the communication range.
  • the back-end server may extract at least one keyword of the subtitle text information based on a natural language processing (Natural Language Processing, NLP) algorithm.
  • extracting keywords of the subtitle text information may include: extracting word information whose part of speech is a preset part of speech from the subtitle text information, filtering preset common words out of the extracted word information, and using the filtered word information as keywords.
  • extracting the word information whose part of speech is a preset part of speech may be performed by tagging the words in the subtitle text information based on a part-of-speech tagging algorithm among the NLP algorithms, and then extracting the words whose tagged part of speech matches the preset part of speech.
  • the part-of-speech tagging algorithm may be, for example, a Hidden Markov Model (HMM) or Conditional Random Fields (CRFs).
  • the preset part-of-speech can be, for example, nouns, verbs and other parts of speech.
  • the preset common words may be common words under the preset part of speech.
  • the preset common words may be, for example, common words such as "customer" and "colleague".
  • filtering the preset common words from the extracted word information may be to delete the word information belonging to the preset common words in the extracted word information.
  • the filtered word information can be considered as professional vocabulary in the industry field.
  • the extraction of keywords in the subtitle text information can be realized by extracting and filtering the text information of the preset part of speech in the subtitle text information.
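The keyword-extraction step above can be sketched as follows. This is an illustrative stand-in, not the patent's actual implementation: the tagger output, preset parts of speech, and common-word list are hypothetical examples, and any real POS tagger (e.g. HMM- or CRF-based, as the text suggests) could supply the `(word, pos)` pairs.

```python
# Hypothetical sketch: keep words with a preset part of speech
# (nouns/verbs), then filter out preset common words.
PRESET_POS = {"NOUN", "VERB"}
PRESET_COMMON_WORDS = {"customer", "colleague"}  # example common words from the text

def extract_keywords(tagged_words):
    """tagged_words: list of (word, pos) pairs from any POS tagger."""
    candidates = [w for w, pos in tagged_words if pos in PRESET_POS]
    return [w for w in candidates if w not in PRESET_COMMON_WORDS]

tagged = [("our", "PRON"), ("customer", "NOUN"), ("uses", "VERB"),
          ("convolution", "NOUN"), ("kernels", "NOUN")]
print(extract_keywords(tagged))  # ['uses', 'convolution', 'kernels']
```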
  • S120: Determine the representation vector of the subtitle text information according to the word vectors of the keywords, and select a target industry representation vector similar to the representation vector of the subtitle text information from the preset industry representation vectors.
  • when there is one keyword, the word vector of the keyword can be used as the representation vector of the subtitle text information; when there are multiple keywords, the word vectors of the multiple keywords can be integrated to determine the representation vector of the subtitle text information.
  • the word vectors of multiple keywords may be integrated, for example, by configuring weights for the multiple keywords and combining their word vectors based on weighted summation, so as to determine the representation vector of the subtitle text information.
  • the preset industry representation vectors can be set in the following way: screen professional terms of several industry fields; construct the professional terms of the same industry field into an industry-specific thesaurus, so as to obtain several industry-specific thesauri; train a word-to-vector (Word2vec) model on the professional terms in the several industry-specific thesauri to output the word vectors of the professional terms; and determine, according to the word vectors of the professional terms in each industry-specific thesaurus, the representation vector of the corresponding industry-specific thesaurus, using that representation vector as the industry representation vector.
  • the representation vector of the corresponding industry-specific thesaurus is determined according to the word vectors of the specialized terms in the same industry field. For example, the average value of the word vectors of the specialized terms in the same industry field can be used as the representation vector of the corresponding industry-specific thesaurus.
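A minimal sketch of this averaging step, with toy 3-dimensional lists standing in for Word2vec output (the thesaurus name and term vectors below are illustrative assumptions):

```python
# Each industry representation vector is the average of the word vectors
# of the specialized terms in that industry's thesaurus, as described above.
def mean_vector(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

# hypothetical per-term vectors for a "medical" thesaurus
medical_term_vectors = {
    "stethoscope": [0.9, 0.1, 0.0],
    "diagnosis":   [0.7, 0.3, 0.0],
}
industry_vectors = {"medical": mean_vector(list(medical_term_vectors.values()))}
print(industry_vectors["medical"])  # approximately [0.8, 0.2, 0.0]
```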
  • determining the representation vector of the subtitle text information according to the word vector of the keyword includes: loading the word vector for at least one keyword, and using the average vector of the loaded word vectors as the subtitle text A representation vector of information.
  • performing word vector loading on at least one keyword includes: judging whether the preset corpus contains the at least one keyword; in response to the preset corpus containing all of the at least one keyword, reading the word vector of the at least one keyword from the preset word vector library according to the preconfigured correspondence between the corpus and the word vector library; in response to the preset corpus containing none of the at least one keyword, converting the at least one keyword into a word vector using the pre-trained word vector model; and in response to the preset corpus containing some of the at least one keyword, reading the word vectors of those keywords from the preset word vector library according to the preconfigured correspondence between the corpus and the word vector library, and converting the keywords not contained in the preset corpus into word vectors using the pre-trained word vector model; wherein the word vector model is trained based on the preset corpus.
  • the preset corpus may be, for example, several industry-specific lexicons constructed during the setting process of the preset industry representation vector.
  • the word vector library can be constructed, for example, in the following way: in the process of setting the preset industry representation vectors, the word vectors of the professional terms in the exclusive thesaurus of the same industry are constructed into a word vector library, so as to obtain several preset word vector libraries.
  • the correspondence between the corpus and the word vector library can be configured, for example, in the following manner: in the process of setting the preset industry representation vectors, configure the correspondence between each professional term and the word vector of that professional term, thereby configuring the correspondence between the corpus and the word vector library.
  • the pre-trained word vector model may be, for example, the Word2vec model output after being trained by the Word2vec model in the process of setting the preset industry representation vector.
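The loading logic above (library lookup with a model fallback) can be sketched as below. The names `word_vector_library` and `fallback_model` are illustrative assumptions; in the scheme described, the library would hold the precomputed Word2vec vectors and the fallback would be the pre-trained model's inference.

```python
# Keywords found in the preset corpus are read from the prebuilt
# word-vector library; missing keywords fall back to the trained model.
word_vector_library = {"convolution": [0.1, 0.9]}  # corpus term -> vector

def fallback_model(word):
    # stand-in for the pre-trained Word2vec model's inference
    return [0.0, 0.0]

def load_word_vectors(keywords):
    vectors = []
    for kw in keywords:
        if kw in word_vector_library:      # keyword is in the preset corpus
            vectors.append(word_vector_library[kw])
        else:                              # keyword absent: use the model
            vectors.append(fallback_model(kw))
    return vectors

print(load_word_vectors(["convolution", "quark"]))  # [[0.1, 0.9], [0.0, 0.0]]
```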
  • selecting a target industry representation vector similar to the representation vector of the subtitle text information from the preset industry representation vectors includes: separately calculating the similarity between the representation vector of the subtitle text information and each candidate preset industry representation vector, and using the preset industry representation vector corresponding to the maximum similarity as the target industry representation vector.
  • the similarity between the representation vector of the subtitle text information and a candidate preset industry representation vector may be calculated, for example, as the cosine similarity, the Euclidean distance, or the Manhattan distance between the two vectors.
  • by using the preset industry representation vector corresponding to the maximum similarity as the target industry representation vector, the industry field involved in the communication range and the best-matching industry-specific thesaurus can be determined, so as to improve the recognition accuracy of the audio data within the communication range.
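The maximum-similarity selection can be sketched with cosine similarity (one of the measures the text names). The industry names and vectors are toy examples:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def select_target_industry(caption_vec, industry_vectors):
    # keep the candidate industry whose representation vector is most
    # similar to the caption representation vector
    return max(industry_vectors, key=lambda name: cosine(caption_vec, industry_vectors[name]))

industries = {"medical": [1.0, 0.0], "finance": [0.0, 1.0]}
print(select_target_industry([0.9, 0.1], industries))  # medical
```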
  • the back-end server may store the several industry-specific thesauri constructed when the industry representation vectors are preset, and, when determining the target industry representation vector for each communication range, select the target industry representation vector from among those corresponding to the several industry-specific thesauri. For each communication range, the back-end server can also load the industry-specific thesaurus corresponding to the selected target industry representation vector into the storage space corresponding to that communication range, so that speech recognition is performed on the audio data within the communication range according to the industry-specific thesaurus in that storage space, avoiding misuse of industry-specific thesauri across communication ranges during speech recognition.
  • based on the industry-specific thesaurus corresponding to the target industry representation vector, speech recognition is performed on the pulled audio data within the communication range: the industry-specific thesaurus can be used as a dictionary in the speech recognition process; further, according to the pre-trained acoustic model, the dictionary, and the language model, the audio data is transcribed into text to realize speech recognition.
  • performing speech recognition on the pulled audio data within the communication range includes: sending the industry-specific thesaurus corresponding to the target industry representation vector to the speech recognition engine; and sending the pulled audio data within the communication range to the speech recognition engine, so that the speech recognition engine configures the words in the industry-specific thesaurus as hot words of the communication range and performs speech recognition on the audio data within the communication range.
  • the backend server usually utilizes a speech recognition engine (which may be referred to as an ASR engine) to perform speech recognition in multiple communication ranges.
  • the back-end server can send the industry-specific thesaurus in the storage space corresponding to the communication range to the ASR engine, and can also send the pulled audio data within the communication range to the ASR engine.
  • the ASR engine can configure the professional terms in the industry-specific thesaurus as hot words in the communication range, and the configured hot words can take effect in real time.
  • performing speech recognition on the audio data within the communication range may, for example, proceed as follows: first, use a pre-trained acoustic model to output the phoneme information of the audio data within the communication range (such as Chinese pinyin or English phonemes); then, find words matching the phoneme information from the original thesaurus and the configured hot words; finally, input the found matching words into the pre-trained language model, so that the language model calculates the relative probabilities of the input matched words (where the probability of a word set as a hot word can be correspondingly increased), and the word with the highest probability is output as the speech recognition result.
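The hot-word boosting described above can be sketched as follows. This is a hedged illustration, not the engine's actual scoring: the boost factor, language-model scores, and candidate words are all hypothetical.

```python
# Candidate words matching the phoneme information are scored by a language
# model; words configured as hot words get their score raised before the
# best candidate is chosen.
HOTWORD_BOOST = 1.5  # hypothetical multiplicative boost

def pick_word(candidates, lm_scores, hot_words):
    def score(w):
        s = lm_scores[w]
        return s * HOTWORD_BOOST if w in hot_words else s
    return max(candidates, key=score)

# "tensor" is a domain term configured as a hot word for this session,
# so it wins over the acoustically similar common word "tender"
scores = {"tender": 0.5, "tensor": 0.4}
print(pick_word(["tender", "tensor"], scores, hot_words={"tensor"}))  # tensor
```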
  • the back-end server may be a real-time communication server, such as an instant messaging server, a multimedia conference server, a live video server, or a group chat interaction server. While providing communication services for clients, the real-time communication server can also provide a subtitle service for each communication client. When a client requests to enable the subtitle service within its communication range, and the communication range involves technical terms of some industry fields, the real-time communication server may perform voice recognition on the audio data within the communication range to which the client belongs based on the speech recognition method provided in this embodiment, thereby improving the accuracy of voice recognition and improving user experience.
  • in the technical solution of this embodiment, in response to the current condition satisfying the vocabulary selection condition, the subtitle text information within the communication range is obtained, and the keywords of the subtitle text information are extracted; the representation vector of the subtitle text information is determined according to the word vectors of the keywords, and a target industry representation vector similar to the representation vector of the subtitle text information is selected from the preset industry representation vectors; based on the industry-specific thesaurus corresponding to the target industry representation vector, speech recognition is performed on the pulled audio data within the communication range.
  • an industry-specific thesaurus that matches the specialized terminology in the field can be selected for speech recognition, thereby improving the accuracy of speech recognition and improving the user experience.
  • the embodiments of the present disclosure may be combined with various optional solutions in the speech recognition method provided in the above-mentioned embodiments.
  • the speech recognition method provided in this embodiment can perform speech recognition on the audio data in the communication range based on similar words of the keywords, which enriches the recognition scheme; it can also update the industry-specific thesaurus and/or the similar words used for speech recognition, so that speech recognition is performed according to the updated industry-specific thesaurus and/or similar words, thereby improving speech recognition accuracy.
  • FIG. 2 is a schematic flowchart of a speech recognition method according to Embodiment 2 of the present disclosure. As shown in Figure 2, the speech recognition method provided by this embodiment includes:
  • the moment when the first client joins the communication range may be used as the communication start moment of the communication range.
  • the back-end server counts the subtitle text information within the communication range from the communication start time. When the number of pieces of subtitle text information accumulated within the communication range reaches the first preset value for the first time, the accumulated first preset value pieces of subtitle text information can be acquired for the first time, and the keywords of the subtitle text information can be extracted.
  • the first preset value may be preset according to an empirical value or an experimental value, for example, it may be 30 or 50 items.
  • the first preset value pieces of subtitle text information within the communication range may be considered as subtitle text information obtained by performing speech recognition on the audio data based on the original vocabulary, before the ASR engine configures hot words for the communication range.
  • the preset corpora disclosed in this embodiment and the above-mentioned embodiments are the same corpus, for example, several industry-specific lexicons constructed during the setting process of the preset industry representation vector.
  • the back-end server can calculate the similarity (such as cosine similarity) between the word vector of the keyword and the word vectors of words in the multiple industry-specific thesauri, select a preset number of similarities from among those greater than a preset threshold, and use the words corresponding to the selected similarities as similar words of the keyword.
  • the preset threshold can be set according to an empirical value or an experimental value, for example, it can be 0.7, 0.8 or 0.9.
  • the preset number can also be set according to an empirical value or an experimental value, for example, one or two. When the number of similarities greater than the preset threshold is less than the preset number, all words corresponding to similarities greater than the preset threshold may be selected as similar words of the keyword.
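The threshold-then-top-N selection above can be sketched as follows. The threshold (0.8) and count (2) follow the example values in the text; the vocabulary and vectors are toy stand-ins:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def similar_words(keyword_vec, vocab_vectors, threshold=0.8, top_n=2):
    # keep only similarities above the preset threshold, then take at most
    # the preset number of most-similar words
    scored = [(w, cosine(keyword_vec, v)) for w, v in vocab_vectors.items()]
    above = [(w, s) for w, s in scored if s > threshold]
    above.sort(key=lambda ws: ws[1], reverse=True)
    return [w for w, _ in above[:top_n]]

vocab = {"GPU": [1.0, 0.0], "CUDA": [0.9, 0.1], "bond": [0.0, 1.0]}
print(similar_words([1.0, 0.05], vocab))  # ['GPU', 'CUDA'] -- "bond" falls below threshold
```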
  • the backend server may also store the similar words in the storage space corresponding to the communication range, send the similar words in that storage space to the ASR engine, and send the pulled audio data within the communication range to the ASR engine.
  • the ASR engine can configure similar words as hot words in the communication range, so that the ASR engine can perform speech recognition in the communication range according to the similar words and improve speech recognition accuracy.
  • both steps S231-S241 and steps S232-S242 may be executed, or only one branch may be executed (that is, only steps S231-S241 are executed without steps S232-S242, or only steps S232-S242 are executed without steps S231-S241).
  • when steps S231-S241 and S232-S242 are both executed, their execution order has no strict timing requirement, and the ASR engine can configure both the words in the industry-specific thesaurus and the similar words as hot words of the communication range. Performing voice recognition on the audio data within the communication range according to this richer set of hot words can improve the recognition accuracy within the communication range.
  • every preset time, obtain the subtitle text information generated within that preset time within the communication range; and/or, every time a second preset value of pieces of subtitle text information is accumulated within the communication range, obtain those second preset value pieces of subtitle text information.
  • the subtitle text information can be cyclically obtained from the first determination of the industry-specific thesaurus until the end of the communication.
  • the subtitle text information may be acquired cyclically from the first determination of the similar words until the end of the communication.
  • the cyclically obtained subtitle text information can be considered as subtitle text information obtained by performing speech recognition on audio data based on the original thesaurus and the hot words after the ASR engine configures the hot words for the communication range.
  • the preset time can be preset according to an empirical value or an experimental value, for example, 3 minutes or 5 minutes; the second preset value can also be preset according to an empirical value or an experimental value, and may be the same as or different from the first preset value.
  • the back-end server cyclically obtains the subtitle text information, which makes it possible to dynamically update the target industry-specific thesaurus and/or similar words used to recognize the audio data, thereby improving speech recognition accuracy.
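The two refresh triggers described above (elapsed time and accumulated caption count) can be sketched as below. The concrete values (300 seconds, 30 lines) are illustrative settings consistent with the examples in the text, not fixed choices of the patent:

```python
# Refresh the caption window either when the preset time interval has
# elapsed, or when a second preset number of new caption lines accumulated.
PRESET_INTERVAL_S = 300   # e.g. 5 minutes
SECOND_PRESET_COUNT = 30  # e.g. 30 caption lines

def should_refresh(elapsed_s, new_caption_count):
    return elapsed_s >= PRESET_INTERVAL_S or new_caption_count >= SECOND_PRESET_COUNT

print(should_refresh(120, 5))    # False -- neither condition met
print(should_refresh(310, 5))    # True -- time interval elapsed
print(should_refresh(120, 30))   # True -- caption count reached
```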
  • the backend server may obtain the subtitle text information, and may repeat steps S220 and S231 to select the latest target industry representation vector.
  • the manner of updating the corresponding industry-specific thesaurus may be coverage update or incremental update.
  • the audio data can be recognized according to the industry-specific thesaurus that best matches the communication range, thereby improving the accuracy of speech recognition.
  • using the industry-specific thesaurus corresponding to the newly selected target industry representation vector to update the industry-specific thesaurus corresponding to the previously selected target industry representation vector may include: judging whether the newly selected target industry representation vector is the same as the previously selected target industry representation vector; and, in response to the newly selected target industry representation vector differing from the previously selected one, replacing the industry-specific thesaurus corresponding to the previously selected target industry representation vector with the industry-specific thesaurus corresponding to the newly selected target industry representation vector.
  • the storage space overhead corresponding to the communication range can be saved to a certain extent.
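This change-detecting coverage update can be sketched as follows. The in-memory `session` dict is an illustrative stand-in for the storage space corresponding to a communication range:

```python
# Replace the thesaurus only when the newly selected target industry differs
# from the previously selected one, saving storage-space overhead otherwise.
session = {"target_industry": "finance", "thesaurus": ["bond", "equity"]}

def update_thesaurus(session, new_industry, thesauri):
    if new_industry == session["target_industry"]:
        return False                                  # unchanged: skip the rewrite
    session["target_industry"] = new_industry
    session["thesaurus"] = list(thesauri[new_industry])  # coverage update
    return True

thesauri = {"finance": ["bond", "equity"], "medical": ["stethoscope"]}
print(update_thesaurus(session, "finance", thesauri))  # False -- unchanged
print(update_thesaurus(session, "medical", thesauri))  # True -- replaced
```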
  • the backend server may acquire the subtitle text information, and may repeat steps S220 and S232 to select the latest similar words.
  • the manner of updating the corresponding similar words may be coverage update or incremental update.
  • the audio data can be recognized according to the most matching similar words in the communication range, thereby improving the accuracy of speech recognition.
  • when steps S231-S241 and S232-S242 are both executed, steps S261-S281 and S262-S282 are also executed, and the execution order of steps S261-S281 and S262-S282 has no strict timing requirement. If only steps S231-S241 are executed, only steps S261-S281 are executed accordingly; if only steps S232-S242 are executed, only steps S262-S282 are executed accordingly.
  • the technical solutions of the embodiments of the present disclosure can perform speech recognition on audio data in the communication range based on similar words of the keywords, which enriches the recognition solution; they can also update the industry-specific thesaurus and/or similar words used for speech recognition, so that speech recognition is performed according to the updated industry-specific thesaurus and/or similar words, improving speech recognition accuracy.
  • the speech recognition method provided by the embodiment of the present disclosure and the speech recognition method provided by the above-mentioned embodiment belong to the same technical concept, and the technical details not described in detail in this embodiment can be referred to the above-mentioned embodiment.
  • FIG. 3 is a schematic structural diagram of a speech recognition apparatus according to Embodiment 3 of the present disclosure.
  • the speech recognition device provided in this embodiment is suitable for the situation of performing speech recognition in a communication scenario including specialized terms in a specific field.
  • the speech recognition device includes:
  • the keyword extraction module 310 is configured to, in response to the current condition satisfying the vocabulary selection condition, obtain the subtitle text information within the communication range and extract keywords of the subtitle text information;
  • the industry representation vector selection module 320 is configured to determine the representation vector of the subtitle text information according to the word vectors of the keywords, and select a target industry representation vector similar to the representation vector of the subtitle text information from the preset industry representation vectors;
  • the speech recognition module 330 is configured to perform speech recognition on the pulled audio data within the communication range based on the industry-specific thesaurus corresponding to the target industry representation vector.
  • the keyword extraction module includes:
  • the keyword extraction sub-module is configured to extract word information whose part of speech is a preset part of speech in the subtitle text information, filter preset common words from the extracted word information, and use the filtered word information as a keyword.
  • the number of the keywords is at least one, and the industry characterization vector selection module includes:
  • the sub-module for determining the subtitle representation vector is configured to load the word vector for at least one keyword, and use the average vector of the loaded word vectors as the representation vector of the subtitle text information.
  • the sub-module for determining the subtitle representation vector includes:
  • the loading unit is configured to judge whether the preset corpus contains the at least one keyword; in response to the preset corpus containing all of the at least one keyword, read the word vector of the at least one keyword from the preset word vector library according to the preconfigured correspondence between the corpus and the word vector library; in response to the preset corpus containing none of the at least one keyword, convert the at least one keyword into a word vector using the pre-trained word vector model; and in response to the preset corpus containing some of the at least one keyword, read the word vectors of those keywords from the preset word vector library according to the preconfigured correspondence, and convert the keywords not contained in the preset corpus into word vectors using the pre-trained word vector model; wherein the word vector model is trained based on the preset corpus.
  • the industry representation vector selection module includes:
  • the industry representation vector selection sub-module is set to calculate the similarity between the representation vector of the subtitle text information and the candidate preset industry representation vector respectively, and use the preset industry representation vector corresponding to the maximum similarity as the target industry representation vector.
  • the speech recognition module includes:
  • Thesaurus sending sub-module is set to send the industry-specific thesaurus corresponding to the target industry representation vector to the speech recognition engine;
  • the audio data sending sub-module is configured to send the pulled audio data within the communication range to the speech recognition engine, so that the speech recognition engine configures the industry-specific thesaurus as hot words of the communication range and performs speech recognition on the audio data within the communication range according to the hot words of the communication range.
  • the speech recognition apparatus further includes:
  • the similar word selection module is set to select a word vector similar to the word vector of the keyword from the preset corpus, and use the word corresponding to the similar word vector as the similar word of the keyword;
  • the speech recognition module is further configured to perform speech recognition on the pulled audio data within the communication range based on similar words.
  • the industry representation vector selection module is further configured to select the latest target industry representation vector every time the current condition satisfies the thesaurus selection condition;
  • Thesaurus updating module is set to use the industry-specific thesaurus corresponding to the newly selected target industry representation vector to update the industry-specific thesaurus corresponding to the previously selected target industry representation vector;
  • the speech recognition module is set to perform speech recognition on the pulled audio data within the communication range based on the updated industry-specific thesaurus.
  • the thesaurus updating module is configured to: determine whether the newly selected target industry representation vector is the same as the previously selected target industry representation vector; and in response to the newly selected target industry representation vector being different from the previously selected target industry representation vector, replace the industry-specific thesaurus corresponding to the previously selected target industry representation vector with the industry-specific thesaurus corresponding to the newly selected target industry representation vector.
  • the similar word selection module is configured to select the latest similar words each time the current condition satisfies the thesaurus selection condition;
  • the similar word update module is configured to update the previously determined similar words with the newly selected similar words;
  • the speech recognition module is configured to perform speech recognition on the pulled audio data within the communication range based on the updated similar words.
  • the similar word update module is configured to de-duplicate the words in the newly selected similar words that are the same as words in the previously determined similar words, and to use the de-duplicated newly selected similar words together with the previously determined similar words as the updated similar words.
  • the speech recognition apparatus is applied to a real-time communication server
  • the real-time communication server includes at least one of an instant messaging server, a multimedia conference server, a live video server, and a group chat interaction server.
  • the speech recognition apparatus provided by the embodiment of the present disclosure can execute the speech recognition method provided by any embodiment of the present disclosure, and has functional modules corresponding to the execution method.
  • Referring to FIG. 4, it shows a schematic structural diagram of an electronic device 400 (e.g., the terminal device or server in FIG. 4) suitable for implementing an embodiment of the present disclosure.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital televisions (TVs) and desktop computers.
  • the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403.
  • in the RAM 403, various programs and data required for the operation of the electronic device 400 are also stored.
  • the processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404.
  • An input/output (I/O) interface 405 is also connected to the bus 404.
  • the following devices can be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 407 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 408 including, for example, a magnetic tape and a hard disk; and a communication device 409.
  • Communication means 409 may allow electronic device 400 to communicate wirelessly or by wire with other devices to exchange data.
  • Although FIG. 4 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 409 , or from the storage device 408 , or from the ROM 402 .
  • the processing device 401 When the computer program is executed by the processing device 401, the above-mentioned functions defined in the speech recognition method of the embodiment of the present disclosure are executed.
  • the electronic device provided by the embodiments of the present disclosure and the speech recognition method provided by the above embodiments belong to the same disclosed concept, and the technical details not described in detail in the embodiments of the present disclosure may refer to the above embodiments.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • the client and the server can communicate using any currently known or future developed network protocol, such as HTTP (Hyper Text Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • the above-mentioned computer-readable medium carries at least one program, and when the above-mentioned at least one program is executed by the electronic device, causes the electronic device to:
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains at least one executable instruction for implementing the specified logical function.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • the units involved in the embodiments of the present disclosure may be implemented in a software manner, and may also be implemented in a hardware manner. Wherein, the names of units and modules do not constitute limitations on the units and modules themselves in certain circumstances.
  • the data generation module may also be described as a "video data generation module”.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing.
  • more specific examples of machine-readable storage media would include an electrical connection based on at least one wire, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • Example 1 provides a speech recognition method, the method includes:
  • speech recognition is performed on the pulled audio data within the communication range.
  • Example 2 provides a speech recognition method, further comprising:
  • acquiring the subtitle text information within the communication range includes:
  • acquiring, at every preset time interval, the subtitle text information within the preset time within the communication range; and/or, each time a second preset value of pieces of subtitle text information is accumulated within the communication range, acquiring the second-preset-value pieces of subtitle text information within the communication range.
  • Example 3 provides a speech recognition method, further comprising:
  • the extracting the keywords of the subtitle text information includes:
  • Example 4 provides a speech recognition method, further comprising:
  • the number of the keywords is at least one,
  • the determining the representation vector of the subtitle text information according to the word vector of the keyword includes:
  • a word vector is loaded for at least one keyword, and an average vector of the loaded word vectors is used as a representation vector of the caption text information.
  • where the word vector model is trained based on the preset corpus.
  • the selecting a target industry representation vector similar to the representation vector of the subtitle text information from the preset industry representation vector includes:
  • Example 7 provides a speech recognition method, further comprising:
  • performing speech recognition on the pulled audio data within the communication range based on the industry-specific thesaurus corresponding to the target industry representation vector includes:
  • speech recognition is performed on the pulled audio data within the communication range.
  • performing speech recognition on the pulled audio data within the communication range based on the industry-specific thesaurus corresponding to the target industry representation vector includes:
  • the industry-specific lexicon corresponding to the newly selected target industry characterization vector is used to update the industry-specific lexicon corresponding to the previously selected target industry characterization vector, including:
  • the industry-specific lexicon corresponding to the newly selected target industry characterization vector is used to replace the industry-specific lexicon corresponding to the previously selected target industry characterization vector.
  • Example 11 provides a speech recognition method, further comprising:
  • performing speech recognition on the pulled audio data within the communication range based on the similar words includes:
  • speech recognition is performed on the pulled audio data within the communication range.
  • the newly selected similar words after deduplication and the previously determined similar words are used as the updated similar words.
  • Example 13 provides a speech recognition method, further comprising:
  • the real-time communication server includes at least one of an instant messaging server, a multimedia conference server, a live video server, and a group chat interaction server.


Abstract

A speech recognition method, including: when a lexicon selection condition is satisfied, acquiring subtitle text information within a communication range, and extracting keywords of the subtitle text information; determining a representation vector of the subtitle text information according to word vectors of the keywords, and selecting, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information; and performing speech recognition on pulled audio data within the communication range based on an industry-specific lexicon corresponding to the target industry representation vector. A speech recognition apparatus, an electronic device for speech recognition, and a storage medium for speech recognition are also provided.

Description

Speech recognition method and apparatus, electronic device, and storage medium

This application claims priority to Chinese Patent Application No. 202010842909.7, filed with the Chinese Patent Office on August 20, 2020, the entire contents of which are incorporated herein by reference.

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, for example, to a speech recognition method and apparatus, an electronic device, and a storage medium.

Background

With the continuous development of the Internet and communication technologies, communicating through communication applications has become one of the important ways for users to exchange information. When clients conduct communication containing audio data, a server can transcribe the audio data into text through automatic speech recognition (ASR) technology and deliver the transcribed text to the corresponding clients, so that the clients display subtitles corresponding to the audio data. The lexicon used by ASR models in the related art is usually a general-purpose lexicon; when speech recognition is performed based on a general-purpose lexicon in a communication scenario containing domain-specific terminology, the recognition accuracy is low and the user experience is poor.
Summary

Embodiments of the present disclosure provide a speech recognition method and apparatus, an electronic device, and a storage medium, which improve the recognition accuracy of audio data and the user experience.

In a first aspect, an embodiment of the present disclosure provides a speech recognition method, including:

in response to a current condition satisfying a lexicon selection condition, acquiring subtitle text information within a communication range, and extracting keywords of the subtitle text information;

determining a representation vector of the subtitle text information according to word vectors of the keywords, and selecting, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information; and

performing speech recognition on pulled audio data within the communication range based on an industry-specific lexicon corresponding to the target industry representation vector.

In a second aspect, an embodiment of the present disclosure further provides a speech recognition apparatus, including:

a keyword extraction module configured to, in response to a current condition satisfying a lexicon selection condition, acquire subtitle text information within a communication range and extract keywords of the subtitle text information;

an industry representation vector selection module configured to determine a representation vector of the subtitle text information according to word vectors of the keywords, and select, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information; and

a speech recognition module configured to perform speech recognition on pulled audio data within the communication range based on an industry-specific lexicon corresponding to the target industry representation vector.

In a third aspect, an embodiment of the present disclosure further provides an electronic device, including:

at least one processor; and

a storage apparatus configured to store at least one program,

where the at least one program, when executed by the at least one processor, causes the at least one processor to implement the speech recognition method according to the first aspect of the present disclosure.

In a fourth aspect, an embodiment of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform the speech recognition method according to the first aspect of the present disclosure.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of a speech recognition method provided in Embodiment One of the present disclosure;

FIG. 2 is a schematic flowchart of a speech recognition method provided in Embodiment Two of the present disclosure;

FIG. 3 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment Three of the present disclosure;

FIG. 4 is a schematic structural diagram of an electronic device provided in Embodiment Four of the present disclosure.

Detailed Description

It should be understood that the steps described in the method implementations of the present disclosure may be performed in different orders and/or in parallel. Furthermore, the method implementations may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.

As used herein, the term "include" and its variants are open-ended, i.e., "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.

It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different apparatuses, modules, or units, and are not used to limit the order of, or interdependence between, the functions performed by these apparatuses, modules, or units.

It should be noted that the modifiers "one" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "at least one".
Embodiment One

FIG. 1 is a schematic flowchart of a speech recognition method provided in Embodiment One of the present disclosure. This embodiment of the present disclosure is applicable to the case of performing speech recognition in a communication scenario containing domain-specific terminology. The method may be performed by a speech recognition apparatus, which may be implemented in the form of software and/or hardware, and the apparatus may be configured in an electronic device, for example, in a back-end server of a communication application.

As shown in FIG. 1, the speech recognition method provided in this embodiment includes:

S110: when a current condition satisfies a lexicon selection condition, acquire subtitle text information within a communication range, and extract keywords of the subtitle text information.

In this embodiment of the present disclosure, a communication range may be regarded as the range formed by the clients participating in a communication; for example, in a multimedia conference scenario, the multiple participating clients of the conference may form one communication range. Different communication ranges may involve different industry fields in their communication content; within the same communication range, the industry field involved may change as the communication proceeds. By acquiring subtitle text information whenever the current condition satisfies the lexicon selection condition during the communication of each communication range, it is helpful, when subsequently selecting a target industry-specific lexicon for each communication range, to dynamically update the target industry-specific lexicon that currently best matches the communication range, thereby improving speech recognition accuracy.

In this embodiment, the clients participating in a communication may upload audio data in real time to a streaming media server of the communication application during the communication; correspondingly, the back-end server of the communication application may pull, from the streaming media server, the audio data uploaded by the clients within each communication range, and transcribe the audio data into subtitle text information based on ASR technology, so as to acquire the subtitle text information of each communication range.

In some optional implementations of the embodiments of the present disclosure, the back-end server of the communication application may also, in response to a subtitle-enabling request sent by a client within a communication range, pull the audio data within the communication range to which the client belongs from the streaming media server, so as to acquire the subtitle text information within that communication range. In these optional implementations, when the back-end server does not receive a subtitle-enabling request from a client within a communication range, it does not need to provide a speech recognition service for the clients of that communication range, which can save resource consumption of the back-end server to a certain extent.

In this embodiment, after the back-end server acquires the subtitle text information within each communication range, it may perform information mining on the subtitle text information of each communication range and extract keywords from the subtitle text information. A keyword may be regarded as a technical term of an industry field contained in the subtitle text information, and may be used to determine the industry field involved in the communication range. The back-end server may extract at least one keyword of the subtitle text information based on a natural language processing (NLP) algorithm.

In some optional implementations of the embodiments of the present disclosure, extracting the keywords of the subtitle text information may include: extracting word information whose part of speech is a preset part of speech from the subtitle text information, filtering preset common words from the extracted word information, and using the filtered word information as the keywords.

Extracting word information whose part of speech is a preset part of speech from the subtitle text information may be: performing part-of-speech tagging on the words in the subtitle text information based on a part-of-speech tagging algorithm among NLP algorithms, and extracting, from the tagging results, the words whose part of speech is the same as the preset part of speech. The part-of-speech tagging algorithm may be, for example, a Hidden Markov Model (HMM) or Conditional Random Fields (CRFs). The preset part of speech may be, for example, a noun or a verb.

The preset common words may be general-purpose words under the preset part of speech. Exemplarily, when the preset part of speech is the noun part of speech, the preset common words may be general-purpose words such as "client" and "colleague". Filtering the preset common words from the extracted word information may be: deleting, from the extracted word information, the word information belonging to the preset common words. The filtered word information may be regarded as technical vocabulary of an industry field.

In these optional implementations, by extracting and filtering the text information of the preset part of speech in the subtitle text information, the extraction of keywords from the subtitle text information can be achieved.
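The POS-filter-then-stopword-filter extraction described above can be sketched as follows. This is a minimal illustrative sketch: the tagged tokens, the preset part-of-speech set, and the common-word list are all hypothetical stand-ins for the output of a real tagger (e.g., an HMM or CRF model), not part of the disclosed implementation.

```python
# Hypothetical (word, POS) pairs a tagger might produce for one subtitle line.
TAGGED = [("client", "noun"), ("gradient", "noun"), ("discussed", "verb"),
          ("overfitting", "noun"), ("colleague", "noun")]

PRESET_POS = {"noun"}                    # preset part of speech to keep
COMMON_WORDS = {"client", "colleague"}   # preset common words to filter out

def extract_keywords(tagged_words):
    """Extract words of the preset POS, then drop preset common words."""
    kept = [word for word, pos in tagged_words if pos in PRESET_POS]
    return [word for word in kept if word not in COMMON_WORDS]

print(extract_keywords(TAGGED))  # ['gradient', 'overfitting']
```

The surviving words are the domain-specific terms that feed the representation-vector step.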
By acquiring the subtitle text information within the communication range and extracting the keywords that represent the fields involved, it is helpful to subsequently determine the representation vector of the subtitle text information according to the word vectors of the keywords, laying a foundation for determining the industry-specific lexicon related to the industry field involved in the communication range.

S120: determine a representation vector of the subtitle text information according to the word vectors of the keywords, and select, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information.

In this embodiment of the present disclosure, when there is one keyword, the word vector of that keyword may be used as the representation vector of the subtitle text information; when there are multiple keywords, the representation vector of the subtitle text information may be determined by combining the word vectors of the multiple keywords. The manner of combining the word vectors of multiple keywords may be, for example, configuring weights for the multiple keywords and combining their word vectors based on a weighting method.

In this embodiment, the preset industry representation vectors may be set in the following manner: screening technical terms of several industry fields; constructing the technical terms of the same industry field into one industry-specific lexicon, to obtain several industry-specific lexicons; performing word-to-vector (Word2vec) model training based on the technical terms in the several industry-specific lexicons, which can output the word vectors of multiple technical terms; and determining the representation vector of the corresponding industry-specific lexicon according to the word vectors of its technical terms, and using the representation vector of the industry-specific lexicon as the industry representation vector. Determining the representation vector of the corresponding industry-specific lexicon according to the word vectors of the technical terms of the same industry field may be, for example, using the average of the word vectors of the technical terms of the same industry field as the representation vector of the corresponding industry-specific lexicon.
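The averaging step above can be sketched as follows. The two-dimensional vectors are toy values standing in for trained Word2vec embeddings; lexicon names and contents are illustrative only.

```python
# Sketch: an industry lexicon's representation vector is the element-wise
# mean of the word vectors of the technical terms it contains.
def industry_representation(term_vectors):
    """Element-wise average of a lexicon's term vectors."""
    dim = len(next(iter(term_vectors.values())))
    n = len(term_vectors)
    return [sum(vec[i] for vec in term_vectors.values()) / n for i in range(dim)]

# Hypothetical lexicon with toy 2-D embeddings.
ml_lexicon = {"gradient": [1.0, 0.0], "overfitting": [0.0, 1.0]}
print(industry_representation(ml_lexicon))  # [0.5, 0.5]
```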
In some optional implementations of the embodiments of the present disclosure, determining the representation vector of the subtitle text information according to the word vectors of the keywords includes: loading word vectors for at least one keyword, and using the average vector of the loaded word vectors as the representation vector of the subtitle text information.

Loading word vectors for at least one keyword includes: determining whether a preset corpus contains the at least one keyword; in response to the preset corpus containing all of the at least one keyword, reading the word vectors of the at least one keyword from a preset word vector library according to a preconfigured corpus-to-word-vector-library correspondence; in response to the preset corpus not containing all of the at least one keyword, converting the at least one keyword into word vectors using a pre-trained word vector model; and in response to the preset corpus containing some of the at least one keyword, reading the word vectors of those keywords from the preset word vector library according to the preconfigured corpus-to-word-vector-library correspondence, and converting the keywords among the at least one keyword that are not contained in the preset corpus into word vectors using the pre-trained word vector model; where the word vector model is trained based on the preset corpus.

The preset corpus may be, for example, the several industry-specific lexicons constructed in the setting process of the preset industry representation vectors. The word vector library may be constructed, for example, in the following manner: in the setting process of the preset industry representation vectors, constructing the word vectors of the technical terms in the same industry-specific lexicon into one word vector library, to obtain several preset word vector libraries. The corpus-to-word-vector-library correspondence may be configured, for example, in the following manner: in the setting process of the preset industry representation vectors, configuring the correspondence between each technical term and the word vector of each technical term. The pre-trained word vector model may be, for example, the Word2vec model output after the Word2vec model training in the setting process of the preset industry representation vectors.

In these optional implementations, by setting the preset corpus, the preconfigured corpus-to-word-vector-library correspondence, and the pre-trained word vector model, when a keyword belongs to the preset corpus, its word vector can be read directly from the word vector library according to the corpus-to-word-vector-library correspondence, thereby improving the efficiency of obtaining the word vectors of keywords; when a keyword does not belong to the preset corpus, the keyword is converted into a word vector according to the word vector model, thereby ensuring that the word vector is generated smoothly.
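The library-lookup-with-model-fallback loading and the averaging can be sketched as follows. `VECTOR_LIBRARY` and `word2vec_convert` are hypothetical stand-ins for the preset word vector library and the trained Word2vec model.

```python
# Hypothetical preset word vector library (term -> toy 2-D vector).
VECTOR_LIBRARY = {"gradient": [1.0, 0.0], "overfitting": [0.0, 1.0]}

def word2vec_convert(word):
    """Stand-in for converting an out-of-corpus keyword with the trained model."""
    return [0.5, 0.5]  # illustrative fixed vector

def load_word_vectors(keywords):
    """Read from the library when present, otherwise fall back to the model."""
    return [VECTOR_LIBRARY[w] if w in VECTOR_LIBRARY else word2vec_convert(w)
            for w in keywords]

def subtitle_representation(keywords):
    """Average the loaded keyword vectors into the subtitle representation vector."""
    vectors = load_word_vectors(keywords)
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

print(subtitle_representation(["gradient", "overfitting"]))  # [0.5, 0.5]
```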
In some optional implementations of the embodiments of the present disclosure, selecting, from the preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information includes: separately calculating the similarity between the representation vector of the subtitle text information and each candidate preset industry representation vector, and using the preset industry representation vector corresponding to the maximum similarity as the target industry representation vector.

Separately calculating the similarity between the representation vector of the subtitle text information and each candidate preset industry representation vector may be, for example, separately calculating the cosine similarity, Euclidean distance, or Manhattan distance between them. In these optional implementations, by using the preset industry representation vector corresponding to the maximum similarity as the target industry representation vector, the industry-specific lexicon that best matches the industry field involved in the communication range can be determined, so as to improve the recognition accuracy of the audio data within the communication range.
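The maximum-similarity selection can be sketched with cosine similarity as the measure. The industry names and toy vectors are illustrative, not from the disclosure.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def select_target_industry(subtitle_vec, industry_vecs):
    """Return the industry whose preset representation vector is most similar."""
    return max(industry_vecs, key=lambda name: cosine(subtitle_vec, industry_vecs[name]))

# Hypothetical preset industry representation vectors.
industries = {"machine_learning": [1.0, 0.1], "finance": [0.1, 1.0]}
print(select_target_industry([0.9, 0.2], industries))  # machine_learning
```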
S130: perform speech recognition on the pulled audio data within the communication range based on the industry-specific lexicon corresponding to the target industry representation vector.

In this embodiment of the present disclosure, the back-end server may store the several constructed industry-specific lexicons when presetting the industry representation vectors, and may, when determining the target industry representation vector of each communication range, make its selection against the stored industry-specific lexicons. For each communication range, the back-end server may also load the industry-specific lexicon corresponding to the selected target industry representation vector into the storage space corresponding to the communication range, so that speech recognition is performed on the audio data within the communication range according to the industry-specific lexicon in the storage space corresponding to that communication range, which avoids misuse of industry-specific lexicons between communication ranges during speech recognition.

Performing speech recognition on the pulled audio data within the communication range based on the industry-specific lexicon corresponding to the target industry representation vector may be, for example: first, combining the industry-specific lexicon with the original lexicon used for speech recognition to serve as the dictionary in the speech recognition process; and then, outputting text for the audio data according to a pre-trained acoustic model, the dictionary, and a language model, to achieve speech recognition.

By performing speech recognition on the pulled audio data within the communication range based on the industry-specific lexicon corresponding to the target industry representation vector, it is possible, in a communication scenario containing technical terminology of a specific field, to select the industry-specific lexicon matching the terminology of that field for speech recognition, thereby improving speech recognition accuracy and user experience.

In some optional implementations of the embodiments of the present disclosure, performing speech recognition on the pulled audio data within the communication range based on the industry-specific lexicon corresponding to the target industry representation vector includes: sending the industry-specific lexicon corresponding to the target industry representation vector to a speech recognition engine; and sending the pulled audio data within the communication range to the speech recognition engine, so that the speech recognition engine configures the industry-specific lexicon as hot words of the communication range and performs speech recognition on the audio data within the communication range according to the hot words of the communication range.

The back-end server usually uses a speech recognition engine (which may be referred to as an ASR engine) to perform speech recognition for multiple communication ranges. For each communication range, the back-end server may send the industry-specific lexicon in the storage space corresponding to the communication range to the ASR engine, and may also send the pulled audio data within the communication range to the ASR engine. The ASR engine may configure the technical terms in the industry-specific lexicon as hot words of the communication range, and the configured hot words may take effect in real time.

Performing speech recognition on the audio data within the communication range according to the hot words of the communication range may be, for example: first, using a pre-trained acoustic model to output the phoneme information of the audio data within the communication range (for example, Chinese pinyin or English phonetic symbols); then, looking up words matching the phoneme information from the original lexicon and the configured hot words; and finally, inputting the found matching words into a pre-trained language model, so that the language model calculates the probabilities of the associations among the input matching words (where the probability of words set as hot words may be increased accordingly), and outputs the associated words with the highest probability as the speech recognition result.
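The hot-word probability boost described above can be illustrated with a highly simplified one-word decoding step. The candidate scores and boost factor are toy values; a real engine rescores full hypotheses through the language model rather than single words.

```python
def decode_with_hot_words(candidates, hot_words, boost=1.5):
    """Pick the candidate word with the highest (possibly boosted) score."""
    scored = {word: score * (boost if word in hot_words else 1.0)
              for word, score in candidates.items()}
    return max(scored, key=scored.get)

# Hypothetical words matching the same phonemes: without the hot word,
# the generic word wins; with it, the domain term wins.
candidates = {"grate": 0.5, "gradient": 0.4}
print(decode_with_hot_words(candidates, hot_words=set()))         # grate
print(decode_with_hot_words(candidates, hot_words={"gradient"}))  # gradient
```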
In these optional implementations, by configuring the words in the industry-specific lexicon of the industry field involved in the communication range as hot words, it is advantageous to recognize, in the audio data, the hot words that are technical terms of the industry field, which can effectively improve the accuracy of speech recognition for the communication range.

In some optional implementations of the embodiments of the present disclosure, the speech recognition method is applied to a real-time communication server, and the real-time communication server includes at least one of an instant messaging server, a multimedia conference server, a live video server, and a group chat interaction server.

In these optional implementations, the back-end server may be a real-time communication server, for example, an instant messaging server, a multimedia conference server, a live video server, or a group chat interaction server. While providing communication services to clients, the real-time communication server may also provide a subtitle-enabling service for each communication client. When a client within a communication range requests to enable the subtitle service and the communication range involves technical terms of certain industry fields, the real-time communication server may perform speech recognition on the audio data within the communication range to which the client belongs based on the speech recognition method provided in this embodiment, thereby improving speech recognition accuracy and user experience.

In the technical solution of this embodiment of the present disclosure, when the current condition satisfies the lexicon selection condition, subtitle text information within the communication range is acquired and its keywords are extracted; the representation vector of the subtitle text information is determined according to the word vectors of the keywords, and a target industry representation vector similar to the representation vector of the subtitle text information is selected from the preset industry representation vectors; and speech recognition is performed on the pulled audio data within the communication range based on the industry-specific lexicon corresponding to the target industry representation vector. In a communication scenario containing technical terminology of a specific field, the industry-specific lexicon matching that terminology can be selected for speech recognition, thereby improving speech recognition accuracy and user experience.
Embodiment Two

This embodiment of the present disclosure may be combined with the optional solutions in the speech recognition method provided in the above embodiment. The speech recognition method provided in this embodiment can perform speech recognition on the audio data of the communication range based on words similar to the keywords, enriching the recognition solutions; it can also update the industry-specific lexicon and/or the similar words used for speech recognition, so that speech recognition is performed according to the updated industry-specific lexicon and/or similar words, improving speech recognition accuracy.

FIG. 2 is a schematic flowchart of a speech recognition method provided in Embodiment Two of the present disclosure. As shown in FIG. 2, the speech recognition method provided in this embodiment includes:

S210: when the number of pieces of subtitle text information accumulated within the communication range reaches a first preset value for the first time, acquire the first-preset-value pieces of subtitle text information within the communication range, and extract keywords of the subtitle text information.

In this embodiment of the present disclosure, the moment the first client joins a communication range may serve as the communication start moment of that communication range. For each communication range, the back-end server counts the subtitle text information within the communication range from the communication start moment. When the number of pieces of subtitle text information accumulated within the communication range reaches the first preset value for the first time, the accumulated first-preset-value pieces of subtitle text information may be acquired for the first time, and the keywords of the subtitle text information may be extracted.

The first preset value may be preset according to empirical or experimental values, and may be, for example, 30 or 50 pieces. The first-preset-value pieces of subtitle text information within the communication range may be regarded as the subtitle text information obtained by performing speech recognition on the audio data based on the original lexicon before the ASR engine configures hot words for the communication range.

S220: determine a representation vector of the subtitle text information according to the word vectors of the keywords.

S231: select, from the preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information.

S241: perform speech recognition on the pulled audio data within the communication range based on the industry-specific lexicon corresponding to the target industry representation vector.

S232: select, from the preset corpus, word vectors similar to the word vectors of the keywords, and use the words corresponding to the similar word vectors as similar words of the keywords.

The preset corpus in this embodiment is the same corpus as that disclosed in the above embodiment; for example, both may be the several industry-specific lexicons constructed in the setting process of the preset industry representation vectors. For each keyword, the back-end server may calculate the similarity (for example, cosine similarity) between the word vector of the keyword and the word vectors of the words in the multiple industry-specific lexicons, select a preset number of similarities from those greater than a preset threshold, and use the words corresponding to the selected similarities as the similar words of the keyword.

The higher the preset threshold, the more relevant the selected similar words are to the keyword; the preset threshold may be set according to empirical or experimental values, and may be, for example, 0.7, 0.8, or 0.9. The preset number may also be set according to empirical or experimental values, and may be, for example, 1 or 2. When the number of similarities greater than the preset threshold is less than the preset number, all the words corresponding to similarities greater than the preset threshold may be selected as the similar words of the keyword.
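The threshold-plus-preset-count selection of similar words can be sketched as follows. The corpus vectors are toy stand-ins for Word2vec embeddings of lexicon terms; threshold and count defaults mirror the example values above.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similar_words(keyword_vec, corpus_vectors, threshold=0.7, preset_count=2):
    """Keep corpus words above the similarity threshold, best `preset_count` first."""
    scored = [(w, cosine(keyword_vec, v)) for w, v in corpus_vectors.items()]
    above = sorted((ws for ws in scored if ws[1] > threshold),
                   key=lambda ws: ws[1], reverse=True)
    return [w for w, _ in above[:preset_count]]

# Hypothetical corpus term vectors.
corpus = {"sgd": [1.0, 0.0], "bond": [0.0, 1.0], "adam": [0.9, 0.3]}
print(similar_words([1.0, 0.1], corpus))  # ['sgd', 'adam']
```

If fewer words clear the threshold than the preset count, all of them are returned, matching the fallback described above.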
S242: perform speech recognition on the pulled audio data within the communication range based on the similar words.

In this embodiment, for each communication range, the back-end server may also store the similar words in the storage space corresponding to the communication range, send the similar words in that storage space to the ASR engine, and send the pulled audio data within the communication range to the ASR engine. The ASR engine may configure the similar words as hot words of the communication range, so that the ASR engine can perform speech recognition for the communication range according to the similar words, improving speech recognition accuracy.

In this embodiment of the present disclosure, steps S231-S241 and steps S232-S242 may both be executed, or either branch may be executed alone (that is, only steps S231-S241 are executed without steps S232-S242, or only steps S232-S242 are executed without steps S231-S241). When steps S231-S241 and steps S232-S242 are both executed, there is no strict timing requirement on their execution order, and the ASR engine may configure both the words in the industry-specific lexicon and the similar words as hot words of the communication range. Performing speech recognition on the audio data within the communication range according to richer hot words can improve the recognition accuracy of speech recognition for the communication range.

S250: starting from the first determination of the industry-specific lexicon, acquire, at every preset time interval, the subtitle text information within the preset time within the communication range; and/or, each time a second preset value of pieces of subtitle text information is accumulated within the communication range, acquire the second-preset-value pieces of subtitle text information within the communication range.

In this embodiment, if speech recognition is performed based on the industry-specific lexicon, or based on the industry-specific lexicon and the similar words, subtitle text information may be acquired cyclically starting from the first determination of the industry-specific lexicon until the communication ends. In addition, if speech recognition is performed based only on the similar words, subtitle text information may be acquired cyclically starting from the first determination of the similar words until the communication ends.

The cyclically acquired subtitle text information may be regarded as the subtitle text information obtained by performing speech recognition on the audio data based on the original lexicon and the hot words after the ASR engine configures hot words for the communication range. The preset time may be preset according to empirical or experimental values, and may be, for example, 3 minutes or 5 minutes; the second preset value may also be preset according to empirical or experimental values, and may be the same as or different from the first preset value.

By cyclically acquiring subtitle text information during the communication of each communication range, the back-end server can dynamically update the target industry-specific lexicon and/or the similar words, so that the audio data is recognized according to the technical terms of the industry field currently involved in the communication range, thereby improving speech recognition accuracy.
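The triggering scheme described above can be sketched as a small state machine: fire once when the first batch of subtitles accumulates, then fire again on every further batch and/or every preset time interval. All threshold values here are illustrative, not prescribed by the disclosure.

```python
class LexiconSelectionTrigger:
    """Fires the lexicon-selection condition per the accumulation/interval rules."""

    def __init__(self, first_count=30, batch_count=50, interval_s=180.0):
        self.first_count = first_count    # first trigger: N accumulated subtitles
        self.batch_count = batch_count    # later triggers: every M subtitles...
        self.interval_s = interval_s      # ...and/or every preset time interval
        self.accumulated = 0
        self.first_fired = False
        self.last_fired_at = None

    def on_subtitle(self, now_s):
        """Called for each subtitle line; returns True when selection should run."""
        self.accumulated += 1
        if not self.first_fired:
            if self.accumulated >= self.first_count:
                self.first_fired = True
                self._reset(now_s)
                return True
            return False
        if (self.accumulated >= self.batch_count
                or now_s - self.last_fired_at >= self.interval_s):
            self._reset(now_s)
            return True
        return False

    def _reset(self, now_s):
        self.accumulated = 0
        self.last_fired_at = now_s
```

With small toy thresholds, the first fire happens at the third subtitle, the next after two more, and a long quiet gap also fires on the next subtitle via the time rule.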
S261: when the current condition satisfies the lexicon selection condition, select the latest target industry representation vector.

In this embodiment of the present disclosure, both "the number of pieces of subtitle text information accumulated within the communication range reaching the first preset value for the first time" and "starting from the first determination of the industry-specific lexicon (or from the first determination of the similar words), every preset time interval, and/or each time a second preset value of pieces of subtitle text information is accumulated within the communication range" may be regarded as the current condition satisfying the lexicon selection condition. Each time the current condition satisfies the lexicon selection condition, the back-end server may acquire subtitle text information and repeat steps S220 and S231 to select the latest target industry representation vector.

S271: update the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the newly selected target industry representation vector.

In this embodiment of the present disclosure, the corresponding industry-specific lexicon may be updated by overwriting or incrementally. By updating the industry-specific lexicon, the audio data can be recognized according to the industry-specific lexicon that currently best matches the communication range, thereby improving speech recognition accuracy.

In some optional implementations of the embodiments of the present disclosure, updating the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the newly selected target industry representation vector may include: determining whether the newly selected target industry representation vector is the same as the previously selected target industry representation vector; and in response to the newly selected target industry representation vector being different from the previously selected target industry representation vector, replacing the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the newly selected target industry representation vector. In these optional implementations, updating the industry-specific lexicon by overwriting can, to a certain extent, save the storage space overhead corresponding to the communication range.
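The replace-only-when-changed update can be sketched as follows. The industry names and lexicon contents are illustrative stand-ins.

```python
# Hypothetical stored industry-specific lexicons.
LEXICONS = {"machine_learning": {"gradient", "overfitting"},
            "finance": {"bond", "equity"}}

def update_lexicon(prev_industry, loaded_lexicon, new_industry):
    """Overwrite-style update: reload only when the newly selected target differs."""
    if new_industry == prev_industry:
        return prev_industry, loaded_lexicon        # same vector: keep as-is
    return new_industry, LEXICONS[new_industry]     # different: replace

state = ("machine_learning", LEXICONS["machine_learning"])
state = update_lexicon(*state, "machine_learning")  # unchanged, no reload
state = update_lexicon(*state, "finance")           # replaced
print(state[0])  # finance
```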
S281: perform speech recognition on the pulled audio data within the communication range based on the updated industry-specific lexicon.

S262: when the current condition satisfies the lexicon selection condition, select the latest similar words.

In this embodiment, both "the number of pieces of subtitle text information accumulated within the communication range reaching the first preset value for the first time" and "starting from the first determination of the industry-specific lexicon (or from the first determination of the similar words), every preset time interval, and/or each time a second preset value of pieces of subtitle text information is accumulated within the communication range" may be regarded as the current condition satisfying the lexicon selection condition. Each time the current condition satisfies the lexicon selection condition, the back-end server may acquire subtitle text information and repeat steps S220 and S232 to select the latest similar words.

S272: update the previously determined similar words with the newly selected similar words.

In this embodiment of the present disclosure, the corresponding similar words may be updated by overwriting or incrementally. By updating the similar words, the audio data can be recognized according to the similar words that currently best match the communication range, thereby improving speech recognition accuracy.

In some implementations of the embodiments of the present disclosure, updating the previously determined similar words with the newly selected similar words includes: de-duplicating the words in the newly selected similar words that are the same as those in the previously determined similar words; and using the de-duplicated newly selected similar words together with the previously determined similar words as the updated similar words. In these optional implementations, since the similar words are few in number and occupy little space, incrementally updating the similar words can enrich the hot words and improve speech recognition accuracy to a certain extent.
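The incremental, de-duplicating merge of similar words can be sketched as follows; the word lists are illustrative.

```python
def update_similar_words(previous, latest):
    """De-duplicate the newly selected words against the previous set, then append."""
    seen = set(previous)
    deduped_latest = [w for w in latest if w not in seen]
    return previous + deduped_latest

print(update_similar_words(["sgd", "adam"], ["adam", "rmsprop"]))
# ['sgd', 'adam', 'rmsprop']
```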
S282: perform speech recognition on the pulled audio data within the communication range based on the updated similar words.

In this embodiment of the present disclosure, if steps S231-S241 and steps S232-S242 are both executed, then steps S261-S281 and steps S262-S282 are also both executed, and there is no strict timing requirement on the execution order of steps S261-S281 and steps S262-S282. If only steps S231-S241 are executed, then correspondingly only steps S261-S281 are executed. If only steps S232-S242 are executed, then correspondingly only steps S262-S282 are executed.

The technical solution of this embodiment of the present disclosure can perform speech recognition on the audio data of the communication range based on words similar to the keywords, enriching the recognition solutions; it can also update the industry-specific lexicon and/or the similar words used for speech recognition, so that speech recognition is performed according to the updated industry-specific lexicon and/or the updated similar words, improving speech recognition accuracy. In addition, the speech recognition method provided in this embodiment of the present disclosure and the speech recognition method provided in the above embodiment belong to the same technical concept, and for technical details not described in detail in this embodiment, reference may be made to the above embodiment.
Embodiment Three

FIG. 3 is a schematic structural diagram of a speech recognition apparatus provided in Embodiment Three of the present disclosure. The speech recognition apparatus provided in this embodiment is applicable to the case of performing speech recognition in a communication scenario containing domain-specific terminology.

As shown in FIG. 3, the speech recognition apparatus includes:

a keyword extraction module 310, configured to, in response to a current condition satisfying a lexicon selection condition, acquire subtitle text information within a communication range and extract keywords of the subtitle text information;

an industry representation vector selection module 320, configured to determine a representation vector of the subtitle text information according to word vectors of the keywords, and select, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information; and

a speech recognition module 330, configured to perform speech recognition on pulled audio data within the communication range based on an industry-specific lexicon corresponding to the target industry representation vector.

In some optional implementations of the embodiments of the present disclosure, the keyword extraction module includes:

a subtitle acquisition sub-module configured to, when the number of pieces of subtitle text information accumulated within the communication range reaches a first preset value for the first time, acquire the first-preset-value pieces of subtitle text information within the communication range; or, the subtitle acquisition sub-module is configured for at least one of the following: acquiring, at every preset time interval starting from the first determination of the industry-specific lexicon, the subtitle text information within the preset time within the communication range; and acquiring, each time a second preset value of pieces of subtitle text information is accumulated within the communication range, the second-preset-value pieces of subtitle text information within the communication range.

In some optional implementations of the embodiments of the present disclosure, the keyword extraction module includes:

a keyword extraction sub-module configured to extract word information whose part of speech is a preset part of speech from the subtitle text information, filter preset common words from the extracted word information, and use the filtered word information as the keywords.

In some optional implementations of the embodiments of the present disclosure, the number of the keywords is at least one, and the industry representation vector selection module includes:

a subtitle representation vector determination sub-module configured to load word vectors for the at least one keyword, and use the average vector of the loaded word vectors as the representation vector of the subtitle text information.

In some optional implementations of the embodiments of the present disclosure, the subtitle representation vector determination sub-module includes:

a loading unit configured to determine whether a preset corpus contains the at least one keyword; in response to the preset corpus containing all of the at least one keyword, read the word vectors of the at least one keyword from a preset word vector library according to a preconfigured corpus-to-word-vector-library correspondence; in response to the preset corpus not containing all of the at least one keyword, convert the at least one keyword into word vectors using a pre-trained word vector model; and in response to the preset corpus containing some of the at least one keyword, read the word vectors of those keywords from the preset word vector library according to the preconfigured corpus-to-word-vector-library correspondence, and convert the keywords among the at least one keyword that are not contained in the preset corpus into word vectors using the pre-trained word vector model; where the word vector model is trained based on the preset corpus.

In some optional implementations of the embodiments of the present disclosure, the industry representation vector selection module includes:

an industry representation vector selection sub-module configured to separately calculate the similarity between the representation vector of the subtitle text information and each candidate preset industry representation vector, and use the preset industry representation vector corresponding to the maximum similarity as the target industry representation vector.

In some optional implementations of the embodiments of the present disclosure, the speech recognition module includes:

a lexicon sending sub-module configured to send the industry-specific lexicon corresponding to the target industry representation vector to a speech recognition engine; and

an audio data sending sub-module configured to send the pulled audio data within the communication range to the speech recognition engine, so that the speech recognition engine configures the industry-specific lexicon as hot words of the communication range and performs speech recognition on the audio data within the communication range according to the hot words of the communication range.

In some optional implementations of the embodiments of the present disclosure, the speech recognition apparatus further includes:

a similar word selection module configured to select, from the preset corpus, word vectors similar to the word vectors of the keywords, and use the words corresponding to the similar word vectors as similar words of the keywords;

correspondingly, the speech recognition module is further configured to perform speech recognition on the pulled audio data within the communication range based on the similar words.

In some optional implementations of the embodiments of the present disclosure, the industry representation vector selection module is further configured to select the latest target industry representation vector each time the current condition satisfies the lexicon selection condition;

a lexicon updating module is configured to update the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the newly selected target industry representation vector;

correspondingly, the speech recognition module is configured to perform speech recognition on the pulled audio data within the communication range based on the updated industry-specific lexicon.

In some implementations of the embodiments of the present disclosure, the lexicon updating module is configured to: determine whether the newly selected target industry representation vector is the same as the previously selected target industry representation vector; and in response to the newly selected target industry representation vector being different from the previously selected target industry representation vector, replace the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the newly selected target industry representation vector.

In some optional implementations of the embodiments of the present disclosure, the similar word selection module is configured to select the latest similar words each time the current condition satisfies the lexicon selection condition;

a similar word updating module is configured to update the previously determined similar words with the newly selected similar words;

correspondingly, the speech recognition module is configured to perform speech recognition on the pulled audio data within the communication range based on the updated similar words.

In some implementations of the embodiments of the present disclosure, the similar word updating module is configured to de-duplicate the words in the newly selected similar words that are the same as those in the previously determined similar words, and to use the de-duplicated newly selected similar words together with the previously determined similar words as the updated similar words.

In some optional implementations of the embodiments of the present disclosure, the speech recognition apparatus is applied to a real-time communication server, and the real-time communication server includes at least one of an instant messaging server, a multimedia conference server, a live video server, and a group chat interaction server.

The speech recognition apparatus provided in this embodiment of the present disclosure can execute the speech recognition method provided in any embodiment of the present disclosure, and has functional modules corresponding to the executed method.

It is worth noting that the units and modules included in the above apparatus are divided only according to functional logic, but the division is not limited to the above, as long as the corresponding functions can be achieved; in addition, the specific names of the functional units are only for the convenience of distinguishing them from each other, and are not used to limit the protection scope of the embodiments of the present disclosure.
Embodiment Four

Referring now to FIG. 4, it shows a schematic structural diagram of an electronic device 400 (e.g., the terminal device or server in FIG. 4) suitable for implementing the embodiments of the present disclosure. Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, personal digital assistants (PDAs), tablet computers (PADs), portable multimedia players (PMPs), and in-vehicle terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital televisions (TVs) and desktop computers. The electronic device shown in FIG. 4 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.

As shown in FIG. 4, the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 401, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage device 408 into a random access memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the electronic device 400 are also stored. The processing device 401, the ROM 402, and the RAM 403 are connected to each other through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.

Generally, the following devices can be connected to the I/O interface 405: input devices 406 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 407 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 408 including, for example, a magnetic tape and a hard disk; and a communication device 409. The communication device 409 may allow the electronic device 400 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 4 shows the electronic device 400 having various devices, it should be understood that not all of the illustrated devices are required to be implemented or provided; more or fewer devices may alternatively be implemented or provided.

In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods illustrated in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication device 409, or installed from the storage device 408, or installed from the ROM 402. When the computer program is executed by the processing device 401, the above-mentioned functions defined in the speech recognition method of the embodiments of the present disclosure are performed.

The electronic device provided by the embodiments of the present disclosure belongs to the same disclosed concept as the speech recognition method provided by the above embodiments, and for technical details not described in detail in the embodiments of the present disclosure, reference may be made to the above embodiments.
Embodiment Five

An embodiment of the present disclosure provides a computer storage medium on which a computer program is stored; when the program is executed by a processor, the speech recognition method provided in the above embodiments is implemented.

It should be noted that the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with at least one wire, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, optical fiber, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: a wire, an optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.

In some implementations, the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (Hyper Text Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed networks.

The above computer-readable medium may be contained in the above electronic device, or may exist alone without being assembled into the electronic device.

The above computer-readable medium carries at least one program which, when executed by the electronic device, causes the electronic device to:

when a current condition satisfies a lexicon selection condition, acquire subtitle text information within a communication range and extract keywords of the subtitle text information; determine a representation vector of the subtitle text information according to word vectors of the keywords, and select, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the subtitle text information; and perform speech recognition on pulled audio data within the communication range based on an industry-specific lexicon corresponding to the target industry representation vector.

Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).

The flowcharts and block diagrams in the accompanying drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, program segment, or portion of code that contains at least one executable instruction for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

The units described in the embodiments of the present disclosure may be implemented in software, or may be implemented in hardware. The names of the units and modules do not constitute a limitation on the units and modules themselves in certain circumstances; for example, a data generation module may also be described as a "video data generation module".

The functions described herein above may be performed, at least in part, by at least one hardware logic component. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard parts (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media would include an electrical connection based on at least one wire, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
According to at least one embodiment of the present disclosure, [Example 1] provides a speech recognition method, including:
when a current condition satisfies a lexicon selection condition, obtaining caption text information within a communication scope, and extracting keywords from the caption text information;
determining a representation vector of the caption text information according to word vectors of the keywords, and selecting, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the caption text information;
performing speech recognition on pulled audio data within the communication scope based on an industry-specific lexicon corresponding to the target industry representation vector.
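The three steps of Example 1 can be sketched end to end as follows. This is a minimal illustration only: the embedding table, industry vectors, and industry lexicons are hypothetical toy stand-ins for the preset word-vector store, the preset industry representation vectors, and the industry-specific lexicons, and cosine similarity is assumed as the similarity measure.

```python
import math

# Hypothetical stand-ins for the preset word-vector store,
# industry representation vectors, and industry-specific lexicons.
EMBEDDINGS = {"model": [1.0, 0.0], "training": [0.9, 0.1], "court": [0.0, 1.0]}
INDUSTRY_VECTORS = {"ai": [1.0, 0.0], "legal": [0.0, 1.0]}
INDUSTRY_LEXICONS = {"ai": ["overfitting", "epoch"], "legal": ["plaintiff"]}

def cosine(a, b):
    # Cosine similarity between two vectors (assumed similarity measure).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def select_industry_lexicon(keywords):
    # Average the keyword vectors to get the caption representation vector,
    # then pick the preset industry vector with the greatest similarity
    # and return its industry-specific lexicon.
    vecs = [EMBEDDINGS[w] for w in keywords if w in EMBEDDINGS]
    if not vecs:
        return None
    rep = [sum(col) / len(vecs) for col in zip(*vecs)]
    best = max(INDUSTRY_VECTORS, key=lambda k: cosine(rep, INDUSTRY_VECTORS[k]))
    return INDUSTRY_LEXICONS[best]

print(select_industry_lexicon(["model", "training"]))  # → ['overfitting', 'epoch']
```

The selected lexicon would then be handed to the recognition engine as in Example 7 below; the hard-coded two-dimensional vectors merely keep the arithmetic inspectable.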
According to at least one embodiment of the present disclosure, [Example 2] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, obtaining caption text information within a communication scope when a current condition satisfies a lexicon selection condition includes:
when the number of caption text information items accumulated within the communication scope reaches a first preset value for the first time, obtaining that first-preset-value number of caption text information items within the communication scope; or,
starting from the first determination of the industry-specific lexicon, obtaining, at every preset time interval, the caption text information produced within the communication scope during that interval; and/or, each time a second preset value of caption text information items is accumulated within the communication scope, obtaining that second-preset-value number of caption text information items within the communication scope.
According to at least one embodiment of the present disclosure, [Example 3] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, extracting the keywords from the caption text information includes:
extracting, from the caption text information, word information whose part of speech is a preset part of speech, filtering preset common words out of the extracted word information, and using the filtered word information as the keywords.
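The extraction step of Example 3 (keep words of a preset part of speech, then drop preset common words) can be sketched as follows. The lookup-table tagger and both preset word sets are hypothetical stand-ins for a real part-of-speech tagger and the preset filters.

```python
# Hypothetical POS lookup table standing in for a real tagger,
# plus assumed preset part-of-speech and common-word filters.
POS_TAGS = {"meeting": "n", "discuss": "v", "model": "n", "the": "x", "loss": "n"}
PRESET_POS = {"n"}            # keep nouns only (an assumed preset)
COMMON_WORDS = {"meeting"}    # assumed preset common-word list

def extract_keywords(tokens):
    # Step 1: keep tokens whose part of speech is in the preset set.
    kept = [t for t in tokens if POS_TAGS.get(t) in PRESET_POS]
    # Step 2: filter out preset common words; what remains are keywords.
    return [t for t in kept if t not in COMMON_WORDS]

print(extract_keywords(["the", "meeting", "discuss", "model", "loss"]))
# → ['model', 'loss']
```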
According to at least one embodiment of the present disclosure, [Example 4] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, the number of keywords is at least one, and determining the representation vector of the caption text information according to the word vectors of the keywords includes:
loading word vectors for the at least one keyword, and using the average vector of the loaded word vectors as the representation vector of the caption text information.
According to at least one embodiment of the present disclosure, [Example 5] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, loading the word vectors for the at least one keyword includes:
determining whether a preset corpus contains the at least one keyword;
in response to the preset corpus containing all of the at least one keyword, reading the word vectors of the at least one keyword from a preset word vector store according to a preconfigured corpus-to-word-vector-store correspondence;
in response to the preset corpus containing none of the at least one keyword, converting the at least one keyword into word vectors using a pre-trained word vector model;
in response to the preset corpus containing some of the at least one keyword, reading the word vectors of those keywords from the preset word vector store according to the preconfigured corpus-to-word-vector-store correspondence, and converting the keywords among the at least one keyword that are not contained in the preset corpus into word vectors using the pre-trained word vector model; where the word vector model is trained on the preset corpus.
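The three branches of Example 5 reduce to a per-keyword decision, sketched here with stub stores. The corpus, the vector store, and the `model_encode` function are all hypothetical stand-ins for the preset corpus, the paired word-vector store, and the pre-trained word vector model.

```python
# Hypothetical preset corpus and its paired word-vector store.
CORPUS = {"model", "training"}
VECTOR_STORE = {"model": [1.0, 0.0], "training": [0.9, 0.1]}

def model_encode(word):
    # Stand-in for a word-vector model trained on the preset corpus;
    # a real model would produce a learned embedding here.
    return [float(len(word)), 0.5]

def load_vectors(keywords):
    # Keywords found in the corpus are read from the paired store;
    # the rest fall back to the model. Covers all three branches
    # (all in corpus, none in corpus, some in corpus) uniformly.
    vectors = {}
    for w in keywords:
        if w in CORPUS:
            vectors[w] = VECTOR_STORE[w]
        else:
            vectors[w] = model_encode(w)
    return vectors

vecs = load_vectors(["model", "epoch"])
print(vecs["model"], vecs["epoch"])  # → [1.0, 0.0] [5.0, 0.5]
```

Reading from the store where possible avoids re-encoding known words; the model handles only out-of-corpus keywords.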
According to at least one embodiment of the present disclosure, [Example 6] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, selecting, from the preset industry representation vectors, the target industry representation vector similar to the representation vector of the caption text information includes:
calculating the similarity between the representation vector of the caption text information and each candidate preset industry representation vector, and using the preset industry representation vector with the greatest similarity as the target industry representation vector.
According to at least one embodiment of the present disclosure, [Example 7] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, performing speech recognition on the pulled audio data within the communication scope based on the industry-specific lexicon corresponding to the target industry representation vector includes:
sending the industry-specific lexicon corresponding to the target industry representation vector to a speech recognition engine;
sending the pulled audio data within the communication scope to the speech recognition engine, so that the speech recognition engine configures the industry-specific lexicon as hot words for the communication scope and performs speech recognition on the audio data within the communication scope according to those hot words.
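The hand-off in Example 7 can be sketched with a stub engine. `RecognitionEngine` and its methods are hypothetical, not a real ASR API; the upper-casing in `recognize` merely marks where a real decoder would boost the configured hot words during decoding.

```python
class RecognitionEngine:
    """Stub speech recognition engine keyed by communication scope."""

    def __init__(self):
        self.hot_words = {}

    def set_hot_words(self, scope, lexicon):
        # Configure the industry-specific lexicon as hot words
        # for one communication scope.
        self.hot_words[scope] = set(lexicon)

    def recognize(self, scope, audio_tokens):
        # Stand-in for decoding: a real engine would bias its search
        # toward the hot words; here they are merely marked.
        hot = self.hot_words.get(scope, set())
        return [t.upper() if t in hot else t for t in audio_tokens]

engine = RecognitionEngine()
engine.set_hot_words("room-1", ["epoch"])
print(engine.recognize("room-1", ["the", "epoch", "ends"]))
# → ['the', 'EPOCH', 'ends']
```

Keying hot words by scope lets one engine serve many concurrent meetings, each with its own industry lexicon.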
According to at least one embodiment of the present disclosure, [Example 8] provides a speech recognition method, further including:
selecting, from a preset corpus, word vectors similar to the word vectors of the keywords, and using the words corresponding to the similar word vectors as similar words of the keywords;
performing speech recognition on the pulled audio data within the communication scope based on the similar words.
According to at least one embodiment of the present disclosure, [Example 9] provides a speech recognition method, further including:
when a current condition satisfies the lexicon selection condition, selecting the latest target industry representation vector;
updating the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the most recently selected target industry representation vector;
correspondingly, performing speech recognition on the pulled audio data within the communication scope based on the industry-specific lexicon corresponding to the target industry representation vector includes:
performing speech recognition on the pulled audio data within the communication scope based on the updated industry-specific lexicon.
According to at least one embodiment of the present disclosure, [Example 10] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, updating the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the most recently selected target industry representation vector includes:
determining whether the most recently selected target industry representation vector is the same as the previously selected target industry representation vector;
in response to the most recently selected target industry representation vector being different from the previously selected target industry representation vector, replacing the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the most recently selected target industry representation vector.
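Examples 9 and 10 together amount to a compare-then-replace on the active lexicon, sketched below. The `state` dictionary and the `update_lexicon` helper are illustrative names only, not part of the disclosure.

```python
def update_lexicon(state, new_target, lexicons):
    # Replace the active industry-specific lexicon only when the newly
    # selected target industry vector differs from the previous one;
    # if it is the same, the existing lexicon is kept unchanged.
    if state.get("target") != new_target:
        state["target"] = new_target
        state["lexicon"] = lexicons[new_target]
    return state

state = {"target": "ai", "lexicon": ["epoch"]}
lexicons = {"ai": ["epoch"], "legal": ["plaintiff"]}
update_lexicon(state, "legal", lexicons)
print(state["lexicon"])  # → ['plaintiff']
```

Skipping the replacement when the target is unchanged avoids re-sending an identical lexicon to the recognition engine on every trigger.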
According to at least one embodiment of the present disclosure, [Example 11] provides a speech recognition method, further including:
when a current condition satisfies the lexicon selection condition, selecting the latest similar words;
updating the previously determined similar words with the most recently selected similar words;
correspondingly, performing speech recognition on the pulled audio data within the communication scope based on the similar words includes:
performing speech recognition on the pulled audio data within the communication scope based on the updated similar words.
According to at least one embodiment of the present disclosure, [Example 12] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, updating the previously determined similar words with the most recently selected similar words includes:
removing, from the most recently selected similar words, the words that duplicate the previously determined similar words;
using the deduplicated most recently selected similar words, together with the previously determined similar words, as the updated similar words.
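The deduplicating merge of Example 12 can be sketched in a few lines; the helper name and the sample word lists are illustrative only.

```python
def merge_similar_words(previous, latest):
    # Drop newly selected similar words that duplicate the previous
    # set, then combine the two (order-preserving) as the updated set.
    deduped = [w for w in latest if w not in previous]
    return previous + deduped

print(merge_similar_words(["gradient", "epoch"], ["epoch", "optimizer"]))
# → ['gradient', 'epoch', 'optimizer']
```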
According to at least one embodiment of the present disclosure, [Example 13] provides a speech recognition method, further including:
In some optional implementations of the embodiments of the present disclosure, the method is applied to a real-time communication server, where the real-time communication server includes at least one of an instant messaging server, a multimedia conference server, a live video streaming server, and a group chat interaction server.
In addition, although the operations are depicted in a particular order, this should not be understood as requiring that the operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological logical acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (17)

  1. A speech recognition method, comprising:
    in response to a current condition satisfying a lexicon selection condition, obtaining caption text information within a communication scope, and extracting keywords from the caption text information;
    determining a representation vector of the caption text information according to word vectors of the keywords, and selecting, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the caption text information;
    performing speech recognition on pulled audio data within the communication scope based on an industry-specific lexicon corresponding to the target industry representation vector.
  2. The method according to claim 1, wherein the obtaining caption text information within a communication scope in response to a current condition satisfying a lexicon selection condition comprises:
    in response to the number of caption text information items accumulated within the communication scope reaching a first preset value for the first time, obtaining that first-preset-value number of caption text information items within the communication scope.
  3. The method according to claim 1, wherein the obtaining caption text information within a communication scope in response to a current condition satisfying a lexicon selection condition comprises at least one of the following:
    starting from the first determination of the industry-specific lexicon, obtaining, at every preset time interval, the caption text information produced within the communication scope during that interval; in response to each accumulation of a second preset value of caption text information items within the communication scope, obtaining that second-preset-value number of caption text information items within the communication scope.
  4. The method according to claim 1, wherein the extracting keywords from the caption text information comprises:
    extracting, from the caption text information, word information whose part of speech is a preset part of speech, filtering preset common words out of the extracted word information, and using the filtered word information as the keywords.
  5. The method according to claim 1, wherein the number of keywords is at least one, and the determining a representation vector of the caption text information according to word vectors of the keywords comprises:
    loading word vectors for the at least one keyword, and using the average vector of the loaded word vectors as the representation vector of the caption text information.
  6. The method according to claim 5, wherein the loading word vectors for the at least one keyword comprises:
    determining whether a preset corpus contains the at least one keyword;
    in response to the preset corpus containing all of the at least one keyword, reading the word vectors of the at least one keyword from a preset word vector store according to a preconfigured corpus-to-word-vector-store correspondence;
    in response to the preset corpus containing none of the at least one keyword, converting the at least one keyword into word vectors using a pre-trained word vector model;
    in response to the preset corpus containing some of the at least one keyword, reading the word vectors of those keywords from the preset word vector store according to the preconfigured corpus-to-word-vector-store correspondence, and converting the keywords among the at least one keyword that are not contained in the preset corpus into word vectors using the pre-trained word vector model; wherein the word vector model is trained on the preset corpus.
  7. The method according to claim 1, wherein the selecting, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the caption text information comprises:
    calculating the similarity between the representation vector of the caption text information and each candidate preset industry representation vector, and using the preset industry representation vector with the greatest similarity as the target industry representation vector.
  8. The method according to claim 1, wherein the performing speech recognition on pulled audio data within the communication scope based on an industry-specific lexicon corresponding to the target industry representation vector comprises:
    sending the industry-specific lexicon corresponding to the target industry representation vector to a speech recognition engine;
    sending the pulled audio data within the communication scope to the speech recognition engine, so that the speech recognition engine configures the industry-specific lexicon as hot words for the communication scope and performs speech recognition on the audio data within the communication scope according to those hot words.
  9. The method according to claim 1, further comprising:
    selecting, from a preset corpus, word vectors similar to the word vectors of the keywords, and using the words corresponding to the similar word vectors as similar words of the keywords;
    performing speech recognition on the pulled audio data within the communication scope based on the similar words.
  10. The method according to claim 1, further comprising:
    in response to a current condition satisfying the lexicon selection condition, selecting the latest target industry representation vector;
    updating the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the most recently selected target industry representation vector;
    wherein the performing speech recognition on pulled audio data within the communication scope based on an industry-specific lexicon corresponding to the target industry representation vector comprises:
    performing speech recognition on the pulled audio data within the communication scope based on the updated industry-specific lexicon.
  11. The method according to claim 10, wherein the updating the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the most recently selected target industry representation vector comprises:
    determining whether the most recently selected target industry representation vector is the same as the previously selected target industry representation vector;
    in response to the most recently selected target industry representation vector being different from the previously selected target industry representation vector, replacing the industry-specific lexicon corresponding to the previously selected target industry representation vector with the industry-specific lexicon corresponding to the most recently selected target industry representation vector.
  12. The method according to claim 9, further comprising:
    in response to a current condition satisfying the lexicon selection condition, selecting the latest similar words;
    updating the previously determined similar words with the most recently selected similar words;
    wherein the performing speech recognition on the pulled audio data within the communication scope based on the similar words comprises:
    performing speech recognition on the pulled audio data within the communication scope based on the updated similar words.
  13. The method according to claim 12, wherein the updating the previously determined similar words with the most recently selected similar words comprises:
    removing, from the most recently selected similar words, the words that duplicate the previously determined similar words;
    using the deduplicated most recently selected similar words, together with the previously determined similar words, as the updated similar words.
  14. The method according to any one of claims 1-13, wherein the method is applied to a real-time communication server, and the real-time communication server comprises at least one of an instant messaging server, a multimedia conference server, a live video streaming server, and a group chat interaction server.
  15. A speech recognition apparatus, comprising:
    a keyword extraction module configured to, in response to a current condition satisfying a lexicon selection condition, obtain caption text information within a communication scope and extract keywords from the caption text information;
    an industry representation vector selection module configured to determine a representation vector of the caption text information according to word vectors of the keywords, and to select, from preset industry representation vectors, a target industry representation vector similar to the representation vector of the caption text information;
    a speech recognition module configured to perform speech recognition on pulled audio data within the communication scope based on an industry-specific lexicon corresponding to the target industry representation vector.
  16. An electronic device, comprising:
    at least one processor; and
    a storage apparatus configured to store at least one program,
    wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the speech recognition method according to any one of claims 1-14.
  17. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the speech recognition method according to any one of claims 1-14.
PCT/CN2021/112754 2020-08-20 2021-08-16 Speech recognition method and apparatus, electronic device, and storage medium WO2022037526A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010842909.7 2020-08-20
CN202010842909.7A CN112037792B (zh) 2020-08-20 Speech recognition method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022037526A1 (zh)

Family

ID=73579934


Country Status (2)

Country Link
CN (1) CN112037792B (zh)
WO (1) WO2022037526A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996506A (zh) * 2022-05-24 2022-09-02 腾讯科技(深圳)有限公司 Corpus generation method and apparatus, electronic device, and computer-readable storage medium

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111161739B (zh) * 2019-12-28 2023-01-17 科大讯飞股份有限公司 Speech recognition method and related products
CN112037792B (zh) * 2020-08-20 2022-06-17 北京字节跳动网络技术有限公司 Speech recognition method and apparatus, electronic device, and storage medium
CN112509567B (zh) * 2020-12-25 2024-05-10 阿波罗智联(北京)科技有限公司 Method, apparatus, device, storage medium, and program product for speech data processing
CN113077789A (zh) * 2021-03-29 2021-07-06 南北联合信息科技有限公司 Real-time speech conversion method and system, computer device, and storage medium
CN113241070B (zh) * 2021-04-28 2024-02-27 北京字跳网络技术有限公司 Hot word recall and update method and apparatus, storage medium, and hot word system
CN113377904B (zh) * 2021-06-04 2024-05-10 百度在线网络技术(北京)有限公司 Industry action recognition method and apparatus, electronic device, and storage medium
CN113674743A (zh) * 2021-08-20 2021-11-19 云知声(上海)智能科技有限公司 Device and method for ASR result replacement in natural language processing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103236260A (zh) * 2013-03-29 2013-08-07 京东方科技集团股份有限公司 Speech recognition system
CN103514170A (zh) * 2012-06-20 2014-01-15 中国移动通信集团安徽有限公司 Text classification method and apparatus for speech recognition
KR20180110972A (ko) * 2017-03-30 2018-10-11 엘지전자 주식회사 Speech recognition method
CN109190125A (zh) * 2018-09-14 2019-01-11 广州达美智能科技有限公司 Method, apparatus, and storage medium for processing medical language text
CN109410923A (zh) * 2018-12-26 2019-03-01 中国联合网络通信集团有限公司 Speech recognition method, apparatus, and system, and storage medium
CN112037792A (zh) * 2020-08-20 2020-12-04 北京字节跳动网络技术有限公司 Speech recognition method and apparatus, electronic device, and storage medium

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU7978098A (en) * 1997-07-07 1999-02-08 Motorola, Inc. Modular speech recognition system and method
EP2058800B1 (en) * 2007-10-24 2010-09-01 Harman Becker Automotive Systems GmbH Method and system for recognizing speech for searching a database
CN102867511A (zh) * 2011-07-04 2013-01-09 余喆 Natural speech recognition method and apparatus
CN103700370B (zh) * 2013-12-04 2016-08-17 北京中科模识科技有限公司 Broadcast television speech recognition method and system
CN106683662A (zh) * 2015-11-10 2017-05-17 中国电信股份有限公司 Speech recognition method and apparatus
CN106528588A (zh) * 2016-09-14 2017-03-22 厦门幻世网络科技有限公司 Method and apparatus for matching resources to text information
US20180143970A1 (en) * 2016-11-18 2018-05-24 Microsoft Technology Licensing, Llc Contextual dictionary for transcription
WO2019171128A1 (en) * 2018-03-06 2019-09-12 Yogesh Chunilal Rathod In-media and with controls advertisement, ephemeral, actionable and multi page photo filters on photo, automated integration of external contents, automated feed scrolling, template based advertisement post and actions and reaction controls on recognized objects in photo or video
CN108847241B (zh) * 2018-06-07 2022-09-13 平安科技(深圳)有限公司 Method for recognizing conference speech as text, electronic device, and storage medium
KR20200074349A (ko) * 2018-12-14 2020-06-25 삼성전자주식회사 Method and apparatus for recognizing speech
CN109727598A (zh) * 2018-12-28 2019-05-07 浙江省公众信息产业有限公司 Intent recognition method in high-noise contexts
CN110544477A (zh) * 2019-09-29 2019-12-06 北京声智科技有限公司 Speech recognition method, apparatus, device, and medium
CN111274783B (zh) * 2020-01-14 2022-12-06 广东电网有限责任公司广州供电局 Intelligent identification method for bid rigging based on semantic similarity analysis


Also Published As

Publication number Publication date
CN112037792A (zh) 2020-12-04
CN112037792B (zh) 2022-06-17


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21857621; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 21857621; Country of ref document: EP; Kind code of ref document: A1)