WO2020244153A1 - Conference voice data processing method and apparatus, computer device, and storage medium - Google Patents

Conference voice data processing method and apparatus, computer device, and storage medium

Info

Publication number
WO2020244153A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
text
information
text information
keywords
Prior art date
Application number
PCT/CN2019/118267
Other languages
English (en)
French (fr)
Inventor
陈家荣
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020244153A1

Classifications

    • G PHYSICS
      • G10 MUSICAL INSTRUMENTS; ACOUSTICS
        • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
          • G10L15/00 Speech recognition
            • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
            • G10L15/04 Segmentation; Word boundary detection
              • G10L15/05 Word boundary detection
            • G10L15/08 Speech classification or search
              • G10L15/18 Speech classification or search using natural language modelling
                • G10L15/1822 Parsing for meaning understanding
            • G10L15/26 Speech to text systems
            • G10L15/28 Constructional details of speech recognition systems
              • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
          • G10L17/00 Speaker identification or verification techniques

Definitions

  • This application relates to a method, device, computer equipment and storage medium for processing conference voice data.
  • a method, apparatus, computer equipment, and storage medium for processing conference voice data are provided.
  • a method for processing conference voice data includes:
  • the corresponding conference report data is generated in a preset manner according to the conference theme and the keyword-annotated text information.
  • a conference voice data processing device includes:
  • a request receiving module configured to receive a meeting record request sent by a terminal, and send a recording instruction to the terminal according to the meeting record request, where the meeting record request carries a meeting subject;
  • a data acquisition module for acquiring voice data uploaded by the terminal according to the recording instruction
  • the feature extraction module is used to perform feature extraction on the voice data to obtain multiple voice feature information
  • the voiceprint recognition module is used to input the multiple pieces of voice feature information into the trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and to convert the multiple voice segments into corresponding text information according to the voiceprint identifiers;
  • a semantic analysis module used to input the text information into a trained semantic analysis model, analyze keywords and correction information in the text information, and generate analysis results using the keywords and correction information;
  • a text correction module configured to correct the text information according to the correction information in the analysis result, and add corresponding keywords to the corrected text information
  • the conference report generating module is used to generate corresponding conference report data in a preset manner according to the conference theme and the keyword-annotated text information.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:
  • the corresponding conference report data is generated in a preset manner according to the conference theme and the keyword-annotated text information.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions, which when executed by one or more processors, cause the one or more processors to perform the following steps:
  • the corresponding conference report data is generated in a preset manner according to the conference theme and the keyword-annotated text information.
  • Fig. 1 is an application scenario diagram of a method for processing conference voice data according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a method for processing conference voice data according to one or more embodiments.
  • Fig. 3 is a schematic flowchart of the step of performing voiceprint recognition on voice feature information through a voiceprint recognition model according to one or more embodiments.
  • Fig. 4 is a schematic flowchart of the steps of analyzing text information through a semantic analysis model according to one or more embodiments.
  • Fig. 5 is a block diagram of an apparatus for processing conference voice data according to one or more embodiments.
  • Figure 6 is a block diagram of a computer device according to one or more embodiments.
  • the conference voice data processing method provided in this application can be applied to the application environment shown in FIG. 1.
  • the terminal 102 communicates with the server 104 through the network.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • the user can send a meeting record request to the server 104 through the corresponding terminal 102 to record the voice during the meeting.
  • the meeting record request carries the meeting subject.
  • after the server 104 receives the meeting record request sent by the terminal 102, it sends a recording instruction to the terminal 102 according to the request, and the terminal 102 records the voice data during the meeting according to the recording instruction and uploads it.
  • the server 104 obtains the voice data uploaded by the terminal 102 according to the recording instruction, performs feature extraction on the voice data to obtain multiple pieces of voice feature information, obtains a preset voiceprint recognition model, and performs voiceprint recognition on the voice feature information through that model, which effectively yields multiple voice segments and the corresponding voiceprint identifiers. The server 104 then converts the multiple voice segments into corresponding text information according to the voiceprint identifiers.
  • the server 104 further obtains a preset semantic analysis model and performs contextual semantic analysis on the text information through it, so that the keywords and correction information in the text information can be analyzed accurately and effectively; it corrects the text information according to the correction information, adds the corresponding keywords to the corrected text information, and then generates corresponding meeting report data in a preset manner according to the meeting theme and the keyword-annotated text information.
  • a method for processing conference voice data is provided. Taking the method as applied to the server in Fig. 1 as an example, the method includes the following steps:
  • Step 202: Receive a meeting record request sent by the terminal, and send a recording instruction to the terminal according to the meeting record request.
  • the meeting record request carries the meeting subject.
  • Step 204: Acquire voice data uploaded by the terminal according to the recording instruction.
  • Users can register on the application in advance, and collect each user's voice for voiceprint recognition to verify identity.
  • the server uses the registered voiceprint information and user identification of multiple users to generate a voiceprint model library.
  • a user holds a meeting, he can use the terminal to record the meeting voice information during the meeting.
  • the terminal can initiate a meeting record request to the server, and the meeting record request carries the meeting keyword.
  • after the server receives the meeting record request sent by the terminal, it sends a recording instruction to the terminal, and the terminal then records according to the recording instruction and uploads the recorded voice data to the server in real time.
  • Step 206: Perform feature extraction on the voice data to obtain multiple pieces of voice feature information.
  • after the server receives the voice data uploaded by the terminal, it preprocesses the voice signal. For example, the server can obtain the voice signal in the voice data and perform preprocessing such as noise suppression on it to obtain a preprocessed voice signal. The server further performs feature extraction on the preprocessed voice signal data and performs voice endpoint detection on the feature-extracted voice signal, then splits the voice data into multiple pieces of voice feature information according to the voice endpoints.
  • Step 208: Input the multiple pieces of voice feature information into the trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and convert the multiple voice segments into corresponding text information according to the voiceprint identifiers.
  • the server further obtains the preset voiceprint recognition model and inputs the preprocessed voice signal data into the pre-trained model; the model computes the feature parameter sequences corresponding to the multiple pieces of voice feature information and splits the voice signal into multiple voice segments according to the similarity of those sequences, and the server performs matching in the voiceprint model library to obtain the matching voiceprint identifiers, so that the voiceprint recognition model can be used to recognize the multiple voice segments and their corresponding voiceprint identifiers.
  • the server further converts the voice signal into corresponding text information according to the recognized voiceprint identifier.
  • Step 210: Input the text information into the trained semantic analysis model, analyze the keywords and correction information in the text information, and use the keywords and correction information to generate an analysis result.
  • after the server converts the voice signal into corresponding text information, it further obtains a preset semantic analysis model.
  • the semantic analysis model may be a semantic analysis model trained in advance using a large amount of corpus data.
  • the server inputs the recognized text information into the trained semantic analysis model, analyzes the recognized text information according to the context semantics through the semantic analysis model, analyzes the ambiguous or unclear text information, and then analyzes the correction information.
  • the correction information may include incorrect text, replacement text, and corresponding text position.
  • the server also performs context analysis on the text information through the semantic analysis model, and identifies keywords that appear frequently in the text information, obtains the analyzed keywords, and then obtains the analysis results containing the keywords and correction information.
  • Step 212: Correct the text information according to the correction information in the analysis result, and add corresponding keywords to the corrected text information.
  • Step 214: Generate corresponding conference report data in a preset manner according to the conference theme and the keyword-annotated text information.
  • after the server analyzes the text information through the semantic analysis model and obtains the analysis result, it adjusts and corrects the contextually ambiguous or unclear text according to the correction information.
  • the server adds keywords in the corresponding positions of the text information according to the analyzed keywords.
  • the server can also adjust the text information according to the preset characters corresponding to the keywords to obtain the summarized text information.
  • after the server converts all the voice data from the conference into corresponding text information, it further obtains the preset conference report template according to the conference theme and generates the corresponding conference report data in a preset manner from the report template and the keyword-annotated text information.
  • the server can accurately and effectively identify each speaker and the corresponding voice in the meeting by performing voice recognition and voiceprint recognition on the voice data in the meeting.
  • the server converts the voice data into corresponding text information according to the user identifiers and generates the corresponding meeting report data from the text information in a preset manner, so that the meeting report data can be generated effectively.
  • after the server receives the voice data uploaded by the terminal, it performs feature extraction on the voice data to obtain multiple pieces of voice feature information, obtains a preset voiceprint recognition model, and performs voiceprint recognition on the voice feature information through that model, so that multiple voice segments can be effectively obtained and the corresponding voiceprint identifiers effectively recognized.
  • the server then converts the multiple voice segments into corresponding text information according to the voiceprint identifiers.
  • the server further obtains the preset semantic analysis model and performs contextual semantic analysis on the text information through it, which makes it possible to accurately and effectively analyze the keywords and correction information in the text information and to correct the text information according to the correction information.
  • the step of performing feature extraction on the voice data to obtain multiple pieces of voice feature information includes: obtaining the voice signal of the voice data, framing and windowing the voice signal, and extracting the corresponding acoustic features and spectral features; converting the acoustic features and spectral features to obtain the corresponding acoustic feature vectors and spectral feature vectors; inputting the acoustic feature vectors and spectral feature vectors into the trained voice endpoint detection model, and detecting multiple starting points and ending points of the voice signal through that model; and splitting the voice data into multiple pieces of voice feature information according to those starting points and ending points.
  • after receiving the voice data uploaded by the terminal, the server performs acoustic feature extraction on it. Specifically, the server extracts the voice signal from the voice data; the voice signal in the voice data uploaded by the terminal is usually a noisy voice signal. After obtaining the voice signal, the server windows and frames it, extracts the corresponding acoustic and spectral features, and converts them to obtain the corresponding acoustic feature vectors and spectral feature vectors.
  • the server further obtains a preset voice endpoint detection model, and the voice endpoint detection model may be a model that has been trained in advance.
  • the server inputs the acoustic feature vectors and spectral feature vectors into the voice endpoint detection model, which classifies them and yields a decision value for each acoustic or spectral feature vector.
  • when the obtained decision value meets the preset first threshold, a voice label is added to the acoustic or spectral feature vector; the first threshold may be a range of values.
  • when the obtained decision value meets the preset second threshold, a non-voice label is added to the acoustic or spectral feature vector. The voice-labelled acoustic feature vectors and voice-labelled spectral feature vectors are then obtained and parsed to recover the voice-labelled voice signal; the multiple starting points and ending points of the voice signal are determined from its time sequence and the voice labels, and the voice data is then split into multiple pieces of voice feature information accordingly. By using the voice endpoint detection model to perform endpoint detection and classification on the voice signal, the voice and non-voice portions of a noisy voice signal can be identified accurately, and the voice feature information in the voice data can be extracted effectively.
  • the method further includes: acquiring multiple pieces of voice sample data, the voice sample data including labelled sample data and unlabelled sample data; generating a training set from the labelled sample data and a verification set from the unlabelled sample data; inputting the voice sample data in the training set into the preset voiceprint recognition model for training to obtain an initial voiceprint recognition model; inputting the voice sample data in the verification set into the initial voiceprint recognition model for training and verification; and stopping training when the number of voice sample data in the verification set that meet the preset matching-degree value reaches the target threshold, thereby obtaining the trained voiceprint recognition model.
  • before the server obtains the preset voiceprint recognition model, it needs to construct the model in advance. Specifically, the server may first obtain a large amount of voice sample data.
  • the voice sample data includes labelled sample data and unlabelled sample data.
  • the labelled sample data is voice sample data whose voiceprint identifiers have been annotated in advance.
  • the server divides the voice sample data into a training set and a verification set. Specifically, the server generates the training set from the labelled sample data, so the training set contains annotated voice sample data, and generates the verification set from the unlabelled sample data.
  • the server inputs the voice sample data in the training set into the preset voiceprint recognition model for training and obtains the initial voiceprint recognition model.
  • the server then inputs the voice sample data in the verification set into the initial voiceprint recognition model for continuous training and verification. When the number of voice sample data in the verification set that meet the preset matching-degree value reaches the preset threshold, training stops and the trained voiceprint recognition model is obtained. The server further inputs the users' voiceprints from the voiceprint model library into the trained model, thereby effectively constructing a voiceprint recognition model with high recognition accuracy.
  • the step of performing voiceprint recognition on voice feature information through a voiceprint recognition model specifically includes the following content:
  • Step 302: Calculate feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model.
  • Step 304: Segment the voice feature information according to the feature parameter sequences to obtain multiple voice segments.
  • Step 306: Calculate the similarity of the feature parameter sequences of the multiple voice segments, and group the voice segments whose similarity reaches a preset threshold.
  • Step 308: Perform matching in the voiceprint model library according to the grouped feature parameter sequences, and add a matching voiceprint identifier to each grouped voice segment.
  • after the server performs feature extraction on the voice signal data to obtain multiple pieces of voice feature information, it obtains a preset voiceprint recognition model.
  • the voiceprint recognition model may be a model obtained by the server through training using a large amount of voice sample data in advance.
  • the server inputs multiple voice feature information into the voiceprint recognition model, and performs voiceprint recognition on the voice feature information through the voiceprint recognition model.
  • the server computes the feature parameter sequence in each piece of voice feature information through the voiceprint recognition model; each piece of voice feature information may include voice segments with different feature parameter sequences.
  • the server splits the voice feature information into multiple voice segments according to the different feature parameter sequences and groups voice segments that share the same feature parameter sequence; for example, multiple voice segments with the same feature parameter sequence can be grouped into the same class.
  • the server then performs matching in the voiceprint model library according to the classified feature parameter sequence, and the voiceprint model library pre-stores the voiceprint features and corresponding voiceprint identifiers corresponding to multiple speakers.
  • the server matches the feature parameter sequences of the multiple voice segments against the voiceprint features in the voiceprint model library and extracts the voiceprint identifier corresponding to the feature parameter sequence with the highest matching degree, which effectively matches each voice segment to its voiceprint identifier.
  • the server adds the recognized corresponding voiceprint identifier to each voice segment, thereby effectively identifying the voice segment information corresponding to each speaker in the voice data.
  • the feature parameter sequence corresponding to each voice feature segment is computed through the voiceprint recognition model and compared with the feature parameter sequences in the preset voiceprint model library; the voiceprint identifier of the user whose feature parameter sequence has the highest matching degree is extracted and added to the corresponding voice feature segment, so that each speaker in the meeting and the corresponding speech can be identified accurately and effectively.
  • after the server converts the multiple voice segments into corresponding text information, it further obtains a preset semantic analysis model, performs contextual semantic analysis on the text information through it, and obtains an analysis result including keywords and correction information.
  • the server corrects the text information according to the correction information, adds the corresponding keywords to the corrected text information, and then generates corresponding meeting report data in a preset manner according to the meeting theme and the keyword-annotated text information. The corresponding conference report data can therefore be generated effectively, and the processing efficiency and recognition accuracy of the conference voice data can be improved effectively.
  • the steps of performing contextual semantic analysis on text information through a semantic analysis model specifically include the following:
  • Step 402: Perform context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts.
  • Step 404: Identify keywords in the text information according to the semantics and word frequencies of the multiple texts.
  • Step 406: Analyze the text to be corrected and the corresponding text positions in the text information according to the semantics of the multiple texts and the keywords, and determine the correction text corresponding to the text to be corrected according to the semantics of the multiple texts.
  • Step 408: Generate correction information from the text to be corrected, the correction text, and the corresponding text positions, and generate an analysis result of the text information using the keywords and the correction information.
  • after the server receives the voice data uploaded by the terminal, it performs feature extraction on the voice data to obtain multiple pieces of voice feature information, obtains a preset voiceprint recognition model, and performs voiceprint recognition on the voice feature information through that model. In this way multiple voice segments can be effectively obtained and the corresponding voiceprint identifiers effectively recognized, and the server then converts the multiple voice segments into corresponding text information according to the voiceprint identifiers.
  • after the server obtains the text information corresponding to the voice data, it further obtains a preset semantic analysis model and performs contextual semantic analysis on the text information through it.
  • the semantic analysis model may be a machine learning model based on a neural network.
  • the server performs context analysis on the text information through the semantic analysis model, analyzes the semantics of multiple texts, and recognizes texts that appear frequently in the text information, and obtains analyzed keywords based on the semantics and word frequencies of multiple texts.
  • the word frequency can be the frequency of a certain word or phrase in the text information.
  • the server further analyzes the ambiguous or unclear text according to the semantics of the multiple texts and the keywords through the semantic analysis model, derives the text to be corrected and the corresponding text positions in the text information, and determines the correction text corresponding to the text to be corrected according to the semantics of the multiple texts.
  • the server generates correction information according to the text to be corrected, the corrected text, and the corresponding text position.
  • the server corrects the text information using the keywords and correction information, adds the corresponding keywords to the corrected text information, and then generates the corresponding meeting report data in a preset manner according to the meeting theme and the keyword-annotated text information. This effectively generates the corresponding conference report data and thereby improves the processing efficiency and recognition accuracy of the conference voice data.
  • the step of correcting the text information according to the correction information includes: determining the position to be corrected in the text information according to the correction information; and replacing the text to be corrected at the position to be corrected with the correction text in the correction information and deleting the text to be corrected, to obtain the corrected text information.
  • after the server converts the voice data into corresponding text information, it further obtains a preset semantic analysis model and performs contextual semantic analysis on the text information through it.
  • the semantic analysis model may be a machine learning model based on a neural network.
  • the server performs context analysis on the text information through the semantic analysis model, analyzes the semantics of multiple texts, and recognizes texts that appear frequently in the text information, and obtains analyzed keywords based on the semantics and word frequencies of multiple texts.
  • the server further analyzes the ambiguous or unclear text based on the semantics of the multiple texts and the keywords through the semantic analysis model, and derives the correction information in the text information.
  • the correction information can include the text to be corrected, the correction text, and the corresponding text positions.
  • the server corrects the text information using the keywords and correction information. Specifically, the server determines the position to be corrected in the text information according to the correction information, replaces the text to be corrected at that position with the correction text in the correction information, and deletes the text to be corrected, thereby changing the text to be corrected into the correction text and obtaining the corrected text information.
  • the server adds a corresponding keyword to the corrected text information. Specifically, the server adds the keyword to the text area corresponding to the keyword according to the recognized keyword.
  • the server further generates the corresponding meeting report data in a preset manner according to the meeting theme and the keyword-annotated text information, so that the report data can be generated effectively. By analyzing the semantics and word frequencies of multiple texts through the semantic analysis model to derive the keywords and correction information, and adjusting and correcting the text information accordingly, the recognition accuracy of conference speech data can be improved effectively.
  • the method further includes: receiving a query request sent by the terminal, the query request carrying a keyword; obtaining the conference text content associated with the keyword according to the keyword; sending the text content to the terminal in a preset manner , And display it.
  • after the server generates the corresponding meeting report data from the voice data of the meeting, it stores that data as the report data.
  • Users can use keywords to query the corresponding meeting text content in the meeting report data.
  • the user can send a query request to the server through the corresponding user terminal, and the query request carries the conference subject and keywords.
  • the keyword may also include a user identification, and the user identification and the voiceprint identification may be consistent.
  • after receiving the query request sent by the user terminal, the server obtains from the database the conference text content in the conference report data associated with the keyword, according to the conference theme and keyword.
  • the conference text content is sent to the user terminal in a preset manner; for example, the text content can be highlighted. This effectively enables the user to quickly and conveniently find the meeting content they need.
  • a conference speech data processing device is provided, including: a data acquisition module 502, a feature extraction module 504, a voiceprint recognition module 506, a semantic analysis module 508, a text correction module 510, and a meeting report generating module 512, where:
  • the data acquisition module 502 is configured to receive a meeting record request sent by the terminal, send a recording instruction to the terminal according to the meeting record request, the meeting record request carrying the meeting subject, and acquire the voice data uploaded by the terminal according to the recording instruction;
  • the feature extraction module 504 is configured to perform feature extraction on voice data to obtain multiple voice feature information
  • the voiceprint recognition module 506 is used to input the multiple pieces of voice feature information into the trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and to convert the multiple voice segments into corresponding text information according to the voiceprint identifiers;
  • the semantic analysis module 508 is configured to input the text information into the trained semantic analysis model, analyze keywords and correction information in the text information, and generate analysis results using the keywords and correction information;
  • the text correction module 510 is configured to correct the text information according to the correction information in the analysis result, and add corresponding keywords to the corrected text information;
  • the conference report generating module 512 is configured to generate corresponding conference report data in a preset manner according to the conference theme and the keyword-annotated text information.
  • the feature extraction module 504 is also used to obtain the voice signal of the voice data, frame and window the voice signal, and extract the corresponding acoustic and spectral features; convert the acoustic and spectral features to obtain the corresponding acoustic and spectral feature vectors; input the acoustic and spectral feature vectors into the trained voice endpoint detection model, and detect multiple starting points and ending points of the voice signal through that model; and split the voice data into multiple pieces of voice feature information according to the multiple starting points and ending points of the voice signal.
  • the device further includes a voiceprint recognition model training module for acquiring multiple pieces of voice sample data.
  • the voice sample data includes labelled sample data and unlabelled sample data; the labelled sample data is used to generate a training set.
  • the unlabelled sample data is used to generate a verification set; the voice sample data in the training set is input into the preset voiceprint recognition model for training to obtain the initial voiceprint recognition model; the voice sample data in the verification set is input into the initial voiceprint recognition model for training and verification; and when the number of voice sample data in the verification set that meet the preset matching-degree value reaches the target threshold, training stops and the trained voiceprint recognition model is obtained.
  • the voiceprint recognition module 506 is further configured to compute the feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model; split the voice feature information according to the feature parameter sequences to obtain multiple voice segments; compute the similarity of the feature parameter sequences of the multiple voice segments and group the voice segments whose similarity reaches a preset threshold; and match the grouped feature parameter sequences in the voiceprint model library and add the matching voiceprint identifier to each grouped voice segment.
  • the semantic analysis module 508 is also used to perform context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts; identify keywords in the text information according to the semantics and word frequencies of the multiple texts; analyze the text to be corrected and the corresponding text positions in the text information according to the semantics of the multiple texts and the keywords, and determine the correction text corresponding to the text to be corrected according to the semantics of the multiple texts; and generate the correction information from the text to be corrected, the correction text, and the corresponding text positions, and generate the analysis result of the text information using the keywords and correction information.
  • the text correction module 510 is further configured to determine the position to be corrected in the text information according to the correction information; and replace the text to be corrected at the position to be corrected with the correction text in the correction information and delete the text to be corrected, obtaining the corrected text information.
  • the device further includes a query module, configured to receive a query request sent by the terminal, the query request carries a conference subject and keywords; and obtain the conference text content associated with the keywords according to the conference subject and keywords; The text content is sent to the terminal in a preset manner, and displayed in the preset manner.
  • the various modules in the above-mentioned conference voice data processing device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 6.
  • the computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer equipment is used to store data such as voice data, voice feature information, voiceprint model library, text information, and conference report data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • Fig. 6 is only a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; the specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the following steps:
  • the corresponding conference report data is generated in a preset manner according to the conference theme and the keyword-annotated text information.
  • One or more non-volatile computer-readable storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors perform the following steps:
  • the corresponding conference report data is generated in a preset manner according to the conference theme and the keyword-annotated text information.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

A conference voice data processing method and apparatus, a computer device, and a storage medium. The method includes: receiving a meeting record request sent by a terminal, sending a recording instruction to the terminal according to the meeting record request, and acquiring voice data uploaded by the terminal; performing feature extraction on the voice data to obtain multiple pieces of voice feature information; performing voiceprint recognition on the voice feature information through a preset voiceprint recognition model to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information; analyzing keywords and correction information in the text information through a preset semantic analysis model, correcting the text information according to the correction information, and adding the corresponding keywords to the corrected text information; and generating corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.

Description

Conference voice data processing method and apparatus, computer device, and storage medium
Cross-reference to related applications:
This application claims priority to Chinese patent application No. 2019104945807, entitled "Conference voice data processing method and apparatus, computer device, and storage medium" and filed with the Chinese Patent Office on June 5, 2019, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to a conference voice data processing method and apparatus, a computer device, and a storage medium.
Background
With the rapid development of the economy and the Internet, commercial society and business models are also developing rapidly, the demand for business meetings of all kinds keeps growing, and meetings usually contain valuable information that needs to be recorded.
However, meeting minutes are traditionally taken by hand, which makes meeting recording inefficient. With the rapid development of artificial intelligence, approaches have emerged that extract meeting minutes from audio and video conferences, typically by extracting the voice information from the meeting audio and video and converting it into corresponding text information through speech recognition for storage. For long meetings with extensive content, however, the generated text is long and unwieldy, and it is impossible to tell which speaker a given piece of content came from, so meeting recording is inefficient and the recognition accuracy of conference voice data is low. How to effectively improve the recognition accuracy of conference voice data has therefore become a technical problem to be solved.
Summary
According to various embodiments disclosed in this application, a conference voice data processing method and apparatus, a computer device, and a storage medium are provided.
A conference voice data processing method includes:
receiving a meeting record request sent by a terminal, and sending a recording instruction to the terminal according to the meeting record request, the meeting record request carrying a meeting subject;
acquiring voice data uploaded by the terminal according to the recording instruction;
performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result using the keywords and correction information;
correcting the text information according to the correction information in the analysis result, and adding the corresponding keywords to the corrected text information; and
generating corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
A conference voice data processing apparatus includes:
a request receiving module configured to receive a meeting record request sent by a terminal and send a recording instruction to the terminal according to the meeting record request, the meeting record request carrying a meeting subject;
a data acquisition module configured to acquire voice data uploaded by the terminal according to the recording instruction;
a feature extraction module configured to perform feature extraction on the voice data to obtain multiple pieces of voice feature information;
a voiceprint recognition module configured to input the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and convert the multiple voice segments into corresponding text information according to the voiceprint identifiers;
a semantic analysis module configured to input the text information into a trained semantic analysis model, analyze keywords and correction information in the text information, and generate an analysis result using the keywords and correction information;
a text correction module configured to correct the text information according to the correction information in the analysis result and add the corresponding keywords to the corrected text information; and
a meeting report generation module configured to generate corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
receiving a meeting record request sent by a terminal, and sending a recording instruction to the terminal according to the meeting record request, the meeting record request carrying a meeting subject;
acquiring voice data uploaded by the terminal according to the recording instruction;
performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result using the keywords and correction information;
correcting the text information according to the correction information in the analysis result, and adding the corresponding keywords to the corrected text information; and
generating corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a meeting record request sent by a terminal, and sending a recording instruction to the terminal according to the meeting record request, the meeting record request carrying a meeting subject;
acquiring voice data uploaded by the terminal according to the recording instruction;
performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result using the keywords and correction information;
correcting the text information according to the correction information in the analysis result, and adding the corresponding keywords to the corrected text information; and
generating corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
Details of one or more embodiments of this application are set forth in the drawings and description below. Other features and advantages of this application will become apparent from the specification, the drawings, and the claims.
Brief description of the drawings
The drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is an application scenario diagram of a conference voice data processing method according to one or more embodiments.
Fig. 2 is a schematic flowchart of a conference voice data processing method according to one or more embodiments.
Fig. 3 is a schematic flowchart of the step of performing voiceprint recognition on voice feature information through a voiceprint recognition model according to one or more embodiments.
Fig. 4 is a schematic flowchart of the step of analyzing text information through a semantic analysis model according to one or more embodiments.
Fig. 5 is a block diagram of a conference voice data processing apparatus according to one or more embodiments.
Fig. 6 is a block diagram of a computer device according to one or more embodiments.
Detailed description
To make the technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain this application and not to limit it.
The conference voice data processing method provided in this application can be applied in the application environment shown in Fig. 1. The terminal 102 communicates with the server 104 through a network. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device, and the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers. A user can send a meeting record request to the server 104 through the corresponding terminal 102 to record the speech during a meeting; the meeting record request carries the meeting subject. After receiving the meeting record request sent by the terminal 102, the server 104 sends a recording instruction to the terminal 102 according to the request, and the terminal 102 records the voice data during the meeting according to the recording instruction and uploads it. The server 104 acquires the voice data uploaded by the terminal 102 according to the recording instruction, performs feature extraction on the voice data to obtain multiple pieces of voice feature information, acquires a preset voiceprint recognition model, and performs voiceprint recognition on the voice feature information through that model, thereby effectively obtaining multiple voice segments and recognizing the corresponding voiceprint identifiers; the server 104 then converts the multiple voice segments into corresponding text information according to the voiceprint identifiers. The server 104 further acquires a preset semantic analysis model and performs contextual semantic analysis on the text information through it, so that the keywords and correction information in the text information can be analyzed accurately and effectively; it corrects the text information according to the correction information, adds the corresponding keywords to the corrected text information, and then generates corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
In one embodiment, as shown in Fig. 2, a conference voice data processing method is provided. Taking the method as applied to the server in Fig. 1 as an example, it includes the following steps:
Step 202: receive a meeting record request sent by the terminal, and send a recording instruction to the terminal according to the meeting record request, the meeting record request carrying the meeting subject.
Step 204: acquire the voice data uploaded by the terminal according to the recording instruction.
Users can register in the application in advance, and each user's voice is collected for voiceprint recognition to verify identity. The server uses the registered voiceprint information and user identifiers of multiple users to generate a voiceprint model library. When holding a meeting, a user can record the meeting voice information through the terminal. The terminal can initiate a meeting record request to the server; the meeting record request carries the meeting keyword. There may be one terminal or several. After receiving the meeting record request sent by the terminal, the server sends a recording instruction to the terminal, and the terminal then records according to the recording instruction and uploads the recorded voice data to the server in real time.
Step 206: perform feature extraction on the voice data to obtain multiple pieces of voice feature information.
After receiving the voice data uploaded by the terminal, the server preprocesses the voice signal. For example, the server can obtain the voice signal in the voice data and perform preprocessing such as noise suppression on it to obtain a preprocessed voice signal. The server further performs feature extraction on the preprocessed voice signal data and performs voice endpoint detection on the feature-extracted voice signal, then splits the voice data into multiple pieces of voice feature information according to the voice endpoints.
Step 208: input the multiple pieces of voice feature information into the trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and convert the multiple voice segments into corresponding text information according to the voiceprint identifiers.
The server further acquires the preset voiceprint recognition model and inputs the preprocessed voice signal data into the pre-trained model. The model computes the feature parameter sequences corresponding to the multiple pieces of voice feature information, the voice signal is split into multiple voice segments according to the similarity of those sequences, and the server performs matching in the voiceprint model library to obtain the matching voiceprint identifiers; in this way the voiceprint recognition model can be used to recognize the multiple voice segments and the corresponding voiceprint identifiers. The server then converts the voice signal into corresponding text information according to the recognized voiceprint identifiers.
Step 210: input the text information into the trained semantic analysis model, analyze the keywords and correction information in the text information, and generate an analysis result using the keywords and correction information.
After converting the voice signal into corresponding text information, the server further acquires a preset semantic analysis model, which may be one trained in advance on a large amount of corpus data. The server inputs the recognized text information into the trained semantic analysis model, which analyzes it according to contextual semantics, identifies ambiguous or unclear text, and derives the correction information. The correction information may include the erroneous text, the replacement text, and the corresponding text positions. The server also performs context analysis on the text information through the semantic analysis model, identifies keywords that appear with high frequency in the text information, obtains the analyzed keywords, and thus obtains an analysis result containing the keywords and the correction information.
Step 212: correct the text information according to the correction information in the analysis result, and add the corresponding keywords to the corrected text information.
Step 214: generate corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
After analyzing the text information through the semantic analysis model and obtaining the analysis result, the server adjusts and corrects the contextually ambiguous or unclear text according to the correction information. The server also adds the analyzed keywords at the corresponding positions in the text information, and can further adjust the text information according to the preset characters corresponding to the keywords to obtain summarized text information.
After all the voice data from the meeting has been converted into corresponding text information, the server further acquires the preset meeting report template according to the meeting subject and generates the corresponding meeting report data in a preset manner from the report template and the keyword-annotated text information. By performing speech recognition and voiceprint recognition on the voice data of the meeting, the server can accurately and effectively identify each speaker and the corresponding speech; it converts the voice data into corresponding text information according to the user identifiers and generates the corresponding meeting report data from that text information in the preset manner, so that the meeting report data can be generated effectively.
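As an illustration of this template step, the short Python sketch below looks up a hypothetical report template by meeting subject and fills in the subject, keywords, and keyword-annotated transcript. The template registry and its field names are assumptions for the example; the patent only states that a preset template is selected according to the meeting subject.

```python
# Hypothetical template registry keyed by meeting subject; the field names
# {subject}, {keywords}, {transcript} are assumptions, not the patent's format.
REPORT_TEMPLATES = {
    "weekly-sync": "Meeting: {subject}\nKeywords: {keywords}\n\n{transcript}\n",
}

def build_report(subject, keywords, transcript_lines):
    """Fetch the preset template for the subject and fill in the report data."""
    template = REPORT_TEMPLATES.get(subject, "{subject}\n{keywords}\n{transcript}\n")
    return template.format(subject=subject,
                           keywords=", ".join(keywords),
                           transcript="\n".join(transcript_lines))

print(build_report("weekly-sync", ["budget", "hiring"],
                   ["spk_A: budget approved", "spk_B: two hiring reqs open"]))
```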
In the above conference voice data processing method, after receiving the voice data uploaded by the terminal, the server performs feature extraction on it to obtain multiple pieces of voice feature information, acquires the preset voiceprint recognition model, and performs voiceprint recognition on the voice feature information through that model, so that multiple voice segments can be effectively obtained and the corresponding voiceprint identifiers effectively recognized; the server then converts the multiple voice segments into corresponding text information according to the voiceprint identifiers. The server further acquires the preset semantic analysis model and performs contextual semantic analysis on the text information through it, so that the keywords and correction information in the text information can be analyzed accurately and effectively; it corrects the text information according to the correction information, adds the corresponding keywords to the corrected text information, and then generates the corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information. The corresponding meeting report data can thus be generated effectively, which in turn effectively improves the processing efficiency and recognition accuracy of conference voice data.
In one embodiment, the step of performing feature extraction on the voice data to obtain multiple pieces of voice feature information includes: obtaining the voice signal of the voice data, framing and windowing the voice signal, and extracting the corresponding acoustic features and spectral features; converting the acoustic features and spectral features to obtain the corresponding acoustic feature vectors and spectral feature vectors; inputting the acoustic feature vectors and spectral feature vectors into a trained voice endpoint detection model, and detecting multiple starting points and ending points of the voice signal through the model; and splitting the voice data into multiple pieces of voice feature information according to the multiple starting points and ending points of the voice signal.
After receiving the voice data uploaded by the terminal, the server performs acoustic feature extraction on it. Specifically, the server extracts the voice signal from the voice data; the voice signal in the voice data uploaded by the terminal is usually a noisy voice signal. After obtaining the voice signal, the server windows and frames it, extracts the corresponding acoustic and spectral features, and converts them to obtain the corresponding acoustic feature vectors and spectral feature vectors.
The server further acquires a preset voice endpoint detection model, which may be a model trained in advance. The server inputs the acoustic feature vectors and spectral feature vectors into the voice endpoint detection model, which classifies them and yields a decision value for each acoustic or spectral feature vector. When the obtained decision value meets the preset first threshold, a voice label is added to the acoustic or spectral feature vector; the first threshold may be a range of values. When the obtained decision value meets the preset second threshold, a non-voice label is added to the acoustic or spectral feature vector. The voice-labelled acoustic feature vectors and voice-labelled spectral feature vectors are then obtained and parsed to recover the voice-labelled voice signal; the multiple starting points and ending points of the voice signal are determined from its time sequence and the voice labels, and the voice data is then split into multiple pieces of voice feature information accordingly. By using the voice endpoint detection model to perform endpoint detection and classification on the voice signal, the voice and non-voice portions of a noisy voice signal can be identified accurately, and the voice feature information in the voice data can be extracted effectively.
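The framing, windowing, and endpoint detection described above can be illustrated with a small sketch. The Python fragment below is a minimal stand-in, not the patent's model: it frames and windows a signal with numpy and uses a simple log-energy threshold in place of the trained endpoint detection model's decision values, so the frame sizes and thresholds are illustrative assumptions.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping, Hamming-windowed frames
    (25 ms frames with a 10 ms hop at a 16 kHz sampling rate)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

def detect_endpoints(signal, hop=160, energy_thresh=0.01):
    """Return (start, end) sample ranges of detected speech.

    A trained endpoint detection model would emit per-frame decision values;
    here a log-energy threshold stands in for those decisions."""
    frames = frame_signal(signal)
    log_energy = np.log1p((frames ** 2).mean(axis=1))   # crude acoustic feature
    is_speech = log_energy > energy_thresh              # "voice label" per frame
    segments, start = [], None
    for i, speech in enumerate(is_speech):
        if speech and start is None:
            start = i * hop                             # starting point
        elif not speech and start is not None:
            segments.append((start, i * hop))           # ending point
            start = None
    if start is not None:
        segments.append((start, len(signal)))
    return segments

# 1 s of low-level noise with a tone burst standing in for speech.
rng = np.random.default_rng(0)
sig = rng.normal(0, 0.005, 16000)
sig[6000:10000] += np.sin(2 * np.pi * 220 * np.arange(4000) / 16000)
print(detect_endpoints(sig))   # -> one span near samples 6000-10000
```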
In one embodiment, the method further includes: acquiring multiple pieces of voice sample data, the voice sample data including labelled sample data and unlabelled sample data, generating a training set from the labelled sample data and a verification set from the unlabelled sample data; inputting the voice sample data in the training set into a preset voiceprint recognition model for training to obtain an initial voiceprint recognition model; inputting the voice sample data in the verification set into the initial voiceprint recognition model for training and verification; and stopping training when the number of voice sample data in the verification set that meet a preset matching-degree value reaches a target threshold, thereby obtaining the trained voiceprint recognition model.
Before acquiring the preset voiceprint recognition model, the server needs to construct it in advance. Specifically, the server may first acquire a large amount of voice sample data, which includes labelled sample data (voice sample data whose voiceprint identifiers have been annotated in advance) and unlabelled sample data. The server divides the voice sample data into a training set and a verification set: it generates the training set from the labelled sample data, so the training set contains annotated voice sample data, and generates the verification set from the unlabelled sample data. The server inputs the voice sample data in the training set into the preset voiceprint recognition model for training, obtaining the initial voiceprint recognition model, and then inputs the voice sample data in the verification set into the initial model for continuous training and verification. Training stops when the number of voice sample data in the verification set that meet the preset matching-degree value reaches the preset threshold, yielding the trained voiceprint recognition model. The server further inputs the users' voiceprints from the voiceprint model library into the trained model, so that a voiceprint recognition model with high recognition accuracy can be constructed effectively.
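The training-and-verification loop above can be sketched schematically. In the fragment below, a toy "voiceprint model" (one centroid per labelled speaker) is built from the labelled training set, and training stops once the number of verification samples reaching a matching-degree threshold hits the target. The centroid model, the distance-based matching score, and all threshold values are assumptions for illustration, since the patent does not specify the model's architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: labelled samples (feature vector, speaker id) form the training set;
# unlabelled samples form the verification set, as described above.
train_set = [(rng.normal(spk, 0.1, 8), spk) for spk in (0, 1, 2) for _ in range(20)]
verify_set = [rng.normal(int(rng.integers(0, 3)), 0.1, 8) for _ in range(30)]

def train_initial_model(samples):
    """'Initial voiceprint recognition model': one centroid per labelled speaker."""
    grouped = {}
    for vec, spk in samples:
        grouped.setdefault(spk, []).append(vec)
    return {spk: np.mean(vecs, axis=0) for spk, vecs in grouped.items()}

def matching_degree(model, vec):
    """Matching degree of a sample against its best-matching voiceprint."""
    return max(1.0 / (1.0 + np.linalg.norm(centroid - vec))
               for centroid in model.values())

MATCH_THRESHOLD = 0.7    # preset matching-degree value (assumed)
TARGET_COUNT = 25        # target threshold on matching verification samples

model = train_initial_model(train_set)
for round_no in range(50):                 # continuous training and verification
    matched = sum(matching_degree(model, v) >= MATCH_THRESHOLD for v in verify_set)
    if matched >= TARGET_COUNT:            # stop criterion from the text
        print(f"stopping after round {round_no}: {matched}/{len(verify_set)} matched")
        break
    # A real model would update its parameters here; this static centroid model
    # only illustrates the data split and the stopping criterion.
```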
In one embodiment, as shown in Fig. 3, the step of performing voiceprint recognition on the voice feature information through the voiceprint recognition model specifically includes the following:
Step 302: compute the feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model.
Step 304: split the voice feature information according to the feature parameter sequences to obtain multiple voice segments.
Step 306: compute the similarity of the feature parameter sequences of the multiple voice segments, and group the voice segments whose similarity reaches a preset threshold.
Step 308: perform matching in the voiceprint model library according to the grouped feature parameter sequences, and add a matching voiceprint identifier to each grouped voice segment.
After performing feature extraction on the voice signal data to obtain multiple pieces of voice feature information, the server acquires the preset voiceprint recognition model, which may be a model the server trained in advance on a large amount of voice sample data. The server inputs the multiple pieces of voice feature information into the voiceprint recognition model, which performs voiceprint recognition on them. Specifically, the server computes the feature parameter sequence in each piece of voice feature information through the model; each piece of voice feature information may include voice segments with different feature parameter sequences. The server splits the voice feature information into multiple voice segments according to the different feature parameter sequences and groups the segments whose feature parameter sequences are the same; for example, multiple voice segments with the same feature parameter sequence can be grouped into the same class.
The server then performs matching in the voiceprint model library according to the grouped feature parameter sequences; the library pre-stores the voiceprint features and corresponding voiceprint identifiers of multiple speakers. By matching the feature parameter sequences of the voice segments against the voiceprint features in the library and extracting the voiceprint identifier of the feature parameter sequence with the highest matching degree, the server can effectively match the voiceprint identifier of each voice segment; it then adds the recognized voiceprint identifier to each segment, thereby effectively identifying the voice segment information corresponding to each speaker in the voice data.
By computing the feature parameter sequence of each voice feature segment through the voiceprint recognition model, comparing it against the feature parameter sequences in the preset voiceprint model library, extracting the voiceprint identifier of the user whose feature parameter sequence has the highest matching degree, and adding that identifier to the corresponding voice feature segment, each speaker in the meeting and the corresponding speech can be identified accurately and effectively.
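A minimal sketch of this grouping-and-matching logic follows: segments with similar feature vectors are grouped, each group is matched against a pre-registered voiceprint library, and every segment in the group is tagged with the best-matching voiceprint identifier. The vectors, the distance-based similarity, and the threshold are illustrative assumptions; real feature parameter sequences would come from the voiceprint recognition model.

```python
import numpy as np

rng = np.random.default_rng(2)

# Pre-registered voiceprint model library: voiceprint id -> reference feature.
voiceprint_lib = {"spk_A": np.zeros(8), "spk_B": np.full(8, 2.0)}

def similarity(a, b):
    """Distance-based similarity in (0, 1]; 1 means identical features."""
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def label_segments(segment_features, lib, group_threshold=0.6):
    """Group similar segments, match each group against the voiceprint library,
    and tag every segment in the group with the best-matching voiceprint id."""
    groups = []                                  # each entry: [representative, members]
    for seg in segment_features:
        for rep, members in groups:
            if similarity(rep, seg) >= group_threshold:
                members.append(seg)
                break
        else:
            groups.append([seg, [seg]])
    labelled = []
    for rep, members in groups:
        centroid = np.mean(members, axis=0)
        best_id = max(lib, key=lambda vp_id: similarity(lib[vp_id], centroid))
        labelled += [(best_id, m) for m in members]
    return labelled

segments = [rng.normal(0, 0.1, 8) for _ in range(3)] + [rng.normal(2, 0.1, 8) for _ in range(2)]
print([vp_id for vp_id, _ in label_segments(segments, voiceprint_lib)])
# -> ['spk_A', 'spk_A', 'spk_A', 'spk_B', 'spk_B']
```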
After converting the multiple voice segments into corresponding text information, the server further acquires the preset semantic analysis model, performs contextual semantic analysis on the text information through it, and obtains an analysis result including keywords and correction information. The server corrects the text information according to the correction information, adds the corresponding keywords to the corrected text, and then generates the corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information. The corresponding meeting report data can thus be generated effectively, which in turn effectively improves the processing efficiency and recognition accuracy of conference voice data.
In one embodiment, as shown in Fig. 4, the step of performing contextual semantic analysis on the text information through the semantic analysis model specifically includes the following:
Step 402: perform context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts.
Step 404: identify the keywords in the text information according to the semantics and word frequencies of the multiple texts.
Step 406: analyze the text to be corrected and the corresponding text positions in the text information according to the semantics of the multiple texts and the keywords, and determine the correction text corresponding to the text to be corrected according to the semantics of the multiple texts.
Step 408: generate the correction information from the text to be corrected, the correction text, and the corresponding text positions, and generate the analysis result of the text information using the keywords and the correction information.
After receiving the voice data uploaded by the terminal, the server performs feature extraction on it to obtain multiple pieces of voice feature information, acquires the preset voiceprint recognition model, and performs voiceprint recognition on the voice feature information through that model, thereby effectively obtaining multiple voice segments and recognizing the corresponding voiceprint identifiers; it then converts the multiple voice segments into corresponding text information according to the voiceprint identifiers.
After obtaining the text information corresponding to the voice data, the server further acquires the preset semantic analysis model and performs contextual semantic analysis on the text information through it. Specifically, the semantic analysis model may be a neural-network-based machine learning model. The server performs context analysis on the text information through the model, analyzes the semantics of multiple texts, identifies the texts that appear with high frequency, and obtains the analyzed keywords from the semantics and word frequencies of the multiple texts. Here, the word frequency may be the frequency with which a certain word or phrase appears in the text information.
Through the semantic analysis model, the server further identifies ambiguous or unclear text based on the semantics of the multiple texts and the keywords, and from this derives the text to be corrected and the corresponding text positions, determining the correction text for the text to be corrected according to the semantics of the multiple texts. The server generates the correction information from the text to be corrected, the correction text, and the corresponding text positions. It then corrects the text information using the keywords and correction information, adds the corresponding keywords to the corrected text, and generates the corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information, so that the meeting report data can be generated effectively and the processing efficiency and recognition accuracy of conference voice data can be improved.
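The word-frequency half of this keyword analysis is easy to sketch. The fragment below ranks high-frequency tokens as keywords with a Counter; the stop-word list and top-k cutoff are assumptions, and the contextual-semantics half of the model (disambiguation and correction proposals) is not reproduced here.

```python
from collections import Counter

def extract_keywords(tokens, stopwords, top_k=5):
    """Pick high-frequency tokens as keywords; the semantic analysis model
    combines semantics with word frequency, but only frequency is shown here."""
    counts = Counter(t for t in tokens if t not in stopwords and len(t) > 1)
    return [word for word, _ in counts.most_common(top_k)]

text = ("the quarterly revenue target was revised and the revenue forecast "
        "now assumes the revenue growth continues next quarter").split()
print(extract_keywords(text, stopwords={"the", "and", "was", "now", "next"}))
# -> ['revenue', 'quarterly', 'target', 'revised', 'forecast']
```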
In one embodiment, the step of correcting the text information according to the correction information includes: determining the position to be corrected in the text information according to the correction information; and replacing the text to be corrected at that position with the correction text in the correction information and deleting the text to be corrected, thereby obtaining the corrected text information.
After converting the voice data into corresponding text information, the server further acquires the preset semantic analysis model and performs contextual semantic analysis on the text information through it. Specifically, the semantic analysis model may be a neural-network-based machine learning model. The server performs context analysis on the text information through the model, analyzes the semantics of multiple texts, identifies the texts that appear with high frequency, and obtains the analyzed keywords from the semantics and word frequencies of the multiple texts.
Through the semantic analysis model, the server further identifies ambiguous or unclear text based on the semantics of the multiple texts and the keywords, and from this derives the correction information in the text information; the correction information may include the text to be corrected, the correction text, and the corresponding text positions. The server then corrects the text information using the keywords and correction information. Specifically, the server determines the position to be corrected in the text information according to the correction information, replaces the text to be corrected at that position with the correction text in the correction information, and deletes the text to be corrected, thereby changing the text to be corrected into the correction text and obtaining the corrected text information.
The server also adds the corresponding keywords to the corrected text information; specifically, it adds each recognized keyword to the text region it corresponds to. The server then generates the corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information, so that the meeting report data can be generated effectively. By analyzing the semantics and word frequencies of multiple texts through the semantic analysis model to derive the keywords and correction information, and adjusting and correcting the text information accordingly, the recognition accuracy of conference voice data can be improved effectively.
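The replace-and-delete correction step amounts to splicing the correction text over the span occupied by the text to be corrected. A minimal sketch, using a hypothetical correction record with the fields named in the text:

```python
def apply_correction(text, correction):
    """Replace the text to be corrected at the recorded position with the
    correction text, i.e. delete the old span and splice in the new one."""
    pos = correction["position"]
    old, new = correction["to_correct"], correction["corrected"]
    assert text[pos:pos + len(old)] == old, "correction no longer matches the text"
    return text[:pos] + new + text[pos + len(old):]

info = {"position": 24, "to_correct": "there", "corrected": "their"}
print(apply_correction("The attendees confirmed there budget estimates.", info))
# -> "The attendees confirmed their budget estimates."
```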
In one embodiment, the method further includes: receiving a query request sent by the terminal, the query request carrying a keyword; obtaining the meeting text content associated with the keyword according to the keyword; and sending the text content to the terminal in a preset manner for display.
After generating the corresponding meeting report data from the voice data of the meeting, the server stores it as report data. Users can query the corresponding meeting text content in the meeting report data by keyword. Specifically, a user can send a query request to the server through the corresponding user terminal; the query request carries the meeting subject and a keyword. The keyword may also include a user identifier, which may be consistent with the voiceprint identifier. After receiving the query request sent by the user terminal, the server obtains, from the database, the meeting text content in the meeting report data associated with the keyword according to the meeting subject and the keyword, and sends the meeting text content to the user terminal in a preset manner, for example with the text content highlighted. In this way users can quickly and conveniently find the meeting content they need.
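A sketch of the keyword query path: filter the stored report lines on the keyword and return them with the keyword highlighted. The ** markers stand in for whatever highlighting the terminal actually applies, and the report's data layout is an assumption.

```python
def query_report(report_lines, keyword):
    """Return the meeting text content associated with a keyword,
    with every occurrence of the keyword highlighted."""
    hits = [line for line in report_lines if keyword in line]
    return [line.replace(keyword, f"**{keyword}**") for line in hits]

report = [
    "spk_A: the budget review is scheduled for Friday",
    "spk_B: marketing spend stays flat",
    "spk_A: the budget cap was approved",
]
for line in query_report(report, "budget"):
    print(line)
```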
It should be understood that although the steps in the flowcharts of Figs. 2-4 are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in Figs. 2-4 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and whose execution order is not necessarily sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In one embodiment, as shown in Fig. 5, a conference voice data processing apparatus is provided, including: a data acquisition module 502, a feature extraction module 504, a voiceprint recognition module 506, a semantic analysis module 508, a text correction module 510, and a meeting report generation module 512, where:
the data acquisition module 502 is configured to receive a meeting record request sent by the terminal, send a recording instruction to the terminal according to the meeting record request, the meeting record request carrying the meeting subject, and acquire the voice data uploaded by the terminal according to the recording instruction;
the feature extraction module 504 is configured to perform feature extraction on the voice data to obtain multiple pieces of voice feature information;
the voiceprint recognition module 506 is configured to input the multiple pieces of voice feature information into the trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and convert the multiple voice segments into corresponding text information according to the voiceprint identifiers;
the semantic analysis module 508 is configured to input the text information into the trained semantic analysis model, analyze the keywords and correction information in the text information, and generate an analysis result using the keywords and correction information;
the text correction module 510 is configured to correct the text information according to the correction information in the analysis result and add the corresponding keywords to the corrected text information;
the meeting report generation module 512 is configured to generate the corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
In one embodiment, the feature extraction module 504 is further configured to obtain the voice signal of the voice data, frame and window the voice signal, and extract the corresponding acoustic and spectral features; convert the acoustic and spectral features to obtain the corresponding acoustic and spectral feature vectors; input the acoustic and spectral feature vectors into the trained voice endpoint detection model and detect the multiple starting points and ending points of the voice signal through it; and split the voice data into multiple pieces of voice feature information according to the multiple starting points and ending points of the voice signal.
In one embodiment, the apparatus further includes a voiceprint recognition model training module configured to acquire multiple pieces of voice sample data, the voice sample data including labelled and unlabelled sample data, generate a training set from the labelled sample data and a verification set from the unlabelled sample data; input the voice sample data in the training set into the preset voiceprint recognition model for training to obtain the initial voiceprint recognition model; input the voice sample data in the verification set into the initial voiceprint recognition model for training and verification; and stop training when the number of voice sample data in the verification set that meet the preset matching-degree value reaches the target threshold, obtaining the trained voiceprint recognition model.
In one embodiment, the voiceprint recognition module 506 is further configured to compute the feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model; split the voice feature information according to the feature parameter sequences to obtain multiple voice segments; compute the similarity of the feature parameter sequences of the multiple voice segments and group the segments whose similarity reaches the preset threshold; and perform matching in the voiceprint model library according to the grouped feature parameter sequences and add the matching voiceprint identifier to each grouped voice segment.
In one embodiment, the semantic analysis module 508 is further configured to perform context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts; identify the keywords in the text information according to the semantics and word frequencies of the multiple texts; analyze the text to be corrected and the corresponding text positions in the text information according to the semantics of the multiple texts and the keywords, and determine the correction text corresponding to the text to be corrected according to the semantics of the multiple texts; and generate the correction information from the text to be corrected, the correction text, and the corresponding text positions, and generate the analysis result of the text information using the keywords and correction information.
In one embodiment, the text correction module 510 is further configured to determine the position to be corrected in the text information according to the correction information; and replace the text to be corrected at that position with the correction text in the correction information and delete the text to be corrected, obtaining the corrected text information.
In one embodiment, the apparatus further includes a query module configured to receive a query request sent by the terminal, the query request carrying a meeting subject and a keyword; obtain the meeting text content associated with the keyword according to the meeting subject and keyword; and send the text content to the terminal in a preset manner for display in the preset manner.
For the specific limitations of the conference voice data processing apparatus, see the limitations of the conference voice data processing method above; they are not repeated here. Each module in the above conference voice data processing apparatus can be implemented in whole or in part by software, hardware, or a combination of the two. The modules can be embedded in hardware form in, or be independent of, the processor of the computer device, or be stored in software form in the memory of the computer device, so that the processor can invoke them and execute the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in Fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device stores data such as voice data, voice feature information, the voiceprint model library, text information, and meeting report data. The network interface of the computer device communicates with external terminals through a network connection. When executed by the processor, the computer-readable instructions implement the steps of the conference voice data processing method provided in any embodiment of this application.
A person skilled in the art can understand that the structure shown in Fig. 6 is only a block diagram of part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
receiving a meeting record request sent by a terminal, and sending a recording instruction to the terminal according to the meeting record request, the meeting record request carrying a meeting subject;
acquiring voice data uploaded by the terminal according to the recording instruction;
performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result using the keywords and correction information;
correcting the text information according to the correction information in the analysis result, and adding corresponding keywords to the corrected text information; and
generating corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a meeting record request sent by a terminal, and sending a recording instruction to the terminal according to the meeting record request, the meeting record request carrying a meeting subject;
acquiring voice data uploaded by the terminal according to the recording instruction;
performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result using the keywords and correction information;
correcting the text information according to the correction information in the analysis result, and adding corresponding keywords to the corrected text information; and
generating corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by computer-readable instructions instructing the relevant hardware; the computer-readable instructions can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their descriptions are relatively specific and detailed, but they should not therefore be understood as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (20)

  1. A conference voice data processing method, the method comprising:
    receiving a meeting record request sent by a terminal, and sending a recording instruction to the terminal according to the meeting record request, the meeting record request carrying a meeting subject;
    acquiring voice data uploaded by the terminal according to the recording instruction;
    performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
    inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
    inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result using the keywords and correction information;
    correcting the text information according to the correction information in the analysis result, and adding corresponding keywords to the corrected text information; and
    generating corresponding meeting report data in a preset manner according to the meeting subject and the keyword-annotated text information.
  2. The method according to claim 1, wherein the step of performing feature extraction on the voice data to obtain multiple pieces of voice feature information comprises:
    acquiring a voice signal of the voice data, performing framing and windowing on the voice signal, and extracting corresponding acoustic features and spectral features;
    converting the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors;
    inputting the acoustic feature vectors and spectral feature vectors into a trained voice endpoint detection model, and detecting multiple start points and end points of the voice signal through the voice endpoint detection model; and
    segmenting the voice data into multiple pieces of voice feature information according to the multiple start points and end points of the voice signal.
  3. The method according to claim 1, wherein the method further comprises:
    acquiring multiple pieces of voice sample data, the voice sample data comprising labeled sample data and unlabeled sample data, generating a training set from the labeled sample data and a validation set from the unlabeled sample data;
    inputting the voice sample data in the training set into a preset voiceprint recognition model for training to obtain an initial voiceprint recognition model;
    inputting the voice sample data in the validation set into the initial voiceprint recognition model for training and validation; and
    stopping training when the number of voice sample data in the validation set that satisfy a preset match score reaches a target threshold, to obtain the trained voiceprint recognition model.
  4. The method according to claim 1, wherein the step of inputting the multiple pieces of voice feature information into the trained voiceprint recognition model for voiceprint recognition comprises:
    calculating feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model;
    segmenting the voice feature information according to the feature parameter sequences to obtain multiple voice segments;
    calculating the similarity of the feature parameter sequences of the multiple voice segments, and grouping the voice segments whose similarity reaches a preset threshold; and
    matching the grouped feature parameter sequences against a voiceprint model library, and adding the matching voiceprint identifier to each grouped voice segment.
  5. The method according to claim 1, wherein the step of inputting the text information into the trained semantic analysis model and analyzing keywords and correction information in the text information comprises:
    performing context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts;
    identifying keywords in the text information according to the semantics and word frequencies of the multiple texts;
    determining, according to the semantics of the multiple texts and the keywords, the text to be corrected and its corresponding position in the text information, and determining the correction text corresponding to the text to be corrected according to the semantics of the multiple texts; and
    generating correction information from the text to be corrected, the correction text, and the corresponding text position, and generating the analysis result of the text information from the keywords and the correction information.
  6. The method according to claim 5, wherein the step of correcting the text information according to the correction information comprises:
    determining the position to be corrected in the text information according to the correction information; and
    replacing the text to be corrected at that position with the correction text in the correction information, and deleting the text to be corrected, to obtain the corrected text information.
  7. The method according to any one of claims 1 to 6, wherein the method further comprises:
    receiving a query request sent by a terminal, the query request carrying a meeting topic and a keyword;
    acquiring meeting text content associated with the keyword according to the meeting topic and the keyword; and
    sending the text content to the terminal in a preset manner for display in a preset manner.
  8. A meeting voice data processing apparatus, the apparatus comprising:
    a request receiving module configured to receive a meeting minutes request sent by a terminal and send a recording instruction to the terminal according to the meeting minutes request, the meeting minutes request carrying a meeting topic;
    a data acquisition module configured to acquire voice data uploaded by the terminal according to the recording instruction;
    a feature extraction module configured to perform feature extraction on the voice data to obtain multiple pieces of voice feature information;
    a voiceprint recognition module configured to input the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and convert the multiple voice segments into corresponding text information according to the voiceprint identifiers;
    a semantic analysis module configured to input the text information into a trained semantic analysis model, analyze keywords and correction information in the text information, and generate an analysis result from the keywords and the correction information;
    a text correction module configured to correct the text information according to the correction information in the analysis result and add the corresponding keywords to the corrected text information; and
    a meeting report generation module configured to generate corresponding meeting report data in a preset manner according to the meeting topic and the keyword-tagged text information.
  9. The apparatus according to claim 8, wherein the feature extraction module is further configured to acquire a voice signal of the voice data, perform framing and windowing on the voice signal, and extract corresponding acoustic features and spectral features; convert the acoustic features and spectral features to obtain corresponding acoustic feature vectors and spectral feature vectors; input the acoustic feature vectors and spectral feature vectors into a trained voice endpoint detection model, and detect multiple start points and end points of the voice signal through the voice endpoint detection model; and segment the voice data into multiple pieces of voice feature information according to the multiple start points and end points of the voice signal.
  10. The apparatus according to claim 8, wherein the apparatus further comprises a voiceprint recognition model training module configured to acquire multiple pieces of voice sample data, the voice sample data comprising labeled sample data and unlabeled sample data, and generate a training set from the labeled sample data and a validation set from the unlabeled sample data; input the voice sample data in the training set into a preset voiceprint recognition model for training to obtain an initial voiceprint recognition model; input the voice sample data in the validation set into the initial voiceprint recognition model for training and validation; and stop training when the number of voice sample data in the validation set that satisfy a preset match score reaches a target threshold, to obtain the trained voiceprint recognition model.
  11. The apparatus according to claim 8, wherein the voiceprint recognition module is further configured to calculate feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model; segment the voice feature information according to the feature parameter sequences to obtain multiple voice segments; calculate the similarity of the feature parameter sequences of the multiple voice segments, and group the voice segments whose similarity reaches a preset threshold; and match the grouped feature parameter sequences against a voiceprint model library, and add the matching voiceprint identifier to each grouped voice segment.
  12. The apparatus according to claim 8, wherein the semantic analysis module is further configured to perform context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts; identify keywords in the text information according to the semantics and word frequencies of the multiple texts; determine, according to the semantics of the multiple texts and the keywords, the text to be corrected and its corresponding position in the text information, and determine the correction text corresponding to the text to be corrected according to the semantics of the multiple texts; and generate correction information from the text to be corrected, the correction text, and the corresponding text position, and generate the analysis result of the text information from the keywords and the correction information.
  13. The apparatus according to claim 12, wherein the text correction module is further configured to determine the position to be corrected in the text information according to the correction information; and replace the text to be corrected at that position with the correction text in the correction information, and delete the text to be corrected, to obtain the corrected text information.
  14. The apparatus according to any one of claims 8 to 13, wherein the apparatus further comprises a query module configured to receive a query request sent by a terminal, the query request carrying a meeting topic and a keyword; acquire meeting text content associated with the keyword according to the meeting topic and the keyword; and send the text content to the terminal in a preset manner for display in a preset manner.
  15. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    receiving a meeting minutes request sent by a terminal and sending a recording instruction to the terminal according to the meeting minutes request, the meeting minutes request carrying a meeting topic;
    acquiring voice data uploaded by the terminal according to the recording instruction;
    performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
    inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
    inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result from the keywords and the correction information;
    correcting the text information according to the correction information in the analysis result, and adding the corresponding keywords to the corrected text information; and
    generating corresponding meeting report data in a preset manner according to the meeting topic and the keyword-tagged text information.
  16. The computer device according to claim 15, wherein the processor, when executing the computer-readable instructions, further performs the following steps: calculating feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model; segmenting the voice feature information according to the feature parameter sequences to obtain multiple voice segments; calculating the similarity of the feature parameter sequences of the multiple voice segments, and grouping the voice segments whose similarity reaches a preset threshold; and matching the grouped feature parameter sequences against a voiceprint model library, and adding the matching voiceprint identifier to each grouped voice segment.
  17. The computer device according to claim 15, wherein the processor, when executing the computer-readable instructions, further performs the following steps: performing context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts; identifying keywords in the text information according to the semantics and word frequencies of the multiple texts; determining, according to the semantics of the multiple texts and the keywords, the text to be corrected and its corresponding position in the text information, and determining the correction text corresponding to the text to be corrected according to the semantics of the multiple texts; and generating correction information from the text to be corrected, the correction text, and the corresponding text position, and generating the analysis result of the text information from the keywords and the correction information.
  18. One or more non-volatile computer-readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving a meeting minutes request sent by a terminal and sending a recording instruction to the terminal according to the meeting minutes request, the meeting minutes request carrying a meeting topic;
    acquiring voice data uploaded by the terminal according to the recording instruction;
    performing feature extraction on the voice data to obtain multiple pieces of voice feature information;
    inputting the multiple pieces of voice feature information into a trained voiceprint recognition model for voiceprint recognition to obtain multiple voice segments and corresponding voiceprint identifiers, and converting the multiple voice segments into corresponding text information according to the voiceprint identifiers;
    inputting the text information into a trained semantic analysis model, analyzing keywords and correction information in the text information, and generating an analysis result from the keywords and the correction information;
    correcting the text information according to the correction information in the analysis result, and adding the corresponding keywords to the corrected text information; and
    generating corresponding meeting report data in a preset manner according to the meeting topic and the keyword-tagged text information.
  19. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further perform the following steps: calculating feature parameter sequences in the multiple pieces of voice feature information through the voiceprint recognition model; segmenting the voice feature information according to the feature parameter sequences to obtain multiple voice segments; calculating the similarity of the feature parameter sequences of the multiple voice segments, and grouping the voice segments whose similarity reaches a preset threshold; and matching the grouped feature parameter sequences against a voiceprint model library, and adding the matching voiceprint identifier to each grouped voice segment.
  20. The storage medium according to claim 18, wherein the computer-readable instructions, when executed by the processor, further perform the following steps: performing context analysis on the text information through the semantic analysis model to obtain the semantics and word frequencies of multiple texts; identifying keywords in the text information according to the semantics and word frequencies of the multiple texts; determining, according to the semantics of the multiple texts and the keywords, the text to be corrected and its corresponding position in the text information, and determining the correction text corresponding to the text to be corrected according to the semantics of the multiple texts; and generating correction information from the text to be corrected, the correction text, and the corresponding text position, and generating the analysis result of the text information from the keywords and the correction information.
PCT/CN2019/118267 2019-06-05 2019-11-14 Meeting voice data processing method and apparatus, computer device, and storage medium WO2020244153A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910494580.7 2019-06-05
CN201910494580.7A CN110322872A (zh) Meeting voice data processing method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2020244153A1 (zh)

Family

ID=68121008

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118267 WO2020244153A1 (zh) Meeting voice data processing method and apparatus, computer device, and storage medium 2019-06-05 2019-11-14

Country Status (2)

Country Link
CN (1) CN110322872A (zh)
WO (1) WO2020244153A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113299291A (zh) * 2021-05-18 2021-08-24 北京明略昭辉科技有限公司 Keyword-based recording storage method and apparatus, device, and storage medium

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322872A (zh) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Meeting voice data processing method and apparatus, computer device, and storage medium
CN110889266A (zh) * 2019-11-21 2020-03-17 北京明略软件系统有限公司 Meeting minutes consolidation method and apparatus
CN111261155A (zh) * 2019-12-27 2020-06-09 北京得意音通技术有限责任公司 Voice processing method, computer-readable storage medium, computer program, and electronic device
CN111708912A (zh) * 2020-05-06 2020-09-25 深圳震有科技股份有限公司 Video conference record query processing method and apparatus
CN111626061A (zh) * 2020-05-27 2020-09-04 深圳前海微众银行股份有限公司 Meeting minutes generation method, apparatus, device, and readable storage medium
CN111666403B (zh) * 2020-06-18 2024-02-09 中国银行股份有限公司 Meeting minutes processing method and apparatus, and meeting minutes processing device
CN112017655B (zh) * 2020-07-25 2024-06-14 云开智能(深圳)有限公司 Intelligent voice recording and playback method and system
CN111883128B (zh) * 2020-07-31 2024-08-13 中国工商银行股份有限公司 Voice processing method and system, and voice processing apparatus
CN112036820B (zh) * 2020-08-24 2023-06-23 北京鸿联九五信息产业有限公司 Enterprise-internal information feedback processing method, system, storage medium, and device
CN112183107B (zh) * 2020-09-04 2024-08-20 华为技术有限公司 Audio processing method and apparatus
CN112287691B (zh) * 2020-11-10 2024-02-13 深圳市天彦通信股份有限公司 Meeting recording method and related device
CN112651240A (zh) * 2020-12-30 2021-04-13 广东电力信息科技有限公司 Business meeting information processing system and method, electronic device, and storage medium
CN113327619B (zh) * 2021-02-26 2022-11-04 山东大学 Meeting minutes method and system based on a cloud-edge collaborative architecture
JP2024514260A (ja) * 2021-04-07 2024-03-29 ネイバー コーポレーション Method and system for providing a voice record generated based on information obtained after voice recording
CN113129895B (zh) * 2021-04-20 2022-12-30 上海仙剑文化传媒股份有限公司 Voice detection processing system
CN115472159A (zh) * 2021-06-11 2022-12-13 海信集团控股股份有限公司 Voice processing method, apparatus, device, and medium
CN113722425B (zh) * 2021-07-23 2024-08-27 阿里巴巴达摩院(杭州)科技有限公司 Data processing method, computer device, and computer-readable storage medium
CN113611308B (zh) * 2021-09-08 2024-05-07 杭州海康威视数字技术股份有限公司 Voice recognition method, apparatus, system, server, and storage medium
CN115623134A (zh) * 2022-10-08 2023-01-17 中国电信股份有限公司 Meeting audio processing method, apparatus, device, and storage medium
CN115512692B (zh) * 2022-11-04 2023-02-28 腾讯科技(深圳)有限公司 Voice recognition method, apparatus, device, and storage medium
CN117456984B (zh) * 2023-10-26 2024-08-06 杭州捷途慧声科技有限公司 Voice interaction method and system based on voiceprint recognition

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1264887A (zh) * 2000-03-31 2000-08-30 清华大学 Speaker-independent speech recognition and voice prompting method based on a dedicated speech-recognition chip
CN104252864A (zh) * 2013-06-28 2014-12-31 国际商业机器公司 Real-time speech analysis method and system
CN109388701A (zh) * 2018-08-17 2019-02-26 深圳壹账通智能科技有限公司 Meeting minutes generation method, apparatus, device, and computer storage medium
CN110322872A (zh) * 2019-06-05 2019-10-11 平安科技(深圳)有限公司 Meeting voice data processing method and apparatus, computer device, and storage medium

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005321530A (ja) * 2004-05-07 2005-11-17 Sony Corp Utterance identification apparatus and utterance identification method
CN103258535A (zh) * 2013-05-30 2013-08-21 中国人民财产保险股份有限公司 Identity recognition method and system based on voiceprint recognition
CN106056996B (zh) * 2016-08-23 2017-08-29 深圳市鹰硕技术有限公司 Multimedia interactive teaching system and method
CN109145148A (zh) * 2017-06-28 2019-01-04 百度在线网络技术(北京)有限公司 Information processing method and apparatus
CN108022583A (zh) * 2017-11-17 2018-05-11 平安科技(深圳)有限公司 Meeting minutes generation method, application server, and computer-readable storage medium
CN108132995A (zh) * 2017-12-20 2018-06-08 北京百度网讯科技有限公司 Method and apparatus for processing audio information
CN108198547B (zh) * 2018-01-18 2020-10-23 深圳市北科瑞声科技股份有限公司 Voice endpoint detection method and apparatus, computer device, and storage medium
CN108182945A (zh) * 2018-03-12 2018-06-19 广州势必可赢网络科技有限公司 Multi-speaker voice separation method and apparatus based on voiceprint features
CN108986826A (zh) * 2018-08-14 2018-12-11 中国平安人寿保险股份有限公司 Method for automatically generating meeting minutes, electronic apparatus, and readable storage medium


Also Published As

Publication number Publication date
CN110322872A (zh) 2019-10-11


Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19932147

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19932147

Country of ref document: EP

Kind code of ref document: A1