CN113707150A - System and method for processing video conference by voice recognition - Google Patents

System and method for processing video conference by voice recognition

Info

Publication number
CN113707150A
Authority
CN
China
Prior art keywords
conference
voice
character
important
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111284405.9A
Other languages
Chinese (zh)
Inventor
安佳兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yunji Intelligent Information Co ltd
Original Assignee
Shenzhen Yunji Intelligent Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yunji Intelligent Information Co ltd filed Critical Shenzhen Yunji Intelligent Information Co ltd
Priority to CN202111284405.9A
Publication of CN113707150A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/14 - Systems for two-way working
    • H04N7/15 - Conference systems

Abstract

The invention provides a system and a method for processing a video conference by voice recognition, applied in the field of communication technology. The method acquires conference character features, which comprise secondary character features and important character features; records the voices of the conference secondary characters and the conference important characters; recognizes those voices to obtain a statement data format for the recorded speech, lists statement data forms of the conference secondary characters and the conference important characters according to that format, and classifies the text formats of single statements and continuous statements according to the conference characters' statement data forms. It then converts the statement data features in the statement data forms of the conference secondary characters and the conference important characters, performs voice classification, and inputs the result to a preset conference screen to obtain voice data that has undergone voice recognition processing. By correcting repeated or invalid data that may arise during voice recognition translation and conversion, the invention effectively reduces translation errors in the course of a voice conference.

Description

System and method for processing video conference by voice recognition
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a system and a method for processing a video conference by voice recognition.
Background
With the development of communication technology, conference formats have gradually diversified; existing formats include not only the traditional single-point conference but also multipoint video conferences, multipoint voice conferences, and the like. A multipoint conference is a real-time conference established across different physical locations by means of audio and/or video communication, usually with different participants at each location.
The participants of the current video conference system can synchronously see and hear images and sounds of the participants in other conference places in real time, and can also send electronic documents in real time, thereby greatly reducing the conference cost and compressing the conference time.
However, in the prior art, the voice generated during a video conference is often subject to recognition interference from various unexpected situations, which hinders recording of the whole video conference; for example, the translated speech and translated captions accompanying the conference screen during the video conference may deviate from what was actually said.
In view of the above, the present invention provides a system and a method for processing a video conference by voice recognition to solve the problem of deviation of translated languages and translated words associated with a conference screen during the video conference.
Disclosure of Invention
The invention aims to solve the problem that the translation language and the translation characters associated with a conference screen are deviated in the video conference process, and provides a system and a method for processing a video conference by voice recognition.
The invention provides a voice recognition processing video conference system, comprising:
the acquisition module is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary people characteristics and important people characteristics; recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
the recognition module is used for recognizing the conference secondary character voice and the conference important character voice to obtain a statement data format for receiving and recording the voice, listing statement data forms of the conference secondary character and the conference important character according to the statement data format, and classifying text formats of single statements and continuous statements according to the conference character statement data forms; and converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing.
Further, the obtaining module further comprises:
the acquisition subunit is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary characteristics and important characteristics;
and the receiving and recording subunit is used for receiving and recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences.
Further, the identification module further comprises:
the recognition subunit is configured to recognize the voices of the secondary conference characters and the voices of the important conference characters to obtain a statement data format of the recorded voices, list statement data forms of the secondary conference characters and the important conference characters according to the statement data format, and classify text formats of single statements and continuous statements according to the statement data forms of the secondary conference characters and the important conference characters;
And the operator pushing unit is used for converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification and inputting the voice classification to a preset conference screen, and further obtaining voice data which is processed by voice recognition.
The invention also provides a method for processing the video conference by voice recognition, which comprises the following steps:
acquiring the characteristics of the conference characters, wherein the characteristics of the conference characters comprise secondary character characteristics and important character characteristics;
recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
recognizing the voices of the secondary characters of the conference and the voices of the important characters of the conference to obtain a statement data format for receiving and recording the voices, listing statement data forms of the secondary characters of the conference and the important characters of the conference according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary characters of the conference and the important characters of the conference;
converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing; the specific process of statement data format conversion is as follows: the system compares the statement data with modeled sound data set by the system, matches sound data with similarity, and performs voice classification to extract acoustic features of the sound data.
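As a minimal sketch of the matching in the last step above, assuming the modeled sound data is stored as labeled acoustic feature vectors and using cosine similarity with an illustrative 0.8 threshold (neither the similarity measure nor the threshold is specified in the disclosure):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two acoustic feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_modeled_sound(utterance_vec, modeled_sounds, threshold=0.8):
    """Compare statement data against the system's modeled sound data and
    return the best-matching label above the threshold, or None."""
    best_label, best_score = None, threshold
    for label, model_vec in modeled_sounds.items():  # label -> feature vector
        score = cosine_similarity(utterance_vec, model_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```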
Further, the step of obtaining the conference character features including the secondary character features and the important character features further comprises the following steps:
identifying all the persons participating in the conference according to a preset conference figure form;
acquiring the information of the persons participating in the conference, wherein the information of the persons participating in the conference comprises clothes of the persons, body shapes of the persons and faces of the persons;
and matching and classifying the conference people according to the conference people information, wherein the conference people are conference secondary people and conference important people.
Further, the step of receiving the voice of the secondary conference character and the voice of the important conference character further comprises:
adopting a preset sentence encoder to analyze and encode the recorded conference secondary character voice and conference important character voice to obtain sentence data vectors, wherein the sentence data vectors comprise a sentence text model, sentence text features and a sentence text type;
performing vector word segmentation on the sentence data vectors to obtain a plurality of vector word segments comprising sentence token strings and lexical token strings;
and listing the sentence token strings and the lexical token strings as a first text vector and a second text vector respectively.
Further, the step of recognizing the voices of the secondary character and the important character of the conference comprises the following steps:
generating, for the conference secondary character voice and the conference important character voice, the voice vector classes comprising voice frequency, voice filtering and voice decoding;
sampling the voice frequency of the recorded voice, with the sampling amount being half of the voice frequency, to obtain voice filtering of the completed sampling;
and after waveform data are obtained by sampling according to the voice filtering, extracting waveform data characteristic parameters, and performing parameter synthesis on the waveform data characteristic parameters through voice decoding preset by the system to obtain recognizable recorded voice.
Further, the step of classifying the text formats of the single sentence and the continuous sentence includes:
according to the preset limit on the amount of text content, the text content range of a sentence is limited to within fifty words or more than fifty words, expressed as n ≤ 50 or n > 50, where n denotes the word count of the sentence; checking the recorded statement data form, text content whose range is n ≤ 50 is selected as a single sentence, and text content whose range is n > 50 is selected as a continuous sentence.
Further, the step of classifying the single sentence includes:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the single statement;
obtaining a feature vector representation of mapping different potential factors in statement data to a single text through integration;
updating the connection times of the statement data feature vector and the single text feature vector on each potential factor by an iterative method;
and integrating the statement data feature vectors to obtain the single statement data after calculation.
Further, the step of classifying the continuous sentences comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the continuous statements;
obtaining feature vector representations of different potential factors mapped to a plurality of texts in statement data through integration;
updating the connection times of the sentence data feature vector and the text feature vectors on each potential factor by an iterative method;
and integrating the sum of the statement data feature vectors to obtain the continuous statement data after calculation.
The invention provides a system and a method for processing a video conference by voice recognition, which have the following beneficial effects:
the invention effectively reduces the problem of translation error in the voice conference process by correcting repeated or invalid data possibly occurring in the voice recognition translation conversion process.
Drawings
FIG. 1 is a block diagram of a voice recognition processing video conferencing system according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice recognition processing video conference method according to an embodiment of the present invention.
Detailed Description
The objects, features and advantages of the present invention will be further described with reference to the accompanying drawings; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by persons skilled in the art from the embodiments given herein without making any inventive step, are within the scope of the present invention.
Referring to fig. 1, a voice recognition processing video conference system in one embodiment of the present invention comprises:
the acquisition module is used for acquiring the conference character characteristics including secondary character characteristics and important character characteristics; recording the voice of the secondary characters of the conference and the voice of the important characters of the conference;
the recognition module is used for recognizing the conference secondary character voice and the conference important character voice to obtain a statement data format for receiving and recording the voice, listing statement data forms of the conference secondary character and the conference important character according to the statement data format, and classifying text formats of single statements and continuous statements according to the conference character statement data forms; and converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing.
In a specific embodiment: the acquisition module acquires the conference character features including secondary character features and important character features; recording the voice of the secondary characters of the conference and the voice of the important characters of the conference; the recognition module recognizes the conference secondary character voice and the conference important character voice to obtain a statement data format of the recorded voice, lists statement data forms of the conference secondary character and the conference important character according to the statement data format, and classifies text formats of single statements and continuous statements according to the conference character statement data forms; converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing;
wherein the character voice is specifically: the sound characteristics of the conference secondary characters and the conference important characters, including the different pitches and wavelengths produced by each character; the character sentences are specifically the timbre and loudness of each character's utterances in the conference;
the statement data format specifically includes: framing the voice signal of each person in the conference; framing is required because the voice signal changes rapidly, and in voice recognition the frame length is generally 20-50 ms, so that one frame contains enough signal periods while the signal does not change too violently within it; each frame of the voice signal is usually multiplied by a window function so that both ends of the frame attenuate smoothly to zero, which reduces distortion after voice conversion and yields a higher-quality spectrum; the time difference between adjacent frames is generally about 10 ms, so the frames overlap and the signal at the frame boundaries is not lost;
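A minimal sketch of this framing scheme, assuming a 25 ms frame length within the stated 20-50 ms range, a 10 ms frame shift, and a Hamming window as the smoothing function (the specific window is an assumption):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a speech signal into overlapping frames (20-50 ms long,
    ~10 ms apart) and apply a Hamming window so that both ends of each
    frame attenuate smoothly to zero."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len + shift - 1) // shift)
    pad = (n_frames - 1) * shift + frame_len - len(signal)
    if pad > 0:                                # pad so the last frame is full
        signal = np.pad(signal, (0, pad))
    window = np.hamming(frame_len)
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])
```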
the text format of the single sentence is specifically: short or instantaneous speech data generated by each person in the conference, for example filler words uttered by a character such as "uh", "okay", "yes" and "right";
the text format of the continuous sentence is specifically: long or extended speech data generated by each person in the conference, for example opening remarks such as "Ladies and gentlemen, welcome to attend the conference … …";
the voice data form specifically includes: all voice data generated by the conference secondary characters and the conference important characters over the whole conference, including voice data from before, during and after the conference process; the voice data from before, during and after the conference are re-integrated in chronological order to obtain the voice data form;
the speech classification specifically includes: extracting acoustic features of sound data in a system database, wherein the acoustic features comprise word sequences, position codes, phoneme sequences and phoneme features in sentences, and splicing the word sequences, the position codes, the phoneme sequences and the phoneme features to obtain word acoustic features;
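A minimal sketch of splicing the word acoustic features, assuming each of the four named streams is already available as a numeric array; the function name and array layout are illustrative:

```python
import numpy as np

def splice_word_acoustic_features(word_seq, position_codes,
                                  phoneme_seq, phoneme_features):
    """Splice the word sequence, position codes, phoneme sequence and
    phoneme features into one word-level acoustic feature vector."""
    parts = [np.asarray(p, dtype=float).ravel()
             for p in (word_seq, position_codes, phoneme_seq, phoneme_features)]
    return np.concatenate(parts)
```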
the specific process of sentence data feature conversion is as follows: acquire the text vectors and token strings of the voice data and merge them into a text token set; identify the character features of the character strings in the token set, construct character vectors corresponding to those character features, and re-integrate the character vector set of the token set; perform similar elimination on the re-integrated character vector set to obtain the voice data finished by voice classification, which specifically includes conversion between different languages, for example English converted into Chinese or Chinese converted into English;
the similar elimination of the character vector set after the reintegration is specifically to add the constructed words into the character strings in the token set, search whether the character strings in the token set have the same character strings or the vectors of the same character strings, and correspondingly delete and correct the same character strings or the vectors of the same character strings.
Referring to fig. 2, a method for processing a video conference by voice recognition according to an embodiment of the present invention comprises:
S1: acquiring the characteristics of the conference characters, wherein the characteristics of the conference characters comprise secondary character characteristics and important character characteristics;
S2: recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
S3: recognizing the voices of the secondary characters of the conference and the voices of the important characters of the conference to obtain a statement data format for receiving and recording the voices, listing statement data forms of the secondary characters of the conference and the important characters of the conference according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary characters of the conference and the important characters of the conference;
S4: converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing; the specific process of statement data format conversion is as follows: the system compares the statement data with modeled sound data set by the system, matches sound data with similarity, and performs voice classification to extract acoustic features of the sound data.
In a specific embodiment: acquiring the characteristics of the conference characters, recording the voices of the secondary conference characters and the voices of the important conference characters, identifying the voices of the secondary conference characters and the voices of the important conference characters to obtain a sentence data format of the recorded voices, listing sentence data forms of the secondary conference characters and the important conference characters according to the sentence data format, classifying text formats of single sentences and continuous sentences according to the sentence data forms of the conference characters, converting the sentence data characteristics in the sentence data forms of the secondary conference characters and the important conference characters to perform voice classification, inputting the converted sentence data characteristics to a preset conference screen, and further obtaining voice data after voice recognition processing.
In one embodiment: the step of acquiring the conference character features including the secondary character features and the important character features further comprises the following steps:
identifying all the persons participating in the conference according to a preset conference figure form;
acquiring the information of the persons participating in the conference, wherein the information of the persons participating in the conference comprises clothes of the persons, body shapes of the persons and faces of the persons;
and matching and classifying the conference people according to the conference people information, wherein the conference people are conference secondary people and conference important people.
In a specific embodiment: all the people participating in the conference are captured and matched against the preset conference participants according to their clothing, body shape and face, so as to distinguish secondary characters from important characters among the participants, and corresponding matching operations and classification are performed on the identified people:
a person not wearing conference clothing is discriminated as a secondary character, for example a cleaner doing sanitation work or a secretary organizing the conference process;
a person wearing conference clothing is discriminated as an important character, for example a manager wearing conference attire and a badge, or the CEO leading the conference process.
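A minimal sketch of this matching and classification, assuming the captured attributes are reduced to a face identifier and a clothing label, and that the preset roster is keyed by face identifier; all field names are hypothetical:

```python
def classify_participants(captured_people, preset_roster):
    """Match each captured person's attributes against a preset participant
    roster; roster matches wearing conference clothing are important
    characters, everyone else is a secondary character."""
    important, secondary = [], []
    for person in captured_people:
        entry = preset_roster.get(person["face_id"])  # None if not on roster
        if entry is not None and person["clothing"] == "conference":
            important.append(entry["name"])
        else:
            secondary.append(person.get("name", "unknown"))
    return important, secondary
```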
In one embodiment: the step of recording the voice of the secondary conference character and the voice of the important conference character further comprises the following steps:
adopting a preset sentence encoder to analyze and encode the recorded conference secondary character voice and conference important character voice to obtain sentence data vectors, wherein the sentence data vectors comprise a sentence text model, sentence text features and a sentence text type;
performing vector word segmentation on the sentence data vectors to obtain a plurality of vector word segments comprising sentence token strings and lexical token strings;
and listing the sentence token strings and the lexical token strings as a first text vector and a second text vector respectively.
In a specific embodiment: performing voice coding analysis on the conference figure voice by adopting a preset sentence coder to obtain a sentence data vector which is specifically a sentence text model, a sentence text characteristic and a sentence text type; carrying out vector word segmentation on the sentence text model, the sentence text characteristics and the sentence text type to obtain a plurality of vector word segments comprising token strings of sentences and token strings of lexical methods;
the voice coding analysis specifically compresses and decompresses the audio signal through a dual-rate speech coding algorithm operating at a bit rate of 5.3 kbps; because silence compression is adopted for discontinuous transmission, bandwidth can be conserved while repeated switching of the carrier signal on and off is avoided;
vector word segmentation specifically converts the token words in the sentence data into combined token strings, for example: the single characters "欢", "迎", "大", "家", "参", "加", "会" and "议" in the sentence data are combined into the sentence "欢迎大家参加会议" ("Welcome everyone to attend the conference").
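A minimal sketch of combining single-character tokens into a token string, using the example sentence above:

```python
def combine_tokens(tokens):
    """Merge single-character tokens from the statement data back into a
    combined token string (a sentence)."""
    return "".join(tokens)

print(combine_tokens(["欢", "迎", "大", "家", "参", "加", "会", "议"]))
# -> 欢迎大家参加会议 ("Welcome everyone to attend the conference")
```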
In one embodiment: the step of identifying the voice of the secondary conference character and the voice of the important conference character comprises the following steps:
generating, for the conference secondary character voice and the conference important character voice, the voice vector classes of voice frequency, voice filtering and voice decoding;
sampling the voice frequency of the recorded voice, with the sampling amount being half of the voice frequency, to obtain voice filtering of the completed sampling;
and after waveform data are obtained by sampling according to the voice filtering, extracting waveform data characteristic parameters, and performing parameter synthesis on the waveform data characteristic parameters through voice decoding preset by the system to obtain recognizable recorded voice.
In a specific embodiment: generating the recorded voice into a voice vector category comprising voice frequency, voice filtering and voice decoding; sampling the frequency of the recorded voice, and taking half of the voice frequency to obtain voice filtering after sampling; acquiring waveform data by collecting samples in the sampled voice filtering, extracting characteristic parameters of the waveform data, and performing parameter synthesis by combining voice decoding and the characteristic parameters of the waveform data to obtain recognizable recorded voice;
the sampling process of the voice frequency specifically collects frequencies with special wavelengths in the voice, for example: where the speaker's voice is raised, its frequency differs slightly from that of the other speech, and such a frequency is a special frequency;
extracting the characteristic parameters of the waveform data, specifically recording the special frequency voice in the voice generating process according to the time sequence, and taking the special frequency with the time sequence as the characteristic parameters;
the process of parameter synthesis by combining the system's preset voice decoding with the waveform data characteristic parameters is specifically that the parameters are changed once for the decoding of each frame of voice data, and for voiced frequency segments frame-synchronous synthesis is performed according to the chosen moments at which the control parameters change; frame-synchronous synthesis means changing the parameters on a frame-by-frame basis.
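A minimal sketch of the sampling and feature-parameter extraction described above, taking every other sample as a crude half-rate sampling and recording each frame's dominant frequency in time order; the FFT-based dominant-frequency choice is an assumption:

```python
import numpy as np

def extract_feature_parameters(signal, sample_rate, frame_ms=25.0):
    """Take every other sample (half the original frequency), then record
    each frame's dominant frequency in time order as the waveform-data
    characteristic parameters."""
    half = np.asarray(signal, dtype=float)[::2]   # crude half-rate sampling
    rate = sample_rate // 2
    frame_len = int(rate * frame_ms / 1000)
    params = []
    for start in range(0, len(half) - frame_len + 1, frame_len):
        frame = half[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / rate)
        params.append((start / rate, float(freqs[np.argmax(spectrum)])))
    return params  # [(time_s, dominant_frequency_hz), ...]
```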
In one embodiment: the step of classifying the text formats of the single sentence and the continuous sentence includes:
according to the preset limit on the amount of text content, the text content range of a sentence is limited to within fifty words or more than fifty words, expressed as n ≤ 50 or n > 50, where n denotes the word count of the sentence; checking the recorded statement data form, text content whose range is n ≤ 50 is selected as a single sentence, and text content whose range is n > 50 is selected as a continuous sentence.
In a specific embodiment: the text format contents of a single sentence and a continuous sentence are obtained respectively, and range limits are applied according to the preset text content limit; the text content limit for a single sentence is specifically n ≤ 50, and the text content limit for a continuous sentence is specifically n > 50.
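A minimal sketch of the fifty-word threshold test; whitespace splitting stands in for a real tokenizer and is only illustrative:

```python
def classify_text(text, limit=50):
    """Classify recorded text by its word count n: n <= limit means a
    single sentence, n > limit means a continuous sentence."""
    n = len(text.split())   # for Chinese text a proper word segmenter
                            # would be needed; .split() is illustrative
    return "single" if n <= limit else "continuous"
```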
In one embodiment: the step of classifying the single sentence comprises the following steps:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the single statement;
obtaining the feature vector representation of different potential factors mapped to the text in the statement data through integration;
updating the connection times of the statement data characteristic vector and the text characteristic vector on each potential factor by an iterative method;
and integrating the statement data feature vectors to obtain the single statement data after calculation.
In a specific embodiment: the sentence data of the single sentence is sorted to obtain the sentence data characteristic vector of the single sentence, wherein the sentence data characteristic vector comprises a word sequence and a position code; integrating the word sequence and the position codes in statement data to obtain feature vectors of other different potential factor mapping texts, wherein the feature vectors comprise phoneme sequences and phoneme features; performing vector connection on the statement data by using the statement data characteristic vector and the characteristic vector of the latent factor mapping text through an iteration method, and reintegrating the iteration-completed statement data to obtain the calculation-completed single statement data;
the integration specifically comprises the steps of calling out word sequences and position codes corresponding to the word sequences from all data in statement data;
the process of vector connection of the sentence data through iteration is specifically: the word sequence and the position code are indexed against the sentence data, and mapped to the feature vectors of the different latent-factor-mapped texts, namely the phoneme sequence and the phoneme features; the word sequence, position code, phoneme sequence and phoneme features are then connected on a vector basis in the sentence data to obtain the classified single sentence data.
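A minimal sketch of one possible reading of this iterative connection-and-integration step (the disclosure does not give the update rule, so the dot-product weighting and learning rate are assumptions); the continuous-sentence case in the next embodiment is the same with several text vectors:

```python
import numpy as np

def connect_and_integrate(stmt_vecs, factor_vecs, n_iters=10, lr=0.1):
    """Iteratively update the connection weights between statement-data
    feature vectors and latent-factor text feature vectors, then integrate
    the weighted statement vectors into one classified representation."""
    stmt = np.asarray(stmt_vecs, dtype=float)       # shape (n_stmt, dim)
    factors = np.asarray(factor_vecs, dtype=float)  # shape (n_factor, dim)
    weights = np.zeros((len(stmt), len(factors)))
    for _ in range(n_iters):
        weights += lr * stmt @ factors.T            # strengthen links each pass
    scores = weights.sum(axis=1, keepdims=True)     # total connection per vector
    return (stmt * scores).sum(axis=0)              # integrated statement vector
```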
In one embodiment: the step of classifying the continuous sentences comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the continuous statements;
obtaining feature vector representations of different potential factors mapped to a plurality of texts in statement data through integration;
updating the connection times of the sentence data feature vector and the text feature vectors on each potential factor by an iterative method;
and integrating the sum of the statement data feature vectors to obtain the continuous statement data after calculation.
In a specific embodiment: sorting the text content of the sentence data of the continuous sentences to obtain the sentence data characteristic vectors of the continuous sentences, wherein the sentence data characteristic vectors comprise word sequences and position codes; integrating the word sequences and the position codes in statement data to obtain feature vectors of other different potential factors mapping a plurality of texts, wherein the feature vectors comprise a plurality of phoneme sequences and a plurality of phoneme features; performing vector connection on statement data by using the statement data characteristic vector and the characteristic vectors of the texts with the latent factors mapped through an iteration method, and reintegrating the iteration-completed statement data to obtain calculated and completed continuous statement data;
the integration specifically comprises the steps of calling out word sequences and position codes corresponding to the word sequences from all data in statement data;
the process of vector connection of the sentence data through iteration is specifically: the word sequences and position codes are indexed against the sentence data, and the different latent factors are mapped to the feature vectors of the several texts, namely the several phoneme sequences and phoneme features; the word sequences, position codes, phoneme sequences and phoneme features are then connected on a vector basis in the sentence data to obtain the classified continuous sentence data, wherein the position codes of the word sequences and the phoneme sequences stand in a symmetric relation to the position codes of the phoneme features, that is, the position code of a word sequence and its phoneme features occupy mutually symmetric coordinate positions in the sentence data.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A speech recognition processing videoconferencing system, the speech recognition processing videoconferencing system comprising:
the acquisition module is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary people characteristics and important people characteristics; receiving and recording the voice of a secondary conference character and the voice of an important conference character, wherein the voice of the secondary conference character and the voice of the important conference character comprise character voice and character sentences;
the recognition module is used for recognizing the voice of the secondary conference character and the voice of the important conference character to obtain a statement data format for receiving and recording the voice, listing statement data forms of the secondary conference character and the important conference character according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary conference character and the important conference character; and converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing.
2. The speech recognition processing videoconferencing system of claim 1, wherein the obtaining module further comprises:
the acquisition subunit is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary characteristics and important characteristics;
and the receiving and recording subunit is used for receiving and recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences.
3. The speech recognition processing videoconferencing system of claim 1, wherein the recognition module further comprises:
the recognition subunit is configured to recognize the voices of the secondary conference characters and the voices of the important conference characters to obtain a statement data format of the recorded voices, list statement data forms of the secondary conference characters and the important conference characters according to the statement data format, and classify text formats of single statements and continuous statements according to the statement data forms of the secondary conference characters and the important conference characters;
and the operator pushing unit is used for converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification and inputting the voice classification to a preset conference screen, and further obtaining voice data which is processed by voice recognition.
4. A voice recognition processing video conference method, performed by the voice recognition processing video conference system according to any one of claims 1 to 3, the voice recognition processing video conference method comprising:
acquiring the characteristics of the conference characters, wherein the characteristics of the conference characters comprise secondary character characteristics and important character characteristics;
recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
recognizing the voices of the secondary characters of the conference and the voices of the important characters of the conference to obtain a statement data format for receiving and recording the voices, listing statement data forms of the secondary characters of the conference and the important characters of the conference according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary characters of the conference and the important characters of the conference;
converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing; the specific process of statement data format conversion is as follows: the system compares the statement data with modeled sound data set by the system, matches sound data with similarity, and performs voice classification to extract acoustic features of the sound data.
5. The method of claim 4, wherein the step of obtaining the conference character features, which comprise secondary character features and important character features, further comprises:
identifying all the persons participating in the conference according to a preset conference figure form;
acquiring the information of the persons participating in the conference, wherein the information of the persons participating in the conference comprises clothes of the persons, body shapes of the persons and faces of the persons;
and matching and classifying the conference people according to the conference people information, wherein the conference people are conference secondary people and conference important people.
6. The method of claim 4, wherein the step of recording the secondary character speech and the important character speech further comprises:
adopting a preset sentence encoder to analyze and encode the recorded conference secondary character voice and conference important character voice to obtain a sentence data vector, wherein the sentence data vector comprises a sentence text model, sentence text features and a sentence text type;
performing vector word segmentation on the sentence data vector to obtain a plurality of vector word segments comprising sentence token strings and lexical token strings;
and listing the sentence token strings and the lexical token strings as a first text vector and a second text vector respectively.
7. The method of claim 4, wherein the step of recognizing the voices of the secondary character and the important character comprises:
generating, for the conference secondary character voice and the conference important character voice, the voice vector classes of voice frequency, voice filtering and voice decoding;
sampling the voice frequency of the recorded voice, with the sampling amount being half of the voice frequency, to obtain voice filtering of the completed sampling;
and after waveform data are obtained by sampling according to the voice filtering, extracting waveform data characteristic parameters, and performing parameter synthesis on the waveform data characteristic parameters through voice decoding preset by the system to obtain recognizable recorded voice.
8. The method of claim 4, wherein the step of classifying the text format of the single sentence and the continuous sentence comprises:
according to the preset limit on the amount of text content, the text content range of a sentence is limited to within fifty words or more than fifty words, expressed as n ≤ 50 or n > 50, where n denotes the word count of the sentence; checking the recorded statement data form, text content whose range is n ≤ 50 is selected as a single sentence, and text content whose range is n > 50 is selected as a continuous sentence.
9. The method of claim 4, wherein the step of classifying the single sentence comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the single statement;
obtaining a feature vector representation of mapping different potential factors in statement data to a single text through integration;
updating the connection times of the statement data feature vector and the single text feature vector on each potential factor by an iterative method;
and integrating the statement data feature vectors to obtain the single statement data after calculation.
10. The method of claim 4, wherein the step of classifying the continuous sentences comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the continuous statements;
obtaining feature vector representations of different potential factors mapped to a plurality of texts in statement data through integration;
updating the connection times of the sentence data feature vector and the text feature vectors on each potential factor by an iterative method;
and integrating the sum of the statement data feature vectors to obtain the continuous statement data after calculation.
CN202111284405.9A 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition Pending CN113707150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111284405.9A CN113707150A (en) 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111284405.9A CN113707150A (en) 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition

Publications (1)

Publication Number Publication Date
CN113707150A true CN113707150A (en) 2021-11-26

Family

ID=78647589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111284405.9A Pending CN113707150A (en) 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition

Country Status (1)

Country Link
CN (1) CN113707150A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104519303A (en) * 2013-09-29 2015-04-15 华为技术有限公司 Multi-terminal conference communication processing method and device
CN110022454A (en) * 2018-01-10 2019-07-16 华为技术有限公司 A kind of method and relevant device identifying identity in video conference
US20200192986A1 (en) * 2018-12-17 2020-06-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for translating speech
CN109887508A (en) * 2019-01-25 2019-06-14 广州富港万嘉智能科技有限公司 A kind of meeting automatic record method, electronic equipment and storage medium based on vocal print
US20200372140A1 (en) * 2019-05-23 2020-11-26 Microsoft Technology Licensing, Llc System and method for authorizing temporary data access to a virtual assistant
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium
CN110881115A (en) * 2019-12-24 2020-03-13 新华智云科技有限公司 Strip splitting method and system for conference video
CN111447397A (en) * 2020-03-27 2020-07-24 深圳市贸人科技有限公司 Translation method and translation device based on video conference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211126