CN113707150A - System and method for processing video conference by voice recognition - Google Patents

System and method for processing video conference by voice recognition

Info

Publication number
CN113707150A
Authority
CN
China
Prior art keywords
conference
voice
character
important
characters
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111284405.9A
Other languages
Chinese (zh)
Inventor
安佳兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yunji Intelligent Information Co ltd
Original Assignee
Shenzhen Yunji Intelligent Information Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yunji Intelligent Information Co ltd filed Critical Shenzhen Yunji Intelligent Information Co ltd
Priority to CN202111284405.9A
Publication of CN113707150A
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 - Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/65 - Clustering; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/14 - Systems for two-way working
    • H04N7/15 - Conference systems

Abstract

The invention provides a system and a method for processing a video conference by voice recognition, applied in the field of communication technology. The method acquires conference character features, which comprise secondary character features and important character features; records the voices of the conference secondary characters and the conference important characters; recognizes those voices to obtain a statement data format for the recorded speech, lists statement data forms of the conference secondary characters and the conference important characters according to that format, and classifies the text formats of single statements and continuous statements according to the conference characters' statement data forms. It then converts the statement data features in the statement data forms of the conference secondary characters and the conference important characters, performs voice classification, and inputs the result to a preset conference screen to obtain voice data that has undergone voice recognition processing. By correcting repeated or invalid data that may arise during voice recognition translation and conversion, the invention effectively reduces translation errors in the course of a voice conference.

Description

System and method for processing video conference by voice recognition
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a system and a method for processing a video conference by voice recognition.
Background
With the development of communication technology, conference formats have gradually diversified; existing formats include not only the traditional single-point conference but also multipoint video conferences, multipoint voice conferences, and the like. A multipoint conference is a real-time conference established across different physical locations by means of audio and/or video communication, usually with different participants at each location.
The participants of the current video conference system can synchronously see and hear images and sounds of the participants in other conference places in real time, and can also send electronic documents in real time, thereby greatly reducing the conference cost and compressing the conference time.
However, in the prior art, the voice generated during a video conference is often subject to recognition interference from various unexpected situations, which hinders recording of the whole video conference; for example, the translated speech and translated captions accompanying the conference screen during the video conference may deviate from what was actually said.
In view of the above, the present invention provides a system and a method for processing a video conference by voice recognition to solve the problem of deviation of translated languages and translated words associated with a conference screen during the video conference.
Disclosure of Invention
The invention aims to solve the problem that the translation language and the translation characters associated with a conference screen are deviated in the video conference process, and provides a system and a method for processing a video conference by voice recognition.
The invention provides a voice recognition processing video conference system, comprising:
the acquisition module is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary people characteristics and important people characteristics; recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
the recognition module is used for recognizing the conference secondary character voice and the conference important character voice to obtain a statement data format for receiving and recording the voice, listing statement data forms of the conference secondary character and the conference important character according to the statement data format, and classifying text formats of single statements and continuous statements according to the conference character statement data forms; and converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing.
Further, the obtaining module further comprises:
the acquisition subunit is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary characteristics and important characteristics;
and the receiving and recording subunit is used for receiving and recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences.
Further, the identification module further comprises:
the recognition subunit is configured to recognize the voices of the secondary conference characters and the voices of the important conference characters to obtain a statement data format of the recorded voices, list statement data forms of the secondary conference characters and the important conference characters according to the statement data format, and classify text formats of single statements and continuous statements according to the statement data forms of the secondary conference characters and the important conference characters;
And the operator pushing unit is used for converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification and inputting the voice classification to a preset conference screen, and further obtaining voice data which is processed by voice recognition.
The invention also provides a method for processing the video conference by voice recognition, which comprises the following steps:
acquiring the characteristics of the conference characters, wherein the characteristics of the conference characters comprise secondary character characteristics and important character characteristics;
recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
recognizing the voices of the secondary characters of the conference and the voices of the important characters of the conference to obtain a statement data format for receiving and recording the voices, listing statement data forms of the secondary characters of the conference and the important characters of the conference according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary characters of the conference and the important characters of the conference;
converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing; the specific process of statement data format conversion is as follows: the system compares the statement data with modeled sound data set by the system, matches sound data with similarity, and performs voice classification to extract acoustic features of the sound data.
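As a minimal sketch of the matching in the last step above, assuming the modeled sound data is stored as labeled acoustic feature vectors and using cosine similarity with an illustrative 0.8 threshold (neither the similarity measure nor the threshold is specified in the disclosure):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two acoustic feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def match_modeled_sound(utterance_vec, modeled_sounds, threshold=0.8):
    """Compare statement data against the system's modeled sound data and
    return the best-matching label above the threshold, or None."""
    best_label, best_score = None, threshold
    for label, model_vec in modeled_sounds.items():  # label -> feature vector
        score = cosine_similarity(utterance_vec, model_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score
```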
Further, the step of obtaining the conference character features including the secondary character features and the important character features further comprises the following steps:
identifying all the persons participating in the conference according to a preset conference figure form;
acquiring the information of the persons participating in the conference, wherein the information of the persons participating in the conference comprises clothes of the persons, body shapes of the persons and faces of the persons;
and matching and classifying the conference people according to the conference people information, wherein the conference people are conference secondary people and conference important people.
Further, the step of receiving the voice of the secondary conference character and the voice of the important conference character further comprises:
adopting a preset sentence encoder to analyze and encode the recorded conference secondary character voice and conference important character voice to obtain sentence data vectors, wherein the sentence data vectors comprise a sentence text model, sentence text features and a sentence text type;
performing vector word segmentation on the sentence data vectors to obtain a plurality of vector word segments comprising sentence token strings and lexical token strings;
and listing the sentence token strings and the lexical token strings as a first text vector and a second text vector respectively.
Further, the step of recognizing the voices of the secondary character and the important character of the conference comprises the following steps:
generating, for the conference secondary character voice and the conference important character voice, the voice vector classes comprising voice frequency, voice filtering and voice decoding;
sampling the voice frequency of the recorded voice, with the sampling amount being half of the voice frequency, to obtain voice filtering of the completed sampling;
and after waveform data are obtained by sampling according to the voice filtering, extracting waveform data characteristic parameters, and performing parameter synthesis on the waveform data characteristic parameters through voice decoding preset by the system to obtain recognizable recorded voice.
Further, the step of classifying the text formats of the single sentence and the continuous sentence includes:
according to the preset limit on the amount of text content, the text content range of a sentence is limited to within fifty words or more than fifty words, expressed as n ≤ 50 or n > 50, where n denotes the word count of the sentence; checking the recorded statement data form, text content whose range is n ≤ 50 is selected as a single sentence, and text content whose range is n > 50 is selected as a continuous sentence.
Further, the step of classifying the single sentence includes:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the single statement;
obtaining a feature vector representation of mapping different potential factors in statement data to a single text through integration;
updating the connection times of the statement data feature vector and the single text feature vector on each potential factor by an iterative method;
and integrating the statement data feature vectors to obtain the single statement data after calculation.
Further, the step of classifying the continuous sentences comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the continuous statements;
obtaining feature vector representations of different potential factors mapped to a plurality of texts in statement data through integration;
updating the connection times of the sentence data feature vector and the text feature vectors on each potential factor by an iterative method;
and integrating the sum of the statement data feature vectors to obtain the continuous statement data after calculation.
The invention provides a system and a method for processing a video conference by voice recognition, which have the following beneficial effects:
the invention effectively reduces the problem of translation error in the voice conference process by correcting repeated or invalid data possibly occurring in the voice recognition translation conversion process.
Drawings
FIG. 1 is a block diagram of a voice recognition processing video conferencing system according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a voice recognition processing video conference method according to an embodiment of the present invention.
Detailed Description
The objects, features and advantages of the present invention will be further described with reference to the accompanying drawings; it should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by persons skilled in the art from the embodiments given herein without making any inventive step, are within the scope of the present invention.
Referring to fig. 1, a voice recognition processing video conference system in one embodiment of the present invention comprises:
the acquisition module is used for acquiring the conference character characteristics including secondary character characteristics and important character characteristics; recording the voice of the secondary characters of the conference and the voice of the important characters of the conference;
the recognition module is used for recognizing the conference secondary character voice and the conference important character voice to obtain a statement data format for receiving and recording the voice, listing statement data forms of the conference secondary character and the conference important character according to the statement data format, and classifying text formats of single statements and continuous statements according to the conference character statement data forms; and converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing.
In a specific embodiment: the acquisition module acquires the conference character features including secondary character features and important character features; recording the voice of the secondary characters of the conference and the voice of the important characters of the conference; the recognition module recognizes the conference secondary character voice and the conference important character voice to obtain a statement data format of the recorded voice, lists statement data forms of the conference secondary character and the conference important character according to the statement data format, and classifies text formats of single statements and continuous statements according to the conference character statement data forms; converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing;
wherein the character voice is specifically: the sound characteristics of the conference secondary characters and the conference important characters, including the different pitches and wavelengths produced by each character; the character sentences are specifically the timbre and loudness of each character's utterances in the conference;
the statement data format specifically includes: framing the voice signal of each person in the conference; framing is required because the voice signal changes rapidly, and in voice recognition the frame length is generally 20-50 ms, so that one frame contains enough signal periods while the signal does not change too violently within it; each frame of the voice signal is usually multiplied by a window function so that both ends of the frame attenuate smoothly to zero, which reduces distortion after voice conversion and yields a higher-quality spectrum; the time difference between adjacent frames is generally about 10 ms, so the frames overlap and the signal at the frame boundaries is not lost;
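A minimal sketch of this framing scheme, assuming a 25 ms frame length within the stated 20-50 ms range, a 10 ms frame shift, and a Hamming window as the smoothing function (the specific window is an assumption):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a speech signal into overlapping frames (20-50 ms long,
    ~10 ms apart) and apply a Hamming window so that both ends of each
    frame attenuate smoothly to zero."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len + shift - 1) // shift)
    pad = (n_frames - 1) * shift + frame_len - len(signal)
    if pad > 0:                                # pad so the last frame is full
        signal = np.pad(signal, (0, pad))
    window = np.hamming(frame_len)
    return np.stack([signal[i * shift : i * shift + frame_len] * window
                     for i in range(n_frames)])
```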
the text format of the single sentence is specifically: short or instantaneous speech data generated by each person in the conference, for example filler words uttered by a character such as "uh", "okay", "yes" and "right";
the text format of the continuous sentence is specifically: long or extended speech data generated by each person in the conference, for example opening remarks such as "Ladies and gentlemen, welcome to attend the conference … …";
the voice data form specifically includes: all voice data generated by the conference secondary characters and the conference important characters over the whole conference, including voice data from before, during and after the conference process; the voice data from before, during and after the conference are re-integrated in chronological order to obtain the voice data form;
the speech classification specifically includes: extracting acoustic features of sound data in a system database, wherein the acoustic features comprise word sequences, position codes, phoneme sequences and phoneme features in sentences, and splicing the word sequences, the position codes, the phoneme sequences and the phoneme features to obtain word acoustic features;
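A minimal sketch of splicing the word acoustic features, assuming each of the four named streams is already available as a numeric array; the function name and array layout are illustrative:

```python
import numpy as np

def splice_word_acoustic_features(word_seq, position_codes,
                                  phoneme_seq, phoneme_features):
    """Splice the word sequence, position codes, phoneme sequence and
    phoneme features into one word-level acoustic feature vector."""
    parts = [np.asarray(p, dtype=float).ravel()
             for p in (word_seq, position_codes, phoneme_seq, phoneme_features)]
    return np.concatenate(parts)
```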
the specific process of sentence data feature conversion is as follows: acquire the text vectors and token strings of the voice data and merge them into a text token set; identify the character features of the character strings in the token set, construct character vectors corresponding to those character features, and re-integrate the character vector set of the token set; perform similar elimination on the re-integrated character vector set to obtain the voice data finished by voice classification, which specifically includes conversion between different languages, for example English converted into Chinese or Chinese converted into English;
the similar elimination of the character vector set after the reintegration is specifically to add the constructed words into the character strings in the token set, search whether the character strings in the token set have the same character strings or the vectors of the same character strings, and correspondingly delete and correct the same character strings or the vectors of the same character strings.
Referring to fig. 2, a method for processing a video conference by voice recognition according to an embodiment of the present invention comprises:
S1: acquiring the characteristics of the conference characters, wherein the characteristics of the conference characters comprise secondary character characteristics and important character characteristics;
S2: recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
S3: recognizing the voices of the secondary characters of the conference and the voices of the important characters of the conference to obtain a statement data format for receiving and recording the voices, listing statement data forms of the secondary characters of the conference and the important characters of the conference according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary characters of the conference and the important characters of the conference;
S4: converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing; the specific process of statement data format conversion is as follows: the system compares the statement data with modeled sound data set by the system, matches sound data with similarity, and performs voice classification to extract acoustic features of the sound data.
In a specific embodiment: acquiring the characteristics of the conference characters, recording the voices of the secondary conference characters and the voices of the important conference characters, identifying the voices of the secondary conference characters and the voices of the important conference characters to obtain a sentence data format of the recorded voices, listing sentence data forms of the secondary conference characters and the important conference characters according to the sentence data format, classifying text formats of single sentences and continuous sentences according to the sentence data forms of the conference characters, converting the sentence data characteristics in the sentence data forms of the secondary conference characters and the important conference characters to perform voice classification, inputting the converted sentence data characteristics to a preset conference screen, and further obtaining voice data after voice recognition processing.
In one embodiment: the step of acquiring the conference character features including the secondary character features and the important character features further comprises the following steps:
identifying all the persons participating in the conference according to a preset conference figure form;
acquiring the information of the persons participating in the conference, wherein the information of the persons participating in the conference comprises clothes of the persons, body shapes of the persons and faces of the persons;
and matching and classifying the conference people according to the conference people information, wherein the conference people are conference secondary people and conference important people.
In a specific embodiment: all the people participating in the conference are captured and matched against the preset conference participants according to their clothing, body shape and face, so as to distinguish secondary characters from important characters among the participants, and corresponding matching operations and classification are performed on the identified people:
a person not wearing conference clothing is discriminated as a secondary character, for example a cleaner doing sanitation work or a secretary organizing the conference process;
a person wearing conference clothing is discriminated as an important character, for example a manager wearing conference attire and a badge, or the CEO leading the conference process.
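A minimal sketch of this matching and classification, assuming the captured attributes are reduced to a face identifier and a clothing label, and that the preset roster is keyed by face identifier; all field names are hypothetical:

```python
def classify_participants(captured_people, preset_roster):
    """Match each captured person's attributes against a preset participant
    roster; roster matches wearing conference clothing are important
    characters, everyone else is a secondary character."""
    important, secondary = [], []
    for person in captured_people:
        entry = preset_roster.get(person["face_id"])  # None if not on roster
        if entry is not None and person["clothing"] == "conference":
            important.append(entry["name"])
        else:
            secondary.append(person.get("name", "unknown"))
    return important, secondary
```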
In one embodiment: the step of recording the voice of the secondary conference character and the voice of the important conference character further comprises the following steps:
adopting a preset sentence encoder to analyze and encode the recorded conference secondary character voice and conference important character voice to obtain sentence data vectors, wherein the sentence data vectors comprise a sentence text model, sentence text features and a sentence text type;
performing vector word segmentation on the sentence data vectors to obtain a plurality of vector word segments comprising sentence token strings and lexical token strings;
and listing the sentence token strings and the lexical token strings as a first text vector and a second text vector respectively.
In a specific embodiment: performing voice coding analysis on the conference figure voice by adopting a preset sentence coder to obtain a sentence data vector which is specifically a sentence text model, a sentence text characteristic and a sentence text type; carrying out vector word segmentation on the sentence text model, the sentence text characteristics and the sentence text type to obtain a plurality of vector word segments comprising token strings of sentences and token strings of lexical methods;
the voice coding analysis specifically compresses and decompresses the audio signal through a dual-rate speech coding algorithm operating at a bit rate of 5.3 kbps; because silence compression is adopted for discontinuous transmission, bandwidth can be conserved while repeated switching of the carrier signal on and off is avoided;
vector word segmentation specifically converts the token words in the sentence data into combined token strings, for example: the single characters "欢", "迎", "大", "家", "参", "加", "会" and "议" in the sentence data are combined into the sentence "欢迎大家参加会议" ("Welcome everyone to attend the conference").
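A minimal sketch of combining single-character tokens into a token string, using the example sentence above:

```python
def combine_tokens(tokens):
    """Merge single-character tokens from the statement data back into a
    combined token string (a sentence)."""
    return "".join(tokens)

print(combine_tokens(["欢", "迎", "大", "家", "参", "加", "会", "议"]))
# -> 欢迎大家参加会议 ("Welcome everyone to attend the conference")
```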
In one embodiment: the step of identifying the voice of the secondary conference character and the voice of the important conference character comprises the following steps:
generating, for the conference secondary character voice and the conference important character voice, the voice vector classes of voice frequency, voice filtering and voice decoding;
sampling the voice frequency of the recorded voice, with the sampling amount being half of the voice frequency, to obtain voice filtering of the completed sampling;
and after waveform data are obtained by sampling according to the voice filtering, extracting waveform data characteristic parameters, and performing parameter synthesis on the waveform data characteristic parameters through voice decoding preset by the system to obtain recognizable recorded voice.
In a specific embodiment: generating the recorded voice into a voice vector category comprising voice frequency, voice filtering and voice decoding; sampling the frequency of the recorded voice, and taking half of the voice frequency to obtain voice filtering after sampling; acquiring waveform data by collecting samples in the sampled voice filtering, extracting characteristic parameters of the waveform data, and performing parameter synthesis by combining voice decoding and the characteristic parameters of the waveform data to obtain recognizable recorded voice;
the sampling process of the voice frequency specifically collects frequencies with special wavelengths in the voice, for example: where the speaker's voice is raised, its frequency differs slightly from that of the other speech, and such a frequency is a special frequency;
extracting the characteristic parameters of the waveform data, specifically recording the special frequency voice in the voice generating process according to the time sequence, and taking the special frequency with the time sequence as the characteristic parameters;
the process of parameter synthesis by combining the system's preset voice decoding with the waveform data characteristic parameters is specifically that the parameters are changed once for the decoding of each frame of voice data, and for voiced frequency segments frame-synchronous synthesis is performed according to the chosen moments at which the control parameters change; frame-synchronous synthesis means changing the parameters on a frame-by-frame basis.
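A minimal sketch of the sampling and feature-parameter extraction described above, taking every other sample as a crude half-rate sampling and recording each frame's dominant frequency in time order; the FFT-based dominant-frequency choice is an assumption:

```python
import numpy as np

def extract_feature_parameters(signal, sample_rate, frame_ms=25.0):
    """Take every other sample (half the original frequency), then record
    each frame's dominant frequency in time order as the waveform-data
    characteristic parameters."""
    half = np.asarray(signal, dtype=float)[::2]   # crude half-rate sampling
    rate = sample_rate // 2
    frame_len = int(rate * frame_ms / 1000)
    params = []
    for start in range(0, len(half) - frame_len + 1, frame_len):
        frame = half[start:start + frame_len]
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / rate)
        params.append((start / rate, float(freqs[np.argmax(spectrum)])))
    return params  # [(time_s, dominant_frequency_hz), ...]
```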
In one embodiment: the step of classifying the text formats of the single sentence and the continuous sentence includes:
according to the preset limit on the amount of text content, the text content range of a sentence is limited to within fifty words or more than fifty words, expressed as n ≤ 50 or n > 50, where n denotes the word count of the sentence; checking the recorded statement data form, text content whose range is n ≤ 50 is selected as a single sentence, and text content whose range is n > 50 is selected as a continuous sentence.
In a specific embodiment: the text format contents of a single sentence and a continuous sentence are obtained respectively, and range limits are applied according to the preset text content limit; the text content limit for a single sentence is specifically n ≤ 50, and the text content limit for a continuous sentence is specifically n > 50.
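A minimal sketch of the fifty-word threshold test; whitespace splitting stands in for a real tokenizer and is only illustrative:

```python
def classify_text(text, limit=50):
    """Classify recorded text by its word count n: n <= limit means a
    single sentence, n > limit means a continuous sentence."""
    n = len(text.split())   # for Chinese text a proper word segmenter
                            # would be needed; .split() is illustrative
    return "single" if n <= limit else "continuous"
```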
In one embodiment: the step of classifying the single sentence comprises the following steps:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the single statement;
obtaining the feature vector representation of different potential factors mapped to the text in the statement data through integration;
updating the connection times of the statement data characteristic vector and the text characteristic vector on each potential factor by an iterative method;
and integrating the statement data feature vectors to obtain the single statement data after calculation.
In a specific embodiment: the sentence data of the single sentence is sorted to obtain the sentence data characteristic vector of the single sentence, wherein the sentence data characteristic vector comprises a word sequence and a position code; integrating the word sequence and the position codes in statement data to obtain feature vectors of other different potential factor mapping texts, wherein the feature vectors comprise phoneme sequences and phoneme features; performing vector connection on the statement data by using the statement data characteristic vector and the characteristic vector of the latent factor mapping text through an iteration method, and reintegrating the iteration-completed statement data to obtain the calculation-completed single statement data;
the integration specifically comprises the steps of calling out word sequences and position codes corresponding to the word sequences from all data in statement data;
the process of vector connection of the sentence data through iteration is specifically: the word sequence and the position code are indexed against the sentence data, and mapped to the feature vectors of the different latent-factor-mapped texts, namely the phoneme sequence and the phoneme features; the word sequence, position code, phoneme sequence and phoneme features are then connected on a vector basis in the sentence data to obtain the classified single sentence data.
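A minimal sketch of one possible reading of this iterative connection-and-integration step (the disclosure does not give the update rule, so the dot-product weighting and learning rate are assumptions); the continuous-sentence case in the next embodiment is the same with several text vectors:

```python
import numpy as np

def connect_and_integrate(stmt_vecs, factor_vecs, n_iters=10, lr=0.1):
    """Iteratively update the connection weights between statement-data
    feature vectors and latent-factor text feature vectors, then integrate
    the weighted statement vectors into one classified representation."""
    stmt = np.asarray(stmt_vecs, dtype=float)       # shape (n_stmt, dim)
    factors = np.asarray(factor_vecs, dtype=float)  # shape (n_factor, dim)
    weights = np.zeros((len(stmt), len(factors)))
    for _ in range(n_iters):
        weights += lr * stmt @ factors.T            # strengthen links each pass
    scores = weights.sum(axis=1, keepdims=True)     # total connection per vector
    return (stmt * scores).sum(axis=0)              # integrated statement vector
```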
In one embodiment: the step of classifying the continuous sentences comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the continuous statements;
obtaining feature vector representations of different potential factors mapped to a plurality of texts in statement data through integration;
updating the connection times of the sentence data feature vector and the text feature vectors on each potential factor by an iterative method;
and integrating the sum of the statement data feature vectors to obtain the continuous statement data after calculation.
In a specific embodiment: sorting the text content of the sentence data of the continuous sentences to obtain the sentence data characteristic vectors of the continuous sentences, wherein the sentence data characteristic vectors comprise word sequences and position codes; integrating the word sequences and the position codes in statement data to obtain feature vectors of other different potential factors mapping a plurality of texts, wherein the feature vectors comprise a plurality of phoneme sequences and a plurality of phoneme features; performing vector connection on statement data by using the statement data characteristic vector and the characteristic vectors of the texts with the latent factors mapped through an iteration method, and reintegrating the iteration-completed statement data to obtain calculated and completed continuous statement data;
the integration specifically comprises the steps of calling out word sequences and position codes corresponding to the word sequences from all data in statement data;
the process of vector connection of the sentence data through iteration is specifically: the word sequences and position codes are indexed against the sentence data, and the different latent factors are mapped to the feature vectors of the several texts, namely the several phoneme sequences and phoneme features; the word sequences, position codes, phoneme sequences and phoneme features are then connected on a vector basis in the sentence data to obtain the classified continuous sentence data, wherein the position codes of the word sequences and the phoneme sequences stand in a symmetric relation to the position codes of the phoneme features, that is, the position code of a word sequence and its phoneme features occupy mutually symmetric coordinate positions in the sentence data.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A speech recognition processing videoconferencing system, the speech recognition processing videoconferencing system comprising:
the acquisition module is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary people characteristics and important people characteristics; receiving and recording the voice of a secondary conference character and the voice of an important conference character, wherein the voice of the secondary conference character and the voice of the important conference character comprise character voice and character sentences;
the recognition module is used for recognizing the voice of the secondary conference character and the voice of the important conference character to obtain a statement data format for receiving and recording the voice, listing statement data forms of the secondary conference character and the important conference character according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary conference character and the important conference character; and converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing.
2. The speech recognition processing videoconferencing system of claim 1, wherein the obtaining module further comprises:
the acquisition subunit is used for acquiring the characteristics of the conference people, wherein the characteristics of the conference people comprise secondary characteristics and important characteristics;
and the receiving and recording subunit is used for receiving and recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences.
3. The speech recognition processing videoconferencing system of claim 1, wherein the recognition module further comprises:
the recognition subunit is configured to recognize the voices of the secondary conference characters and the voices of the important conference characters to obtain a statement data format of the recorded voices, list statement data forms of the secondary conference characters and the important conference characters according to the statement data format, and classify text formats of single statements and continuous statements according to the statement data forms of the secondary conference characters and the important conference characters;
and the operator pushing unit is used for converting the sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification and inputting the voice classification to a preset conference screen, and further obtaining voice data which is processed by voice recognition.
4. A voice recognition processing video conference method, performed by the voice recognition processing video conference system according to any one of claims 1 to 3, the voice recognition processing video conference method comprising:
acquiring the characteristics of the conference characters, wherein the characteristics of the conference characters comprise secondary character characteristics and important character characteristics;
recording the secondary conference character voice and the important conference character voice, wherein the secondary conference character voice and the important conference character voice comprise character voice and character sentences;
recognizing the voices of the secondary characters of the conference and the voices of the important characters of the conference to obtain a statement data format for receiving and recording the voices, listing statement data forms of the secondary characters of the conference and the important characters of the conference according to the statement data format, and classifying text formats of single statements and continuous statements according to the statement data forms of the secondary characters of the conference and the important characters of the conference;
converting sentence data characteristics in the sentence data form of the conference secondary characters and the conference important characters, performing voice classification, and inputting the voice classification to a preset conference screen to obtain voice data subjected to voice recognition processing; the specific process of statement data format conversion is as follows: the system compares the statement data with modeled sound data set by the system, matches sound data with similarity, and performs voice classification to extract acoustic features of the sound data.
5. The method of claim 4, wherein the step of obtaining the conference character features, which comprise secondary character features and important character features, further comprises:
identifying all the persons participating in the conference according to a preset conference figure form;
acquiring the information of the persons participating in the conference, wherein the information of the persons participating in the conference comprises clothes of the persons, body shapes of the persons and faces of the persons;
and matching and classifying the conference people according to the conference people information, wherein the conference people are conference secondary people and conference important people.
6. The method of claim 4, wherein the step of recording the secondary character speech and the important character speech further comprises:
adopting a preset sentence encoder to analyze and encode the recorded conference secondary character voice and conference important character voice to obtain a sentence data vector, wherein the sentence data vector comprises a sentence text model, sentence text features and a sentence text type;
performing vector word segmentation on the sentence data vector to obtain a plurality of vector word segments comprising sentence token strings and lexical token strings;
and listing the sentence token strings and the lexical token strings as a first text vector and a second text vector respectively.
7. The method of claim 4, wherein the step of recognizing the voices of the secondary character and the important character comprises:
generating, for the conference secondary character voice and the conference important character voice, the voice vector classes of voice frequency, voice filtering and voice decoding;
sampling the voice frequency of the recorded voice, with the sampling amount being half of the voice frequency, to obtain voice filtering of the completed sampling;
and after waveform data are obtained by sampling according to the voice filtering, extracting waveform data characteristic parameters, and performing parameter synthesis on the waveform data characteristic parameters through voice decoding preset by the system to obtain recognizable recorded voice.
8. The method of claim 4, wherein the step of classifying the text format of the single sentence and the continuous sentence comprises:
according to the preset limit on the amount of text content, the text content range of a sentence is limited to within fifty words or more than fifty words, expressed as n ≤ 50 or n > 50, where n denotes the word count of the sentence; checking the recorded statement data form, text content whose range is n ≤ 50 is selected as a single sentence, and text content whose range is n > 50 is selected as a continuous sentence.
9. The method of claim 4, wherein the step of classifying the single sentence comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the single statement;
obtaining a feature vector representation of mapping different potential factors in statement data to a single text through integration;
updating the connection times of the statement data feature vector and the single text feature vector on each potential factor by an iterative method;
and integrating the statement data feature vectors to obtain the single statement data after calculation.
10. The method of claim 4, wherein the step of classifying the continuous sentences comprises:
arranging text contents preset in the statement data calculation layer to obtain statement data characteristic vectors of the continuous statements;
obtaining feature vector representations of different potential factors mapped to a plurality of texts in statement data through integration;
updating the connection times of the sentence data feature vector and the text feature vectors on each potential factor by an iterative method;
and integrating the sum of the statement data feature vectors to obtain the continuous statement data after calculation.
CN202111284405.9A 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition Pending CN113707150A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111284405.9A CN113707150A (en) 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111284405.9A CN113707150A (en) 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition

Publications (1)

Publication Number Publication Date
CN113707150A true CN113707150A (en) 2021-11-26

Family

ID=78647589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111284405.9A Pending CN113707150A (en) 2021-11-01 2021-11-01 System and method for processing video conference by voice recognition

Country Status (1)

Country Link
CN (1) CN113707150A (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104519303A (en) * 2013-09-29 2015-04-15 华为技术有限公司 Multi-terminal conference communication processing method and device
CN110022454A (en) * 2018-01-10 2019-07-16 华为技术有限公司 A kind of method and relevant device identifying identity in video conference
US20200192986A1 (en) * 2018-12-17 2020-06-18 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for translating speech
CN109887508A (en) * 2019-01-25 2019-06-14 广州富港万嘉智能科技有限公司 A kind of meeting automatic record method, electronic equipment and storage medium based on vocal print
US20200372140A1 (en) * 2019-05-23 2020-11-26 Microsoft Technology Licensing, Llc System and method for authorizing temporary data access to a virtual assistant
CN110298252A (en) * 2019-05-30 2019-10-01 平安科技(深圳)有限公司 Meeting summary generation method, device, computer equipment and storage medium
CN110881115A (en) * 2019-12-24 2020-03-13 新华智云科技有限公司 Strip splitting method and system for conference video
CN111447397A (en) * 2020-03-27 2020-07-24 深圳市贸人科技有限公司 Translation method and translation device based on video conference

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211126