Disclosure of Invention
The invention aims to provide a voice data text conversion method based on multi-party communication, so as to solve the problems raised in the background art.
In order to achieve the above object, the present invention provides a voice data text conversion method based on multi-party communication, which comprises the following steps:
firstly, a preset password input by each of the multi-party equipment ends is identified, and two cases are distinguished:
case 1: if the preset password is correct, the equipment end is marked, the mark of each equipment end is output, and a group chat is constructed according to the marks of the equipment ends;
case 2: if the preset password is incorrect, the input window continues to pop up;
performing character conversion on voice data communicated by each equipment end in group chat;
storing the voice data and the converted text data through a memory;
extracting, from the memory, the voice data output by a preselected marking equipment end and the character data converted therefrom; identifying key data information of the preselected marking equipment end from the extracted character data to form a key title; and then extracting the voice data output by the other marking equipment ends after the key title and before the next key title appears, together with the character data converted therefrom, to form key character data;
and integrating the key character data and the key title, specifically, screening the key character data according to the key title to obtain valuable character data, and supplementing the valuable character data, the corresponding voice data and the equipment end marks into the display frame of the group chat in a mutually corresponding manner (a sketch of this screening step is given below).
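To make the screening step concrete, the following is a minimal Python sketch of how a key title could be used to screen key character data and assemble display entries. The helper names (`Message`, `matches_key_title`, the small "non-answer" word set) are illustrative assumptions made for this sketch and are not part of the claimed method.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Message:
    device_mark: str   # e.g. "boss A", "employee a1"
    voice_ref: str     # reference to the stored voice data in the memory
    text: str          # character data converted from the voice data

# Illustrative stand-in for the screening rule: replies with no substantive
# content are treated as not conforming to the key title.
NON_ANSWERS = {"unknown", "no questions", "none"}

def matches_key_title(key_title: str, text: str) -> bool:
    body = text.split(":", 1)[-1].strip().lower()
    return body not in NON_ANSWERS

def integrate(key_title: str, key_messages: List[Message]) -> List[dict]:
    """Screen the key character data by the key title and return the valuable
    character data together with voice data and equipment end marks, ready to
    be supplemented into the group-chat display frame."""
    kept = [m for m in key_messages if matches_key_title(key_title, m.text)]
    return [{"mark": m.device_mark, "voice": m.voice_ref, "text": m.text} for m in kept]
```

In a real system the relevance test would of course use the key data information (keywords, modal particles) rather than a fixed word list; the structure of the screening loop is the point of the sketch.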
As a further improvement of the technical scheme, the key data information of the preselected marking equipment end includes key character information, modal particle information and keyword extraction information.
As a further improvement of the technical scheme, the key data information extraction adopts a weighted extraction algorithm, and the algorithm steps are as follows:
sentence segmentation and punctuation are carried out according to the pauses between sounds and the tone of the sounds in the voice data, wherein the punctuation marks include periods, question marks and exclamation marks (a rough sketch of this segmentation step follows this list);
the word frequency, word length, part-of-speech, position and dictionary factors of the character data of the preselected marking equipment end are quantized using weighting coefficients, and weight calculation is performed after quantization to obtain the total weight of each word;
and the words are sorted in descending order of their total weights to obtain a keyword list, and the key data information is acquired through the keyword list.
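As a rough illustration of the first step, the sketch below segments a sequence of recognized words (with timestamps and an estimated tone label) into punctuated sentences using a pause-length threshold. The 0.6 s threshold, the `tone` labels and the input layout are assumptions made only for this example.

```python
from typing import List, Tuple

PAUSE_THRESHOLD = 0.6  # seconds of silence treated as a sentence boundary (assumed value)

def punctuate(words: List[Tuple[str, float, float, str]]) -> str:
    """words: (word, start_time, end_time, tone), where tone is one of
    'question', 'exclaim', 'neutral' as estimated from the voice data."""
    sentences, current = [], []
    for i, (word, start, end, tone) in enumerate(words):
        current.append(word)
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        if next_start is None or next_start - end >= PAUSE_THRESHOLD:
            mark = {"question": "?", "exclaim": "!"}.get(tone, ".")
            sentences.append(" ".join(current) + mark)
            current = []
    return " ".join(sentences)
```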
As a further improvement of the technical solution, the total weight of each word is calculated by the following formula:
W(w) = a1·f_tf(w) + a2·f_len(w) + a3·f_pos(w) + a4·f_loc(w) + a5·f_dict(w)
wherein W(w) is the factor total weight of word w in the character data; a1 is the word frequency factor ratio and f_tf(w) is the word frequency factor; a2 is the word length factor ratio and f_len(w) is the word length factor; a3 is the part-of-speech factor ratio and f_pos(w) is the part-of-speech factor; a4 is the position factor ratio and f_loc(w) is the position factor; a5 is the dictionary factor ratio and f_dict(w) is the dictionary factor; and a1 + a2 + a3 + a4 + a5 = 1.
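A small Python rendering of this weighted sum may help; the factor scores shown are placeholders, and the ratio values are the ones assigned in embodiment 4 below, used here purely for illustration.

```python
from typing import Dict

def total_weight(factors: Dict[str, float], ratios: Dict[str, float]) -> float:
    """W(w) = a1*f_tf + a2*f_len + a3*f_pos + a4*f_loc + a5*f_dict,
    with the factor ratios summing to 1."""
    assert abs(sum(ratios.values()) - 1.0) < 1e-9
    return sum(ratios[name] * factors[name] for name in ratios)

# Ratios as assigned in embodiment 4; factor scores are made-up placeholders.
ratios = {"tf": 0.4, "len": 0.2, "pos": 0.15, "loc": 0.15, "dict": 0.1}
factors = {"tf": 0.8, "len": 0.5, "pos": 1.0, "loc": 0.7, "dict": 0.0}
print(total_weight(factors, ratios))  # ≈ 0.675
```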
As a further improvement of the technical scheme, the character conversion comprises the following specific steps:
firstly, extracting audio data output by an equipment end, and then training the audio data by using a Gaussian mixture learning algorithm;
decomposing the source audio output voice with a harmonic-plus-noise model, correcting the decomposed model by using the average fundamental frequency ratio to obtain corrected harmonic amplitude and phase parameters, performing feature extraction on the harmonic amplitude and phase parameters to obtain line spectral frequency (LSF) parameters, mapping the line spectral frequency parameters by using the Gaussian mixture model, and fusing the mapped line spectral frequency parameter features;
and performing mixed output by using the corrected harmonic amplitude and phase parameters, and then extracting the text data of the source audio output voice.
As a further improvement of the technical solution, the gaussian mixture learning algorithm includes the following steps:
firstly, training on the source audio output voice and the target audio output voice, and decomposing the corresponding harmonic-plus-noise models;
calculating the average fundamental frequency ratio of the fundamental frequency tracks of the two output voices, and simultaneously performing feature extraction on the harmonic amplitude and phase parameters of the two output voices to obtain the corresponding line spectral frequency parameters;
and performing dynamic time warping on the obtained line spectral frequency parameters, and obtaining the Gaussian mixture model by using a variational Bayes estimation algorithm.
As a further improvement of the technical solution, the calculation formula of the variational Bayes estimation algorithm is as follows:
ln p(X) = ln ∫ p(X | Y) p(Y) dY
wherein ln p(X) is the logarithmic marginal density; X is the observed audio variable; Y is the text variable of the source audio output voice; p(Y | X) is the posterior probability of Y for a given X, which the variational Bayes algorithm approximates; and p(Y) is the prior probability of Y.
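For readers who want the standard variational form behind this estimate, the LaTeX fragment below writes out the usual decomposition of the log marginal density; the variational distribution q(Y) is an assumption of this sketch and is not named in the original text.

```latex
% Log marginal density and its standard variational decomposition (sketch).
\begin{align}
  \ln p(X) &= \ln \int p(X \mid Y)\, p(Y)\, \mathrm{d}Y \\
           &= \int q(Y)\,\ln \frac{p(X \mid Y)\, p(Y)}{q(Y)}\, \mathrm{d}Y
              \;+\; \mathrm{KL}\!\left( q(Y) \,\big\|\, p(Y \mid X) \right),
\end{align}
% so maximizing the first (lower-bound) term over q(Y) drives q(Y) toward the
% posterior p(Y | X), which is how the variational Bayes estimate is obtained.
```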
Compared with the prior art, the invention has the beneficial effects that:
According to the voice data text conversion method based on multi-party communication, the characters converted from the voice data of the multi-party communication are integrated through the key titles and the key character data, and the key titles are determined in a preselected marking mode, so that the problem of insufficient pertinence of voice data conversion in the prior art is solved, and the efficiency of later manual screening after the arrangement is greatly improved.
Detailed Description
Example 1
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, the present invention provides a technical solution:
the invention provides a voice data text conversion method based on multi-party communication, which comprises the following steps:
performing character conversion on voice data communicated by each equipment end in group chat;
the voice data and the text data converted therefrom are stored through the memory, and the converted text data are then displayed in the display frame, referring to fig. 4; the displayed text data facilitate review when meeting records or study notes are arranged at a later stage, which solves the problem that text cannot be extracted for review after a video conference or video learning.
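A minimal sketch of this storage step follows, assuming a local SQLite file as the "memory" and a caller-supplied `transcribe` callable standing in for the character conversion; neither detail is specified in the embodiment.

```python
import sqlite3
from typing import Callable

def store_message(db_path: str, device_mark: str, voice_blob: bytes,
                  transcribe: Callable[[bytes], str]) -> str:
    """Convert one voice clip to text and store both the voice data and the
    converted text, so that both can be shown in the display frame and
    reviewed later. `transcribe` is a placeholder for the character-conversion
    pipeline (e.g. the HNM/GMM method of embodiment 5)."""
    text = transcribe(voice_blob)
    with sqlite3.connect(db_path) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS messages (mark TEXT, voice BLOB, text TEXT)"
        )
        conn.execute(
            "INSERT INTO messages (mark, voice, text) VALUES (?, ?, ?)",
            (device_mark, voice_blob, text),
        )
    return text
```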
Example 2
In order to improve the security of the group chat and prevent non-members from joining, this embodiment differs from embodiment 1 in that a preset password input by each of the multi-party device ends is first identified, and two cases are distinguished:
case 1: if the preset password is correct, the device end is marked, the marks of the device ends are output, and the group chat is constructed according to the marks of the device ends, so that the people in the group chat are distinguished by their marks, the distinguishing being performed by adding a specific label, for example: if the group chat is an enterprise group, the marking mode includes boss and employee; if the group chat is a learning group, the marking mode includes teacher and student, which further improves the recognizability of the members in the group chat;
case 2: if the preset password is incorrect, the input window continues to pop up, and a device end with an incorrect preset password cannot join the group chat, which greatly improves the security of the group chat and solves the problem of non-members joining the group chat.
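The sketch below illustrates the two cases of the password check and the role marking; the role labels, the example password value and the in-memory `group` dictionary are assumptions made for the example.

```python
def join_group_chat(device_id: str, entered_password: str,
                    preset_password: str, role: str, group: dict) -> bool:
    """Case 1: correct password -> mark the device end (e.g. 'boss A', 'employee a1',
    or 'teacher'/'student' for a learning group) and add it to the group chat.
    Case 2: incorrect password -> refuse, so the caller pops up the input window again."""
    if entered_password != preset_password:
        return False  # case 2: device end is not allowed into the group chat
    group[device_id] = f"{role} {device_id}"  # case 1: output the mark of the device end
    return True

# Example: building the group used in embodiment 3 (password value is hypothetical).
group: dict = {}
for dev, role in [("A", "boss"), ("a1", "employee"), ("a2", "employee"), ("a3", "employee")]:
    join_group_chat(dev, "1234", "1234", role, group)
print(group)  # {'A': 'boss A', 'a1': 'employee a1', 'a2': 'employee a2', 'a3': 'employee a3'}
```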
Example 3
In order to improve the pertinence of the voice data conversion, this embodiment differs from embodiment 2 in that the voice data output by a preselected marking device end and the character data converted therefrom are extracted from the memory; key data information of the preselected marking device end is then identified from the extracted character data to form a key title; and the voice data output by the other marking device ends after the key title and before the next key title appears, together with the character data converted therefrom, are then extracted to form key character data;
and the key character data and the key title are integrated, specifically, the key character data are screened according to the key title to obtain valuable character data, and the valuable character data, the corresponding voice data and the device end marks are supplemented into the display frame of the group chat in a mutually corresponding manner.
In addition, the key data information of the preselected marking device end includes key character information, modal particle information and keyword extraction information.
In specific use of this embodiment, taking an enterprise conference as an example, a communication group chat is constructed by means of password input. Assume that the device end set in the group chat is S = (A, a1, a2, a3), which after marking becomes S = (boss A, employee a1, employee a2, employee a3), and that "boss A" is set as the preselected mark. When boss A sends "What questions do you still have about the above?" in the group chat, the interrogative modal word "what" is obtained, so "What questions do you still have about the above?" is determined as a key title; the voice data subsequently output by employees a1, a2 and a3, namely "Question 1: how to improve daily work efficiency", "Question 2: unknown" and "Question 3: how to realize mutual supervision among employees during work", are determined as key character data; the character data that do not conform to the key title, namely "Question 2: unknown", are culled, and the remaining "Question 1", "Question 3" and the key title "What questions do you still have about the above?" are integrated and displayed through the display frame, as shown in fig. 5, wherein:
boss A: "What questions do you still have about the above?";
employee a1: "Question 1: how to improve daily work efficiency";
employee a3: "Question 3: how to realize mutual supervision among employees during work".
Therefore, the characters converted from the voice data of the multi-party communication are integrated through the key titles and the key character data, and the key titles are determined in a preselected marking mode, so that the problem of insufficient pertinence of voice data conversion in the prior art is solved, and the efficiency of later manual screening after the arrangement is greatly improved.
Example 4
In order to improve the accuracy of extracting the key data information, this embodiment differs from embodiment 3 in that a weighted extraction algorithm is adopted for extracting the key data information, the steps of the algorithm being as follows:
sentence segmentation and punctuation are carried out according to the pauses between sounds and the tone of the sounds in the voice data, wherein the punctuation marks include periods, question marks and exclamation marks;
the word frequency, word length, part-of-speech, position and dictionary factors of the character data of the preselected marking device end are quantized using weighting coefficients, and weight calculation is performed after quantization to obtain the total weight of each word;
and the words are sorted in descending order of their total weights to obtain a keyword list, and the key data information is acquired through the keyword list.
Specifically, the total weight of each word is calculated by the following formula:
W(w) = a1·f_tf(w) + a2·f_len(w) + a3·f_pos(w) + a4·f_loc(w) + a5·f_dict(w)
wherein W(w) is the factor total weight of word w in the character data; a1 is the word frequency factor ratio and f_tf(w) is the word frequency factor; a2 is the word length factor ratio and f_len(w) is the word length factor; a3 is the part-of-speech factor ratio and f_pos(w) is the part-of-speech factor; a4 is the position factor ratio and f_loc(w) is the position factor; a5 is the dictionary factor ratio and f_dict(w) is the dictionary factor; and a1 + a2 + a3 + a4 + a5 = 1.
In this embodiment, the ratios are determined by reverse reasoning on a large-scale corpus, preferably with fuzzy processing, and are assigned according to the importance of each factor to the result: a1 is assigned 0.4, a2 is assigned 0.2, a3 and a4 are each assigned 0.15, and a5 is assigned 0.1. A candidate keyword table A is then obtained through the weight calculation, and the generation principle of the primary candidate keywords is as follows:
words whose part of speech is not one of the specified parts of speech (noun, verb, adjective, idiom), and words which do not appear in the title sentence, the head of a paragraph or the tail of a paragraph and whose word frequency is 1, are filtered out.
If the total word number of the article is Total and the number of keywords to be extracted is k, then k should satisfy:
k = Total × 35% when Total × 35% < 20; when Total × 35% ≥ 20, the k keywords extracted through the above two steps are used as the primary candidate keywords, so that the accuracy of the key data information obtained through the weight calculation is greatly improved.
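The following sketch applies the stated filtering rule and keyword count; the `Word` fields, the part-of-speech tags and the behaviour when Total × 35% ≥ 20 (capped at 20 here) are assumptions made for illustration, since the original text only states the < 20 branch explicitly.

```python
from dataclasses import dataclass
from typing import List

ALLOWED_POS = {"noun", "verb", "adjective", "idiom"}

@dataclass
class Word:
    text: str
    pos: str            # part of speech
    freq: int           # word frequency in the article
    in_title: bool      # appears in the title sentence
    in_para_edge: bool  # appears at the head or tail of a paragraph
    weight: float       # total weight W(w) from the weighted calculation

def primary_candidates(words: List[Word], total_words: int) -> List[Word]:
    # Filter: wrong part of speech, or frequency-1 words outside title/paragraph edges.
    kept = [w for w in words
            if w.pos in ALLOWED_POS
            and not (w.freq == 1 and not w.in_title and not w.in_para_edge)]
    k = int(total_words * 0.35)
    if k >= 20:
        k = 20  # assumed cap for the >= 20 branch
    kept.sort(key=lambda w: w.weight, reverse=True)
    return kept[:k]
```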
Example 5
In order to improve the robustness of voice conversion in a data-scarce environment, this embodiment differs from embodiment 1; referring to fig. 2 and fig. 3:
the character conversion comprises the following specific steps:
firstly, extracting audio data output by an equipment end, and then training the audio data by using a Gaussian mixture learning algorithm;
decomposing the source audio output voice with a harmonic-plus-noise model, correcting the decomposed model by using the average fundamental frequency ratio to obtain corrected harmonic amplitude and phase parameters, performing feature extraction on the harmonic amplitude and phase parameters to obtain line spectral frequency (LSF) parameters, mapping the line spectral frequency parameters by using the Gaussian mixture model (a sketch of this mapping step follows this list), and fusing the mapped line spectral frequency parameter features;
and performing mixed output by using the corrected harmonic amplitude and phase parameters, and then extracting the text data of the source audio output voice.
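One common way to realize the "mapping the line spectral frequency parameters by using the Gaussian mixture model" step is joint-density GMM regression, sketched below. This is a generic illustration of that technique, not a verbatim description of the patented correction and fusion steps; it assumes the mixture model was fitted with full covariances on stacked [source; target] LSF vectors (either GaussianMixture or BayesianGaussianMixture exposes the same weights_/means_/covariances_ attributes).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def map_lsf(gmm: GaussianMixture, x: np.ndarray, d: int) -> np.ndarray:
    """Joint-density GMM regression: given a GMM fitted on z = [x; y]
    (source and target LSF vectors of dimension d each, covariance_type='full'),
    predict the target LSF vector for a source LSF vector x."""
    # responsibilities of each mixture component given the source part of z
    resp = np.zeros(gmm.n_components)
    for m in range(gmm.n_components):
        mu_x = gmm.means_[m][:d]
        cov_xx = gmm.covariances_[m][:d, :d]
        diff = x - mu_x
        resp[m] = gmm.weights_[m] * np.exp(
            -0.5 * diff @ np.linalg.solve(cov_xx, diff)
        ) / np.sqrt(np.linalg.det(2 * np.pi * cov_xx))
    resp /= resp.sum()
    # conditional expectation of the target part for each component
    y_hat = np.zeros(d)
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m][:d], gmm.means_[m][d:]
        cov_xx = gmm.covariances_[m][:d, :d]
        cov_yx = gmm.covariances_[m][d:, :d]
        y_hat += resp[m] * (mu_y + cov_yx @ np.linalg.solve(cov_xx, x - mu_x))
    return y_hat
```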
In addition, the Gaussian mixture learning algorithm comprises the following steps:
firstly, training on the source audio output voice and the target audio output voice, and decomposing the corresponding harmonic-plus-noise models;
calculating the average fundamental frequency ratio of the fundamental frequency tracks of the two output voices, and simultaneously performing feature extraction on the harmonic amplitude and phase parameters of the two output voices to obtain the corresponding line spectral frequency parameters;
and performing dynamic time warping on the obtained line spectral frequency parameters, and obtaining the Gaussian mixture model by using a variational Bayes estimation algorithm (a training sketch follows these steps).
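As a concrete stand-in for these steps, the sketch below aligns source and target line spectral frequency sequences with a basic dynamic-time-warping path and fits scikit-learn's BayesianGaussianMixture (a variational Bayes GMM) on the stacked vectors. The 32-component initialization follows the text below; the DTW implementation, frame dimensions and solver settings are assumptions of the sketch.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

def dtw_path(src: np.ndarray, tgt: np.ndarray):
    """Basic dynamic time warping between two LSF sequences (frames x dim);
    returns the list of aligned (source_frame, target_frame) index pairs."""
    n, m = len(src), len(tgt)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(src[i - 1] - tgt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # backtrack the optimal warping path
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def train_vb_gmm(src_lsf: np.ndarray, tgt_lsf: np.ndarray) -> BayesianGaussianMixture:
    """Stack time-aligned source/target LSF vectors into extended vectors z = [x, y]
    and estimate the GMM by variational Bayes (32 components initialized)."""
    pairs = dtw_path(src_lsf, tgt_lsf)
    z = np.array([np.concatenate([src_lsf[i], tgt_lsf[j]]) for i, j in pairs])
    vb_gmm = BayesianGaussianMixture(n_components=32, covariance_type="full",
                                     max_iter=500, random_state=0)
    return vb_gmm.fit(z)
```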
Specifically, when in use, the Gaussian mixture model adopts a VB-GMM algorithm: the line spectral frequency features x of the source audio output voice and the line spectral frequency features y of the target audio output voice are first combined into an extended vector z = [x, y], and the Gaussian mixture model of z is then estimated by the variational Bayes algorithm.
Referring to fig. 6, the horizontal axis is the number of mixture components and the vertical axis is the logarithmic distortion (unit: dB). Specifically: at point (1), the conversion error decreases as the data amount increases, which shows that the more sufficient the data amount, the more sufficient the model training and the better the conversion effect; at point (2), when the data amount is small (training data of fewer than 500 frames), the VB-GMM performs better than the standard GMM shown by the solid line (the error distortion is about 0.5 dB lower), the performance gap between the two decreases as the data amount increases, and when the data amount is relatively sufficient (training data of more than 5000 frames) the two tend to be balanced (the error distortion differs by about 0.23 dB), which is consistent with the theory above that when the data amount tends to infinity the VB-GMM estimation result approximately equals the maximum likelihood estimation result; at point (3), with about 3000 frames of training data, the performance of both reaches a local low point, because the correlation between the 3000 frames of training data and the test data is strong, so the conversion effect is very good.
It is worth noting that the markers of points (1)-(6) refer to the optimal number of mixture components of the two models under a given amount of training data (the standard GMM and the VB-GMM both adopt the same optimal number of mixtures), and this optimal value is obtained automatically by the VB-GMM algorithm (for different data amounts, 32 mixture components are initialized, and the finally obtained number of mixtures is the self-optimization result of the VB-GMM algorithm), so that the "overfitting" problem is avoided and the robustness of voice conversion in a data-scarce environment is improved.
Further, the calculation formula of the variational Bayes estimation algorithm is as follows:
ln p(X) = ln ∫ p(X | Y) p(Y) dY
wherein ln p(X) is the logarithmic marginal density; X is the observed audio variable; Y is the text variable of the source audio output voice; p(Y | X) is the posterior probability of Y for a given X; and p(Y) is the prior probability of Y. Specifically, the marginal density p(X) is estimated by integrating the product of the likelihood p(X | Y) and the prior probability p(Y) over all possible values of Y.
The foregoing shows and describes the general principles, essential features, and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, and the preferred embodiments of the present invention are described in the above embodiments and the description, and are not intended to limit the present invention. The scope of the invention is defined by the appended claims and equivalents thereof.