CN110265027B - Audio transmission method for conference shorthand system - Google Patents
Audio transmission method for conference shorthand system
- Publication number
- CN110265027B CN110265027B CN201910532574.6A CN201910532574A CN110265027B CN 110265027 B CN110265027 B CN 110265027B CN 201910532574 A CN201910532574 A CN 201910532574A CN 110265027 B CN110265027 B CN 110265027B
- Authority
- CN
- China
- Prior art keywords
- audio
- conference
- server
- text
- terminal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/04—Segmentation; Word boundary detection
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/28—Constructional details of speech recognition systems
- G10L15/30—Distributed recognition, e.g. in client-server systems, for mobile phones or network applications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Document Processing Apparatus (AREA)
- Telephonic Communication Services (AREA)
Abstract
The invention discloses an audio transmission method for a conference shorthand system. While a conference is in progress, a conference shorthand terminal collects the conference audio, cuts the audio stream into segments at natural-sentence boundaries, numbers the segments, and sends them to both an ASR server and a collaborative editing server. The ASR server converts the content of each audio segment into primary text, and the collaborative editing server handles audio positioning. Because the conference shorthand terminal cuts the audio stream by natural sentence, the bandwidth occupied during transmission is reduced, transmission is faster, and the ASR server returns text more quickly. Once an audio segment and its corresponding text reach the manual editing terminal, the text can be corrected against the audio, so the dynamically generated conference record can be corrected in real time.
Description
Technical Field
The invention relates to the technical field of voice shorthand, in particular to an audio transmission method for a conference shorthand system.
Background
During a meeting, a recorder documents the proceedings and the specific content of the meeting, from which the meeting record is produced. In the most traditional form, the recorder takes shorthand notes on site and, after the meeting ends, collates and checks those notes against the recording to produce the final meeting record.
With the development of speech recognition (ASR) and natural language processing (NLP) technologies, the audio generated in a conference can be converted into text in real time at the conference site and meeting records generated directly, greatly reducing the recorder's workload.
Speech recognition converts the lexical content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences; natural language processing studies how to achieve effective communication between humans and computers using natural language. Combined, the two can convert human speech into the written form of human language, that is, text. However, this conversion cannot guarantee one hundred percent accuracy; in particular, for terms, personal names, and other words not entered into the system, the system has no way to determine which characters are intended. For example, given the spoken name "Zhang Ziyi", the system can recognize the well-known actress's name and convert it into the correct characters; but given an unfamiliar name with the same pronunciation, the system can only transliterate syllable by syllable using its default character choices, so if the system's default character for a syllable differs from the intended one, the converted text will be wrong. The actual errors are, of course, not limited to this case.
The accuracy of a conventional conference shorthand system is roughly 90-95%, so errors in the text must be corrected. At present, correction is mainly done after the meeting: the recorder collates and checks the meeting record against the recording, so the final draft is produced with some delay and inconvenience. The ideal approach is to correct the text converted from the audio in real time; the technical obstacle is how to correct the text promptly while the audio is still being recorded and the text is still being generated, that is, how to correct dynamically generated text in a timely manner.
Disclosure of Invention
In view of the above problems, the present invention provides an audio transmission method that helps a conference shorthand system correct dynamically generated text in a timely manner.
An audio transmission method for a conference shorthand system: while a conference is in progress, a conference shorthand terminal collects the conference audio, cuts the audio stream into segments at natural-sentence boundaries, numbers the segments, and sends them to both an ASR server and a collaborative editing server; the ASR server converts the content of each audio segment into primary text, and the collaborative editing server handles audio positioning.
Further, when the conference shorthand terminal cuts the audio stream, it records the start time, end time, and audio code of each audio segment and generates a log file; the terminal attaches one or more items of audio characteristic information from the log file to each audio segment and sends them to the ASR server together.
Further, the time interval between cut audio segments is 0.00001 ms, and the duration of each audio segment is limited to within 60 s.
Further, while cutting the audio stream, the conference shorthand terminal also copies the audio stream and sends the copy to the collaborative editing server.
Further, when the conference shorthand terminal detects that the network is interrupted, it stops sending data to the ASR server and temporarily stores the data in memory; when the network reconnects, the data is sent from memory to the ASR server in order.
The beneficial effects of the invention are: 1. the conference shorthand terminal cuts the audio stream by natural sentence, which reduces the bandwidth occupied during audio transmission, speeds up transmission, and speeds up the return of text from the ASR server; once an audio segment and its corresponding text reach the manual editing terminal, the text can be corrected against the audio, so the dynamically generated conference record can be corrected in real time; 2. the audio characteristic information is sent to the ASR server together with the audio segments, making it easy for the collaborative editing server to put audio segments and text into one-to-one correspondence; 3. the handling mechanism for network disconnection solves the problem of transmitting audio and text after the network reconnects.
Drawings
FIG. 1 is a block diagram of a conference shorthand system;
FIG. 2 is a schematic diagram of audio waveforms.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments. The embodiments of the present invention have been presented for purposes of illustration and description, and are not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.
Example 1
The invention discloses an audio transmission method for a conference shorthand system. As shown in FIG. 1, the conference shorthand system is mainly composed of a conference shorthand terminal that records the conference audio, an ASR server that provides the speech recognition service, an NLP server that provides the natural language processing service, a collaborative editing server that provides background support, and a manual editing terminal used to correct the meeting record. The conference shorthand terminal has bidirectional connections to the ASR server, the NLP server, and the collaborative editing server, and the collaborative editing server has a bidirectional connection to the manual editing terminal.
The conference shorthand terminal is an independent device placed at the conference site that records and preprocesses the conference audio; the manual editing terminal is a desktop, laptop, or similar machine running the specific software, that is, software that implements the necessary functions.
The manual editing terminal and the conference shorthand terminal can be in different places; for example, the conference may be held in Beijing while the recorder corrects the meeting record in Shanghai.
The connections among the conference shorthand terminal, the ASR server, the NLP server, and the manual editing terminal can use, but are not limited to, a wired network, a WiFi network, or a 4G network.
In the audio transmission method disclosed in this embodiment, while the conference is in progress, the conference audio is recorded by the conference shorthand terminal, and the audio stream is cut and numbered by natural sentence, with a time interval of 0.00001 ms between cut segments; the cut audio segments are then sent to the ASR server and the collaborative editing server respectively. When the conference shorthand terminal cuts the audio stream, it records the start time, end time, and audio code of each audio segment and generates a log file; the terminal attaches one or more items of audio characteristic information from the log file to each audio segment and sends them to the ASR server together.
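As a rough illustration (not part of the patent), the cutting-and-logging step can be sketched as follows. The sentence-duration input, the use of segment numbers as audio codes, and the log-entry format are all assumptions standing in for the real silence detector and log file:

```python
SEGMENT_GAP_MS = 0.00001   # gap inserted between cut segments (value from the patent)
MAX_SEGMENT_S = 60.0       # force a cut if no pause occurs within 60 s

def cut_stream(sentence_durations, t0=0.0):
    """Cut an audio stream at natural-sentence boundaries.

    `sentence_durations` holds the length in seconds of each detected natural
    sentence (a hypothetical stand-in for real silence detection).  Returns the
    log entries the terminal would write: one dict per segment with its
    number/audio code, start time, and end time.
    """
    log = []
    clock = t0  # seconds since recording start; the patent uses Beijing wall time
    for number, dur in enumerate(sentence_durations, start=1):
        dur = min(dur, MAX_SEGMENT_S)  # forced cut: never exceed 60 s
        entry = {"audio_code": number,
                 "start_time": clock,
                 "end_time": clock + dur}
        log.append(entry)
        clock = entry["end_time"] + SEGMENT_GAP_MS / 1000.0  # 0.00001 ms gap
    return log

log = cut_stream([4.2, 7.5, 65.0])  # third "sentence" exceeds 60 s
assert abs((log[2]["end_time"] - log[2]["start_time"]) - 60.0) < 1e-6
```

Each log entry is exactly the triple the patent names (start time, end time, audio code); a real terminal would serialize these entries into the log file sent alongside the segments.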
The ASR server and the NLP server are both existing third-party servers. The ASR server converts the content of each audio segment into primary text; this conversion is mechanical, and the result contains many wrong characters (mostly homophone errors). The NLP server then automatically corrects the primary text based on the habits of human natural language, producing the secondary text. The secondary text returned by the NLP server to the conference shorthand terminal reaches 90-95% accuracy, but a certain error rate remains.
The collaborative editing server puts the audio segments and the secondary text into one-to-one correspondence according to the log file.
Cutting the audio stream by natural sentence is one of the key points of this embodiment.
A natural sentence in this embodiment is the speech between adjacent pauses, such as "I am a thick and wild voice like the Yellow River" and "it is not loud in the building of the United Nations" in FIG. 2. Cutting the audio stream by natural sentence first ensures the integrity of the audio information and prevents audio data loss. Second, it reduces the bandwidth occupied while sending the audio, so the audio reaches the speech-to-text server quickly and is less likely to stall from network congestion along the way, much as on a congested road bicycles and scooters, and especially pedestrians, can still weave through the gaps between cars; network transmission behaves the same way.
When no audio fluctuation is detected for a period of time, the audio stream is cut, and processing resumes 0.00001 ms later. The interval between audio segments is set to 0.00001 ms to minimize audio loss and cumulative misalignment. For example, suppose every 5 s of audio contains one segment interval: with an interval of 0.1 ms, one hour of audio accumulates a 72 ms deviation on average and four hours accumulate 288 ms; with an interval of 0.00001 ms, one hour accumulates only 0.0072 ms and four hours only 0.0288 ms.
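The deviation figures above follow directly from the number of intervals per hour; a minimal check of the arithmetic, assuming (as the text does) one segment interval per 5 s of audio:

```python
def cumulative_drift_ms(hours, gap_ms, seconds_per_segment=5):
    """Total timing drift contributed by the inter-segment gaps."""
    intervals = hours * 3600 / seconds_per_segment  # e.g. 720 intervals per hour
    return intervals * gap_ms

# Figures quoted in the description:
assert cumulative_drift_ms(1, 0.1) == 72.0          # 1 h at a 0.1 ms gap
assert cumulative_drift_ms(4, 0.1) == 288.0         # 4 h at a 0.1 ms gap
assert round(cumulative_drift_ms(1, 0.00001), 4) == 0.0072
assert round(cumulative_drift_ms(4, 0.00001), 4) == 0.0288
```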
If no sufficiently long pause is detected within 60 s, the audio stream is cut forcibly; this prevents an audio segment from becoming so long that it slows its own transmission and the response of the ASR and NLP servers.
Once the audio stream is cut into an audio segment, that segment is independent of the audio stream still being generated; this means the segment is finished and can be played back to revise its corresponding text.
The second key point of this embodiment is that one or more items of audio characteristic information from the log file are attached to each audio segment and sent to the ASR server together with it.
The start time and end time of an audio segment are based on Beijing time. The start time, end time, and corresponding audio code are information the conference shorthand terminal can obtain during cutting, whereas the text corresponding to an audio segment is the secondary text returned by the NLP server.
Ideally, each segment of audio corresponds to one segment of text, in order; but an audio segment may have no corresponding text at all, for example when a song is played on site. This raises the problem of how to put the secondary text returned by the NLP server into one-to-one correspondence with the audio segments. The method in this embodiment is as follows: if an audio segment has no corresponding text, the conference shorthand terminal marks that segment in the log file; the collaborative editing server pairs audio segments with secondary text according to the log file and skips any marked segment, avoiding misaligned correspondences between text and audio. The conference shorthand terminal learns which audio segments have no corresponding text from the data returned by the ASR server: one or more of the start time, end time, and audio number are fused into characteristic information attached to each audio segment and sent to the ASR server, and the ASR server returns text carrying the same characteristic information, so the terminal can tell which segments produced no text.
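The skip-the-marked-segment rule can be sketched as below. The log-entry format and the `no_text` flag are assumptions, not from the patent, standing in for the mark the terminal writes into the log file:

```python
def align(log_entries, texts):
    """Pair audio segments with secondary texts in order, skipping segments
    the terminal marked as having produced no text (e.g. a song played on site)."""
    pairs, it = [], iter(texts)
    for entry in log_entries:
        if entry.get("no_text"):          # marked in the log file by the terminal
            pairs.append((entry["audio_code"], None))
        else:
            pairs.append((entry["audio_code"], next(it)))
    return pairs

log = [{"audio_code": 1}, {"audio_code": 2, "no_text": True}, {"audio_code": 3}]
assert align(log, ["hello", "world"]) == [(1, "hello"), (2, None), (3, "world")]
```

Without the mark, the text "world" would wrongly attach to segment 2; with it, segment 2 is skipped and every remaining pairing stays correct.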
Numbering the audio segments is the third key point of this embodiment.
During transmission, an audio segment is large and its text is small, so the text often reaches the collaborative editing server earlier than the audio segment; since the two do not arrive at the same time, the collaborative editing server must know which piece of text corresponds to which piece of audio. In this embodiment, the conference shorthand terminal numbers every piece of audio and text, which solves this problem.
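Because text typically arrives before its (much larger) audio segment, the collaborative editing server can key both on the segment number and join them whenever the second half arrives; a sketch under assumed message shapes (the `kind`/`payload` fields are illustrative, not from the patent):

```python
class Matcher:
    """Join text and audio messages arriving in any order, keyed by the
    segment number assigned by the conference shorthand terminal."""
    def __init__(self):
        self.pending = {}   # segment number -> partially filled record

    def receive(self, number, kind, payload):
        rec = self.pending.setdefault(number, {})
        rec[kind] = payload
        if "text" in rec and "audio" in rec:       # both halves have arrived
            return number, self.pending.pop(number)
        return None                                # still waiting for the other half

m = Matcher()
assert m.receive(7, "text", "secondary text") is None   # text arrives first
assert m.receive(7, "audio", b"...") == (7, {"text": "secondary text", "audio": b"..."})
```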
Because the conference shorthand terminal, the ASR server, the NLP server, the collaborative editing server, and the manual editing terminal are all connected over the network, the network may be interrupted during a conference. When the conference shorthand terminal detects a network interruption, it stops sending data to the ASR/NLP server and temporarily stores the data in memory; when the network reconnects, the data is sent from memory to the ASR/NLP server in order. This prevents the ASR/NLP server from receiving a burst of audio data immediately after reconnection, mistaking it for an attack, and closing its connection to the conference shorthand terminal. To guard against disconnection between the conference shorthand terminal and the collaborative editing server, a backup of the conference audio is stored on the collaborative editing server. The backup allows the manual editing terminal to retrieve the conference audio and correct the meeting record after the conference ends, rather than only during it, and it also ensures the manual editing terminal can still obtain the audio when there is a transmission obstacle between the conference shorthand terminal and the collaborative editing server.
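The disconnection handling above amounts to a store-and-forward buffer: queue in memory while offline, then drain in order on reconnect. A minimal sketch, where the `send` callback and `is_connected` check are hypothetical stand-ins for the terminal's real network layer:

```python
from collections import deque

class StoreAndForward:
    """Buffer segments in memory while the network is down and replay them
    in their original order once it comes back."""
    def __init__(self, send, is_connected):
        self.send, self.is_connected = send, is_connected
        self.buffer = deque()

    def push(self, segment):
        if self.is_connected():
            self.flush()            # drain anything queued first, preserving order
            self.send(segment)
        else:
            self.buffer.append(segment)

    def flush(self):
        while self.buffer and self.is_connected():
            self.send(self.buffer.popleft())

sent, online = [], [False]
saf = StoreAndForward(sent.append, lambda: online[0])
saf.push("seg1"); saf.push("seg2")      # network down: both segments buffered
online[0] = True
saf.push("seg3")                        # reconnect: buffered data is sent first
assert sent == ["seg1", "seg2", "seg3"]
```

A production version would also pace the replay, since the text notes that a burst of backlogged audio can be mistaken for an attack.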
It should be understood that the described embodiments are only some embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention, without creative effort, shall fall within the protection scope of the present invention.
Claims (4)
1. The audio transmission method for the conference shorthand system is characterized in that when a conference is carried out, a conference shorthand terminal collects conference audio, cuts and numbers audio streams according to natural sentences, and respectively sends the cut audio segments to an ASR server, an NLP server and a collaborative editing server; the ASR server is used for converting the content of the audio segment into a primary text, the NLP server is used for automatically correcting the primary text according to the natural language based on the habit of the natural language of human and converting the primary text into a secondary text, and the collaborative editing server is used for positioning the audio;
when the conference shorthand terminal cuts the audio stream, recording the start time, the end time and the audio code of each audio segment and generating a log file; the conference shorthand terminal sends one or more audio characteristic information in the log file to the ASR server together with the audio segment; the time interval between cutting audio segments is 0.00001 ms.
2. The audio transmission method of claim 1, wherein the audio segment duration is limited to within 60 s.
3. The audio transmission method according to claim 2, wherein the audio stream is copied and transmitted to the collaborative editing server while the audio stream is cut by the conference stenographic terminal.
4. The audio transmission method according to claim 2, wherein when the conference shorthand terminal detects a network interruption, the sending of the data to the ASR server is stopped, the data is temporarily stored in the memory, and when the network is reconnected, the data is sent to the ASR server in order through the memory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910532574.6A CN110265027B (en) | 2019-06-19 | 2019-06-19 | Audio transmission method for conference shorthand system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110265027A CN110265027A (en) | 2019-09-20 |
CN110265027B true CN110265027B (en) | 2021-07-27 |
Family
ID=67919465
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910532574.6A Active CN110265027B (en) | 2019-06-19 | 2019-06-19 | Audio transmission method for conference shorthand system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110265027B (en) |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101101590A (en) * | 2006-07-04 | 2008-01-09 | 王建波 | Sound and character correspondence relation table generation method and positioning method |
JP2010054991A (en) * | 2008-08-29 | 2010-03-11 | Yamaha Corp | Recording device |
KR100990885B1 (en) * | 2008-12-08 | 2010-11-01 | (주) 소리포스 | Integration System and Method for Shorthand of Multilateral Video Conference |
CN103067793B (en) * | 2011-10-21 | 2016-06-15 | 上海博泰悦臻网络技术服务有限公司 | Offline communications method between the intercommunication device of server and vehicle, vehicle and system |
CN102663143A (en) * | 2012-05-18 | 2012-09-12 | 徐信 | System and method for audio and video speech processing and retrieval |
US10204158B2 (en) * | 2016-03-22 | 2019-02-12 | International Business Machines Corporation | Audio summarization of meetings driven by user participation |
CN105845129A (en) * | 2016-03-25 | 2016-08-10 | 乐视控股(北京)有限公司 | Method and system for dividing sentences in audio and automatic caption generation method and system for video files |
CN106057193A (en) * | 2016-07-13 | 2016-10-26 | 深圳市沃特沃德股份有限公司 | Conference record generation method based on telephone conference and device |
CN109147791A (en) * | 2017-06-16 | 2019-01-04 | 深圳市轻生活科技有限公司 | A kind of shorthand system and method |
CN108074570A (en) * | 2017-12-26 | 2018-05-25 | 安徽声讯信息技术有限公司 | Surface trimming, transmission, the audio recognition method preserved |
- 2019-06-19: application CN201910532574.6A granted as patent CN110265027B (status: Active)
Also Published As
Publication number | Publication date |
---|---|
CN110265027A (en) | 2019-09-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9031839B2 (en) | Conference transcription based on conference data | |
US8041565B1 (en) | Precision speech to text conversion | |
JP4398966B2 (en) | Apparatus, system, method and program for machine translation | |
WO2015096564A1 (en) | On-line voice translation method and device | |
US20120053935A1 (en) | Speech recognition model | |
US9588967B2 (en) | Interpretation apparatus and method | |
US11934796B2 (en) | Voice-based interface for translating utterances between users | |
JP5787780B2 (en) | Transcription support system and transcription support method | |
GB2407657A (en) | Automatic grammar generator comprising phase chunking and morphological variation | |
KR20060050966A (en) | Verb error recovery in speech recognition | |
JP7107229B2 (en) | Information processing device, information processing method, and program | |
CN110265026B (en) | Conference shorthand system and conference shorthand method | |
US20060195318A1 (en) | System for correction of speech recognition results with confidence level indication | |
TW201624468A (en) | Meeting minutes device and method thereof for automatically creating meeting minutes | |
KR20150094419A (en) | Apparatus and method for providing call record | |
CN110264998B (en) | Audio positioning method for conference shorthand system | |
CN110263313B (en) | Man-machine collaborative editing method for conference shorthand | |
CN110265027B (en) | Audio transmission method for conference shorthand system | |
CN109275009B (en) | Method and device for controlling synchronization of audio and text | |
Meteer et al. | Modeling conversational speech for speech recognition | |
US20210225377A1 (en) | Method for transcribing spoken language with real-time gesture-based formatting | |
Wray et al. | Best practices for crowdsourcing dialectal arabic speech transcription | |
CN112053679A (en) | Role separation conference shorthand system and method based on mobile terminal | |
JP4509590B2 (en) | Speech recognition system and program thereof | |
CN101944092B (en) | Word pronunciation system and method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||