CN108647190B - Method, device and system for inserting voice recognition text into script document

Info

Publication number: CN108647190B
Application number: CN201810377094.2A
Authority: CN
Inventors: 卢闪明; 张亚鹏; 李行; 单衍景
Original assignee: BEIJING HUAXIA DENTSU TECHNOLOGY CO LTD
Current assignee: BEIJING HUAXIA DENTSU TECHNOLOGY CO LTD
Priority date: 2018-04-25
Filing date: 2018-04-25
Publication date: 2022-04-29
Anticipated expiration: 2038-04-25
Also published as: CN108647190A

Abstract

The embodiments of the application disclose a method, a device and a system for inserting voice recognition text into a record document. The method comprises: receiving current text recognition information of a target audio substream, the current text recognition information comprising text recognition content, a text recognition state identifier, a role identifier and a text length; and inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information. With this technical scheme, when multiple roles speak simultaneously and the voice recognition server returns real-time recognized text for different roles in interleaved order, the text recognition content is inserted into the record document correctly, in order and separately for each role, regardless of whether it has been confirmed, rather than only once it has reached the confirmed state. Recognized text therefore reaches the document faster, the dynamic insertion effect is more noticeable, and the user experience is greatly improved.

Description

Method, device and system for inserting voice recognition text into script document

Technical Field

The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a system for inserting a speech recognition text into a script document.

Background

With the development of speech recognition technology, it is used more and more widely across industries. For example, if speech recognition is applied to a court trial or a meeting, speech can be converted into text and inserted into the record document in real time, with a different color for each speaker. This greatly reduces the workload of the court or meeting recorder, avoids omissions and recording errors, and can even replace the recorder's work entirely, saving labor.

In the speech recognition process, the recognition server obtains the audio stream of the role currently speaking and generates recognized text for it progressively, by slicing the audio stream repeatedly and analyzing it in combination with the surrounding context and semantics. If the text recognition content in the text recognition information cannot be confirmed, the recognition server repeats recognition processing on the current audio stream until the content is confirmed, and until then the text recognition content is not inserted into the record document. If the speaker talks too fast and pauses only briefly, the recognition server's automatic sentence segmentation goes wrong (the audio streams corresponding to two sentences are treated as one sentence). The recognition server then needs more rounds of comparison and analysis on the current audio stream before it reaches recognized text in the final confirmed state, so the user experience is poor.

Disclosure of Invention

The embodiment of the application aims to provide a method, a device and a system for inserting a voice recognition text into a record document, and solve the technical problem that the existing record document inserting experience is poor.

In order to achieve the above object, an embodiment of the present application provides a method for inserting a speech recognition text into a transcript document, including:

receiving current text identification information of a target audio substream; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

and inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information.

Preferably, the step of inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information includes:

acquiring first text identification information of a first role, acquiring an insertion position of text identification content in the first text identification information through a positioning function, inserting the text identification content in the first text identification information of the first role into a corresponding position, and setting the first role as a line feed role;

acquiring first text identification information of a second role, acquiring an insertion position of text identification content in the first text identification information of the second role by taking a bookmark corresponding to a current line feed role as a reference, inserting the text identification content in the first text identification information of the second role into a corresponding position, updating the line feed role, and taking the second role as the line feed role;

acquiring second text identification information of a first role, if the text identification state identifier in the first text identification information is a non-confirmed identifier, acquiring the insertion position of the text identification content in the second text identification information of the first role by taking a bookmark used when the text identification content in the last text identification information of the first role is inserted as a reference, and inserting the text identification content in the second text identification information of the first role into a corresponding position without updating a line-changing role, wherein the second role is a line-changing role; if the text recognition state identifier in the first text recognition information is a confirmation identifier, obtaining the insertion position of the text recognition content in the second text recognition information of the first role by taking the bookmark corresponding to the current line-feed role as a reference, inserting the text recognition content in the second text recognition information of the first role into the corresponding position, updating the line-feed role, and taking the first role as the line-feed role;

acquiring second text identification information of a second role, if the current line-feed role is the first role and a text identification state identifier in first text identification information of the second role is a confirmation identifier, acquiring an insertion position of the text identification content in the second text identification information of the second role by taking a bookmark used when the text identification content in the second text identification information of the first role is inserted as a reference, inserting the text identification content in the second text identification information of the second role into a corresponding position, and updating the line-feed role; if the current line-feed role is a first role and the text recognition state identifier in the first text recognition information of the second role is a non-confirmed identifier, or the current line-feed role is a second role, the insertion position of the text recognition content in the second text recognition information of the second role is obtained by taking a bookmark used when the text recognition content in the first text recognition information of the second role is inserted as a reference, and the text recognition content in the second text recognition information of the second role is inserted into the corresponding position without updating the line-feed role;

acquiring first text identification information of other roles, taking a bookmark corresponding to the current line-feed role identification as a reference, acquiring insertion positions of text identification contents in the first text identification information of the other roles, inserting the text identification contents in the first text identification information of the other roles into corresponding positions, updating the line-feed role, and taking the other roles as the line-feed role.
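The role-ordering rules above can be read as a small state machine over the line feed role. The following Python is a minimal sketch of that reading, not the patented implementation: the class name `LineFeedTracker`, its attributes, and the choice to return the anchoring role instead of a real document bookmark are all assumptions made for illustration. State values 1 (confirmed) and 0 (non-confirmed) follow the embodiment described later in the text.

```python
from typing import Dict, Optional


class LineFeedTracker:
    """Tracks the line feed role and picks the bookmark to anchor on (hypothetical helper)."""

    def __init__(self) -> None:
        self.last_role: Optional[str] = None   # current line feed role
        self.prev_state: Dict[str, int] = {}   # role -> state identifier of its previous message

    def handle(self, role: str, state: int) -> Optional[str]:
        """Return the role whose bookmark anchors this insertion.

        None means no bookmark exists yet, so the position would come from
        the positioning function (the very first insertion).
        """
        prev = self.prev_state.get(role)
        if prev == 0:
            # The role's previous text is still unconfirmed: anchor on the
            # role's own bookmark and leave the line feed role unchanged.
            anchor = role
        else:
            # First message of this role, or its previous text was confirmed:
            # anchor on the current line feed role, then take over that role.
            anchor = self.last_role
            self.last_role = role
        self.prev_state[role] = state
        return anchor
```

For example, after role A and then role B have each sent a first message, a second message from A anchors on A's own bookmark if A's first text was unconfirmed, and on B's bookmark otherwise.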

Preferably, the step of inserting the text recognition contents in the text recognition information of each character into the corresponding position includes:

for each role, if the text recognition state identifier in the current text recognition information is a non-confirmed identifier and the text recognition identifier in the previous text recognition information is a non-confirmed identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

for each role, if the text recognition state identifier in the current text recognition information is a non-confirmation identifier and the text recognition identifier in the previous text recognition information is a confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document;

for each role, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmed identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

for each role, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the record document.

Preferably, the step of inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the last text recognition information and the text length and the text recognition content in the current text recognition information comprises:

comparing the leading portion of the text recognition content of the current text recognition information, from its starting position up to the text length of the previous text recognition information, with the text recognition content of the previous text recognition information; if they are the same, removing that leading portion from the text recognition content of the current text recognition information and inserting the remaining content after the text recognition content of the previous text recognition information in the record document; and if they are different, deleting the text recognition content of the previous text recognition information from the record document and inserting the text recognition content of the current text recognition information at its position.
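The comparison step above is essentially a prefix check. A minimal Python sketch follows, assuming the text length is counted in characters; the function name `merge_recognized_text` and the `"append"`/`"replace"` action labels are hypothetical and stand in for the actual document operations.

```python
from typing import Tuple


def merge_recognized_text(prev_text: str, prev_len: int, cur_text: str) -> Tuple[str, str]:
    """Decide how the current text updates the previous text of the same role.

    Returns ("append", remainder) when the first prev_len characters of the
    current text equal the previous text, so only the remainder has to be
    inserted after the previous text. Returns ("replace", cur_text) otherwise,
    meaning the previous text is deleted and the current text takes its place.
    """
    if cur_text[:prev_len] == prev_text:
        return "append", cur_text[prev_len:]
    return "replace", cur_text
```

Appending only the remainder keeps the already displayed text untouched, and the full replacement is needed only when re-recognition has changed earlier characters.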

Preferably, the step of inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information further includes:

after inserting the text recognition content in the current text recognition information into the corresponding position, judging the text recognition state identification of the previous text recognition information, if the text recognition state identification of the previous text recognition information is a confirmation identification, removing the shading effect of the text recognition content of the previous text recognition information, inserting the text recognition content in the current text recognition information, and setting the shading effect; and if the text recognition state identifier of the last text recognition information is a non-confirmed identifier, inserting the text recognition content in the current text recognition information, and setting a shading effect.

receiving an audio stream;

segmenting the audio stream to obtain audio substreams;

determining a target audio substream needing to be identified currently according to the text identification state identifier in the previous text identification information;

identifying the target audio substream to obtain current text identification information; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

and sending the current text identification information to a record insertion terminal, so that the text identification content in the current text identification information is inserted into the record document.

Preferably, the step of determining the target audio substream currently to be identified comprises:

if the text recognition state identifier in the previous text recognition information is a non-confirmed identifier, the target audio substream to be recognized currently is the audio substream corresponding to the previous text recognition information;

and if the text recognition state identifier in the last text recognition information is the confirmation identifier, the target audio substream needing to be recognized currently is the next audio substream.
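The selection rule above amounts to staying on the same audio substream until its text is confirmed, then advancing. A minimal Python sketch, assuming substreams are addressed by an integer index and that the state identifier is 1 for confirmed and 0 for non-confirmed (the values used in the embodiment described later); the function names are hypothetical.

```python
from typing import Iterable, List

CONFIRMED = 1
UNCONFIRMED = 0


def next_target_index(current_index: int, prev_state: int) -> int:
    """Pick the substream to recognize next from the previous state identifier."""
    if prev_state == CONFIRMED:
        return current_index + 1   # previous text confirmed: move to the next substream
    return current_index           # not confirmed: recognize the same substream again


def recognition_schedule(returned_states: Iterable[int]) -> List[int]:
    """List the substream index processed in each round, given the state each round returned."""
    index = 0
    targets = []
    for state in returned_states:
        targets.append(index)
        index = next_target_index(index, state)
    return targets
```

With returned states `[0, 0, 1, 0, 1]`, the first substream is processed three times and the second twice.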

In order to achieve the above object, an embodiment of the present application provides an apparatus for inserting a speech recognition text into a transcript document, including:

a receiving unit for receiving current text identification information of a target audio substream; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

and the inserting record unit is used for inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information.

Preferably, the inserting record unit includes:

the first text identification information insertion module of the first role is used for acquiring first text identification information of a first role, acquiring the insertion position of text identification content in the first text identification information through a positioning function, inserting the text identification content in the first text identification information of the first role into a corresponding position, and setting the first role as a line feed role;

the first text identification information insertion module of the second role is used for acquiring first text identification information of the second role, acquiring the insertion position of text identification content in the first text identification information of the second role by taking a bookmark corresponding to the current line feed role as a reference, inserting the text identification content in the first text identification information of the second role into a corresponding position, updating the line feed role and taking the second role as the line feed role;

the second text identification information insertion module of the first role is used for acquiring second text identification information of the first role, if the text identification state identifier in the first text identification information is a non-confirmed identifier, the insertion position of the text identification content in the second text identification information of the first role is acquired by taking a bookmark used when the text identification content in the last text identification information of the first role is inserted as a reference, the text identification content in the second text identification information of the first role is inserted into a corresponding position, a line-changing role does not need to be updated, and the second role is a line-changing role; if the text recognition state identifier in the first text recognition information is a confirmation identifier, obtaining the insertion position of the text recognition content in the second text recognition information of the first role by taking the bookmark corresponding to the current line-feed role as a reference, inserting the text recognition content in the second text recognition information of the first role into the corresponding position, updating the line-feed role, and taking the first role as the line-feed role;

a second text identification information insertion module of a second role, configured to obtain second text identification information of the second role, and if the current line-feed role is the first role and a text identification state identifier in the first text identification information of the second role is a confirmation identifier, obtain an insertion position of text identification content in the second text identification information of the second role with reference to a bookmark used when text identification content in the second text identification information of the first role is inserted, insert the text identification content in the second text identification information of the second role into a corresponding position, and update the line-feed role; if the current line-feed role is a first role and the text recognition state identifier in the first text recognition information of the second role is a non-confirmed identifier, or the current line-feed role is a second role, the insertion position of the text recognition content in the second text recognition information of the second role is obtained by taking a bookmark used when the text recognition content in the first text recognition information of the second role is inserted as a reference, and the text recognition content in the second text recognition information of the second role is inserted into the corresponding position without updating the line-feed role;

and the first text identification information insertion module of other roles is used for acquiring the first text identification information of other roles, acquiring the insertion positions of the text identification contents in the first text identification information of other roles by taking the bookmark corresponding to the current line-feed role identifier as a reference, inserting the text identification contents in the first text identification information of other roles into corresponding positions, updating the line-feed role, and taking other roles as the line-feed role.

a receiving unit for receiving an audio stream;

the segmentation unit is used for segmenting the audio stream to obtain an audio substream;

the target audio substream confirming unit is used for confirming the current target audio substream needing to be identified according to the text identification state identifier in the previous text identification information;

the identification unit is used for identifying the target audio substream to obtain current text identification information; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

and the sending unit is used for sending the current text identification information to a record inserting end to realize that the text identification content in the current text identification information is inserted into the record document.

Therefore, compared with the prior art, with this technical scheme, when multiple roles speak simultaneously and the voice recognition server returns real-time recognized text for different roles in interleaved order, the text recognition content in the text recognition information is inserted into the record document correctly, in order and separately for each role, regardless of whether it has been confirmed, rather than only once it is in the confirmed state. The speed at which recognized text is inserted into the document is improved, the dynamic insertion effect is more noticeable, and the user experience is greatly improved. Moreover, a background shading effect is added dynamically, which broadens the application scenarios of real-time speech recognition text insertion.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram of a system for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a method for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 3 is a second flowchart of a method for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 4 is a functional block diagram of an apparatus for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 5 is a second functional block diagram of an apparatus for inserting a speech recognition text into a transcript document according to an embodiment of the present application;

FIG. 6 is a schematic view of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application shall fall within the scope of protection of the present application.

Fig. 1 is a schematic diagram of a system for inserting a speech recognition text into a record document according to an embodiment of the present application. The system comprises a record insertion terminal and a voice recognition server. The voice recognition server acquires an audio stream from the voice collector, applies noise processing to it, and segments it into a plurality of audio substreams. The voice recognition server performs recognition processing on each audio substream, builds the result into text recognition information, and sends the text recognition information to the record insertion terminal regardless of whether the recognized content of the audio substream is in the unconfirmed state or the confirmed state. If the recognized content of the current audio substream is confirmed, the voice recognition server moves on to the next audio substream; if it is still unconfirmed, the voice recognition server continues to recognize the current audio substream. The record insertion terminal inserts the text recognition content in the text recognition information returned by the voice recognition server into the corresponding position of the record document according to the role identifier and the text recognition state identifier.

In this technical scheme, each time the recognition server returns a recognition result, it also returns a unique identifier that distinguishes each role. The record insertion terminal dynamically creates and maintains a voice content storage unit for each role according to the role identifier; the storage unit stores information such as the voice content and the recognition state identifier. Each time real-time recognized text is received, the record insertion terminal uses the returned role identifier to look up the recognition state identifier in the corresponding role's storage unit, calculates the text insertion position, and distinguishes the roles. In this way the recognized text content is inserted into the correct position of the record document, in the correct order, whether multiple roles speak simultaneously or a single role speaks alone.

In this embodiment, the unconfirmed state of recognized content means that the voice recognition server has performed slicing analysis and other recognition operations on the acquired audio stream to generate text recognition content, but this content is only part of the final text for the current audio substream, and individual fields in it still need to be corrected through re-recognition. The confirmed state of recognized content means that the recognition server has performed slicing analysis and other recognition operations on the acquired audio stream to generate text recognition content and, in combination with contextual semantic analysis, has finally confirmed it as text that needs no further recognition.

Based on the above description, an embodiment of the present application provides a method for inserting a speech recognition text into a record document, as shown in Fig. 2. The method is applied to the record insertion terminal, which may be, for example, a desktop computer, a tablet computer, a notebook computer, a smart phone, a digital assistant, an intelligent wearable device, a shopping guide terminal, a television or another device with a data processing function. Alternatively, the record insertion terminal may be software capable of running on such an electronic device. The method is applied to the situation in which multiple roles speak simultaneously, and may comprise the following steps:

Step 201): receiving current text identification information of a target audio substream; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length.

In this technical scheme, the text recognition content is the voice content recognized in the current target audio substream. The text recognition state identifier indicates whether that content needs to be recognized again. In this embodiment, a text recognition state identifier of 1 means that the voice content recognized in the current audio substream has been finally confirmed, in combination with contextual semantic analysis, as text that needs no further recognition. A text recognition state identifier of 0 means that the recognized voice content is only part of the final text for the current audio substream, and individual fields in it still need to be corrected through re-recognition. The role identifier is set by the recognition server for each role, so that voice content can be attributed to the corresponding role. The text length is the length of the voice content that the recognition server recognized in the current target audio substream.
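The four fields described above map naturally onto a small record type. The following Python sketch is illustrative only: the class name `TextRecognitionInfo` and its field names are assumptions, and the source does not specify a wire format or the unit of the text length. Only the state values 1 and 0 come from this embodiment.

```python
from dataclasses import dataclass

CONFIRMED = 1     # text needs no further recognition
UNCONFIRMED = 0   # text is partial and may still be corrected


@dataclass(frozen=True)
class TextRecognitionInfo:
    """One message from the recognition server (hypothetical field names)."""
    content: str   # text recognition content
    state: int     # text recognition state identifier: 1 or 0
    role_id: str   # role identifier set by the recognition server
    length: int    # text length of the recognized content

    @property
    def is_confirmed(self) -> bool:
        """True when the state identifier marks the content as confirmed."""
        return self.state == CONFIRMED
```

A message such as `TextRecognitionInfo("hello", UNCONFIRMED, "A", 5)` would then describe five characters of still-provisional text spoken by role A.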

In this embodiment, the record insertion terminal has a storage unit on its processor dedicated to storing the text recognition information returned by the voice recognition server. The storage unit is divided into a plurality of storage areas, and the different items of the text recognition information are stored in different areas. The storage unit holds the previous text recognition information. When the record insertion terminal receives the current text recognition information returned by the voice recognition server, it inserts the text recognition content of the current text recognition information into the record document according to both the current and the previous text recognition information, deletes the previous text recognition information from the storage unit, and stores the current text recognition information there. The record insertion terminal also has a memory that stores the result of insertion into the record document; the storage unit described above keeps the content of the previous text recognition information so that the insertion position can be determined accurately when text recognition content is inserted.
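The storage behaviour described above (keep the previous information, use it together with the current information, then overwrite it) can be sketched as a per-role dictionary. This is an assumption-laden illustration: the class `RoleStore`, the method `swap`, and the tuple layout are hypothetical, and keeping one entry per role follows the per-role storage unit mentioned earlier in the description.

```python
from typing import Dict, Optional, Tuple

# (content, state identifier, text length) of one message
StoredInfo = Tuple[str, int, int]


class RoleStore:
    """Keeps only the most recent text recognition information for each role."""

    def __init__(self) -> None:
        self._last: Dict[str, StoredInfo] = {}

    def swap(self, role_id: str, content: str, state: int, length: int) -> Optional[StoredInfo]:
        """Store the current information and return the one it replaces.

        Returns None when this is the first message seen for the role.
        """
        previous = self._last.get(role_id)
        self._last[role_id] = (content, state, length)
        return previous
```

The caller would use the returned previous entry to work out the insertion position before the new entry takes its place.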

Step 202): and inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information.

In this technical scheme, the step of inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information includes:

Specifically, to describe in detail the process of inserting into the record document when multiple roles speak simultaneously, take as an example an application scene with three roles, in which two roles speak simultaneously and the third role speaks at a certain time. The processing flow of the record insertion terminal is as follows:

1. Roles A and B speak for the first time at the same time.

2. The recognition text of role A is returned for the first time, carrying text recognition content Sa1 and text recognition state identifier Ta1. Sa1 is inserted, and A is set as the line-feed role (LastRole = A).

3. The recognition text of role B is returned for the first time, carrying text recognition content Sb1 and text recognition state identifier Tb1. Using the bookmark of the current line-feed role (LastRole = A) as the reference, the insertion position of Sb1 in the document (the line after role A) is calculated, the text of role B is inserted, and the line-feed role is updated to B (LastRole = B).

4. The recognition text of role A is returned for the second time, carrying text recognition content Sa2 and text recognition state identifier Ta2.

4.1 If Ta1 = 0, the insertion position is calculated from the bookmark of role A and the text is inserted; the line-feed role is not updated (LastRole = B).

4.2 If Ta1 = 1, the insertion position of Sa2 in the document (the line after role B) is calculated using the bookmark of the current line-feed role (LastRole = B) as the reference, the text of role A is inserted, and the line-feed role is updated to A (LastRole = A).

5. The recognition text of role B is returned for the second time, carrying text recognition content Sb2 and text recognition state identifier Tb2.

5.1 If LastRole = A and Tb1 = 1, the insertion position of Sb2 is calculated using the bookmark of role A as the reference, and the line-feed role is updated to B (LastRole = B).

5.2 If LastRole = A and Tb1 = 0, the insertion position of Sb2 is calculated using the bookmark of role B as the reference, and the line-feed role is not changed.

5.3 If LastRole = B, the insertion position of Sb2 is calculated using the bookmark of role B as the reference, and the line-feed role is not changed.

6. If a new role C speaks for the first time, then regardless of whether the current line-feed role is A or B, the insertion position of the text returned for role C is calculated using the bookmark of the current line-feed role as the reference, and the line-feed role is updated to C (LastRole = C).
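The line-feed role rules of steps 2 to 6 above can be condensed into one decision function. The following is a minimal sketch under stated assumptions: the function name `choose_anchor` and its return convention are hypothetical, and bookmarks are represented only by the name of the role whose bookmark serves as the reference.

```python
from typing import Optional, Tuple


def choose_anchor(role: str, last_role: Optional[str],
                  prev_state: Optional[int]) -> Tuple[Optional[str], str, bool]:
    """Return (role whose bookmark is the reference, updated LastRole, new line?).

    prev_state is the state identifier of this role's previous message,
    or None if the role has not spoken yet.
    """
    if last_role is None:
        return None, role, True          # step 2: first text in the document
    if role == last_role:
        return role, last_role, False    # step 5.3: continue at own bookmark
    if prev_state == 0:
        return role, last_role, False    # steps 4.1 / 5.2: own unconfirmed text
    return last_role, role, True         # steps 3, 4.2, 5.1, 6: line after LastRole


# Replay an interleaved sequence of (role, state identifier) returns.
last_role: Optional[str] = None
prev_states = {}
trace = []
for role, state in [("A", 0), ("B", 0), ("A", 1), ("B", 1), ("A", 0), ("C", 0)]:
    anchor, last_role, new_line = choose_anchor(role, last_role, prev_states.get(role))
    prev_states[role] = state
    trace.append((role, anchor, last_role, new_line))
```

In the replay, the fifth return (role A after its own confirmed text, while LastRole is B) starts a new line after role B, whereas the third return (role A with unconfirmed previous text) stays at role A's own bookmark.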

In this embodiment, the step of inserting the text recognition content in the text recognition information of each character into the corresponding position includes:

In this embodiment, the step of inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and the text recognition content in the last text recognition information and the text length and the text recognition content in the current text recognition information includes:

Specifically, for each of the multiple roles, the insertion logic is as follows:

1. Role A speaks for the first time. The audio collector captures the speech of role A to obtain an audio stream, and the recognition server segments the audio stream into audio substreams. The recognition server recognizes the first audio substream and returns, for the first time, text recognition content Sa1, text recognition state identifier Ta1 and text length L1. A storage unit corresponding to role A is created, and the text recognition content, text recognition state identifier and text length of the text recognition information are stored in it. Ta1 = 1 indicates that the currently returned text recognition content Sa1 is confirmed text; Ta1 = 0 indicates that Sa1 is unconfirmed text. In either case, Sa1 and Ta1 from the currently returned text recognition information are stored in the storage unit, Sa1 is inserted into the record document, and the terminal waits for the recognition server to return the next text recognition information.

2. The recognition server returns text recognition information again, carrying text recognition content Sa2, text recognition state identifier Ta2 and text length L2. If the text recognition state identifier Ta1 of the previous text recognition information is 1, the current text recognition information was obtained by the recognition server from the current audio substream. If Ta1 is 0, the current text recognition information was obtained by the recognition server from the same audio substream as the previous text recognition information.

2.1 If Ta2 = 0, the recognized text in the text recognition information is unconfirmed. The first L1 characters of the text recognition content Sa2 are compared with the text recognition content Sa1. If they are identical, the remaining portion of Sa2, of length L2-L1 (denoted S21), is obtained by string truncation and appended to the end of the existing text in the record document. If they differ, Sa2 is not truncated and is inserted into the record document in overwrite mode (Sa1 is deleted and Sa2 is inserted in its place). The storage unit is then updated: the text recognition content Sa1, text recognition state identifier Ta1 and text length L1 of the previous text recognition information are deleted, and the text recognition content Sa2, text recognition state identifier Ta2 and text length L2 of the current text recognition information are stored.

2.2 If Ta2 = 1, the recognized text in the text recognition information is confirmed, with text length L2 and text recognition content Sa2. The first L1 characters of Sa2 in the currently received text recognition information are compared with Sa1. If they are identical, the remaining portion of Sa2, of length L2-L1, is obtained by string truncation and appended to the end of the existing text in the record document. If they differ, Sa2 is not truncated and is inserted into the record document in overwrite mode (Sa1 is deleted from the document and Sa2 is inserted in its place). The storage unit is then updated in the same way: Sa1, Ta1 and L1 of the previous text recognition information are deleted, and Sa2, Ta2 and L2 of the current text recognition information are stored.

3. The recognition server returns the third text recognition information, which the record-insertion terminal receives. Regardless of whether the text recognition state identifier Ta3 in the third text recognition information is 1 or 0, if the text recognition state identifier Ta2 in the previous text recognition information is 0, the insertion logic of step 2.1 above is executed. If Ta2 is 1, text processing and insertion start again from step 1 above. Finally, the text recognition content Sa2, text recognition state identifier Ta2 and text length L2 of the previous text recognition information are deleted, and the text recognition content Sa3, text recognition state identifier Ta3 and text length L3 of the current text recognition information are stored in the storage unit.
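The per-role comparison in steps 1 to 3 above can be sketched as follows. This is an illustration only: the function names `plan_insertion` and `apply_to_document` are hypothetical, the text length is taken from `len()` rather than from the returned length field, and the record document is modeled as a plain string whose tail is the role's most recent text.

```python
from typing import Optional, Tuple


def plan_insertion(prev_content: Optional[str], prev_state: Optional[int],
                   cur_content: str) -> Tuple[str, str]:
    """Return (mode, text), where mode is "new", "append" or "overwrite"."""
    if prev_content is None or prev_state == 1:
        # First message, or the previous text was confirmed: start afresh.
        return "new", cur_content
    l1 = len(prev_content)
    if cur_content[:l1] == prev_content:
        # Same prefix: only the tail of length L2-L1 needs to be added.
        return "append", cur_content[l1:]
    # Prefix changed: the previous text must be replaced as a whole.
    return "overwrite", cur_content


def apply_to_document(document: str, prev_content: Optional[str],
                      prev_state: Optional[int], cur_content: str) -> str:
    mode, text = plan_insertion(prev_content, prev_state, cur_content)
    if mode == "overwrite":
        # Assumes the previous unconfirmed text sits at the end of the document.
        document = document[: len(document) - len(prev_content)]
    return document + text


doc = ""
doc = apply_to_document(doc, None, None, "the quick")
doc = apply_to_document(doc, "the quick", 0, "the quick brown")
doc = apply_to_document(doc, "the quick brown", 0, "The quick brown fox.")
doc = apply_to_document(doc, "The quick brown fox.", 1, " It jumps")
```

Appending only the differing tail keeps the edit to the document small when the server merely extends its earlier hypothesis, while the overwrite branch handles the case in which the server revises text it returned earlier.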

When text recognition content is inserted into the record document, a shading effect is added in real time to the new text inserted for each role. Text in the document whose recognition state in the previously returned text recognition content is confirmed is detected and its shading is removed, so that the shading always follows the most recently inserted recognition text. Specifically, the step of inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information further includes:

In this embodiment, the logic flow for inserting recognition text in real time when multiple roles speak simultaneously is as follows:

1. Roles A, B and C speak simultaneously.

2. The processed recognition text of role A/B/C is returned for the first time, carrying text recognition content Sa1/Sb1/Sc1 and role identifier A/B/C.

2.1 If the text recognition state identifier Ta1/Tb1/Tc1 = 0, the insertion position PA1/PB1/PC1 is calculated, according to the role identifier A/B/C, using the positioning function provided by WordAPI. After the text recognition content is inserted, the shading effect is added and the content of the storage unit corresponding to each role is updated.

2.2 If the text recognition state identifier Ta1/Tb1/Tc1 = 1, the operation of step 2.1 is performed, and the insertion flow for the next text recognition information of the corresponding role proceeds directly from step 3.

3. The processed recognition text of role A/B/C is returned for the second time, carrying text recognition content Sa2/Sb2/Sc2 and role identifier A/B/C.

3.1 If the text recognition state identifier Ta2/Tb2/Tc2 = 0, the insertion position PA2/PB2/PC2 is obtained, according to the role identifier A/B/C, from the bookmark of the corresponding role. After the text recognition content is inserted, the shading effect is added and the content of the storage unit corresponding to each role is updated.

3.2 If the text recognition state identifier Ta1 = 1, step 3.1 is executed, and the insertion flow for the next text recognition information of the corresponding role proceeds directly from step 3.

4. The processed recognition text of role A/B/C is returned for the third time, carrying text recognition content Sa3/Sb3/Sc3 and role identifier A/B/C.

If the text recognition state identifier Ta3/Tb3/Tc3 is still 0, the flow of step 3 continues until Ta3/Tb3/Tc3 = 1, at which point insertion of the text of the first utterance of the multiple roles is complete.
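The flow above relies on the role identifier to keep interleaved returns apart. A minimal sketch follows, with several simplifying assumptions that are not taken from the application: the record document is a list of paragraphs, a bookmark is a paragraph index, appending a paragraph stands in for the positioning function of the document API, and every return is assumed to carry the full text recognized so far for the role's utterance. The name `insert_interleaved` is hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (role identifier, text recognition content, text recognition state identifier)
Message = Tuple[str, str, int]


def insert_interleaved(messages: List[Message]) -> Tuple[List[str], Set[str]]:
    paragraphs: List[str] = []       # stand-in for the record document
    bookmarks: Dict[str, int] = {}   # role -> index of that role's paragraph
    done: Set[str] = set()           # roles whose first utterance is confirmed
    for role, content, state in messages:
        if role not in bookmarks:
            # First return for this role: a real terminal would ask the
            # document API for a position; here a paragraph is appended.
            bookmarks[role] = len(paragraphs)
            paragraphs.append("")
        # Later returns are located through the role's own bookmark, so
        # interleaving with other roles cannot misplace the text.
        paragraphs[bookmarks[role]] = f"{role}: {content}"
        if state == 1:
            done.add(role)
    return paragraphs, done


paragraphs, done = insert_interleaved([
    ("A", "good", 0),
    ("B", "I", 0),
    ("C", "yes", 1),
    ("A", "good morning", 1),
    ("B", "I object", 1),
])
```

Although the returns for A, B and C arrive interleaved, each role's text ends up in its own paragraph, in the order in which the roles first spoke.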

On this basis, the logic for adding the shading effect corresponding to the text recognition content of each role when multiple roles speak simultaneously is as follows:

1. Roles A, B and C speak for the first time at the same time.

2. The recognition text of role A is returned for the first time, carrying text recognition content Sa1 and text recognition state identifier Ta1. The text is inserted into the record document and the corresponding shading effect is set.

3. The recognition text of role B is returned for the first time, carrying text recognition content Sb1 and text recognition state identifier Tb1.

3.1 If the text recognition state identifier Ta1 = 0, the text recognition content Sb1 is inserted normally and the shading effect corresponding to Sb1 is set.

3.2 If the text recognition state identifier Ta1 = 1, the shading effect of the text recognition content Sa1 is removed, Sb1 is inserted, and the shading effect corresponding to Sb1 is set.

4. The recognition text of role A is returned for the second time, carrying text recognition content Sa2 and text recognition state identifier Ta2.

4.1 If the text recognition state identifier Tb1 = 0, the text recognition content Sa2 is inserted at the normally calculated position (either appended to the tail of Sa1 with shading added to the appended part, or inserted in full replacement of Sa1 with shading added).

4.2 If the text recognition state identifier Tb1 = 1, the shading effect of the text recognition content Sb1 is removed, and Sa2 is inserted at the normally calculated position (either appended to the tail of Sa1 with shading added to the appended part, or inserted in full replacement of Sa1 with shading added).

Each time the text recognition content of new text recognition information is inserted, the text recognition state identifier of the previous text recognition information is checked. If it is a confirmed identifier, the shading effect of the text recognition content of the previous text recognition information is removed and the text recognition content of the current text recognition information is inserted normally. If it is an unconfirmed identifier, the text recognition content of the current text recognition information is inserted normally and the shading effect of the previous text recognition content does not need to be removed. When the task is closed, the system clears the bookmarks and shading effects corresponding to all roles in the text.
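The shading rule summarized above may be sketched as follows. As a simplification that is not part of the application, shading is tracked per role rather than per text range, and the function name `update_shading` is hypothetical.

```python
from typing import Optional, Set, Tuple


def update_shading(shaded: Set[str], previous: Optional[Tuple[str, int]],
                   current_role: str) -> Set[str]:
    """Return the set of roles whose latest text is shaded after an insertion.

    previous is (role, state identifier) of the message returned just
    before the current one, or None for the very first message.
    """
    result = set(shaded)
    if previous is not None:
        prev_role, prev_state = previous
        if prev_state == 1:
            result.discard(prev_role)   # confirmed text loses its shading
    result.add(current_role)            # newly inserted text is always shaded
    return result


# Replay: A returns unconfirmed text, B returns confirmed text, A returns again.
shaded: Set[str] = set()
previous: Optional[Tuple[str, int]] = None
history = []
for role, state in [("A", 0), ("B", 1), ("A", 0)]:
    shaded = update_shading(shaded, previous, role)
    history.append(set(shaded))
    previous = (role, state)
```

After the third return, the shading of role B's confirmed text has been removed and only role A's newest text remains shaded.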

The embodiment of the application provides another method for inserting a speech recognition text into a record document, as shown in Fig. 3. In this technical solution the method is applied to a voice recognition server. Specifically, the voice recognition server may be an electronic device with data computation, storage and network interaction functions, or software that runs on such an electronic device and supports data processing, storage and network interaction. The number of servers is not particularly limited in this embodiment: the server may be a single server, several servers, or a server cluster formed by several servers. The method for inserting the voice recognition text into the record document may comprise the following steps:

Step 301): receiving an audio stream.

In this embodiment, the voice collector collects voice of a user in an application scene in real time, and performs noise reduction processing on the collected voice to obtain an audio stream.

Step 302): segmenting the audio stream to obtain audio substreams.

In this embodiment, to improve the accuracy of speech recognition, the audio stream fed back by the voice collector is segmented: a long audio stream is divided into a plurality of short audio streams. The amount of audio data processed in each recognition pass is therefore not particularly large, which greatly improves recognition accuracy.

Step 303): determining the target audio substream that currently needs to be recognized according to the text recognition state identifier in the previous text recognition information.

In this technical solution, if the recognition server cannot confirm the recognition result for the audio information currently being recognized, the result is still fed back to the record-insertion terminal and the unconfirmed content is inserted into the record document. The recognition server then recognizes the same audio information again; whether or not the result is confirmed this time, it is again fed back to the record-insertion terminal and inserted into the record document. The next piece of audio information is not recognized until the text recognition information for the current audio information is confirmed. If the recognition result for the audio information currently being recognized is in the confirmed state, the result is fed back to the record-insertion terminal, the confirmed content is inserted into the record document, and the recognition server then recognizes the next piece of audio information.
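The selection rule described above can be illustrated with a short sketch. The function name `next_target_index` and the use of integer indices for the audio substreams are assumptions made for illustration.

```python
from typing import List, Optional


def next_target_index(current_index: int, last_state: Optional[int]) -> int:
    """Index of the audio substream to recognize next."""
    if last_state is None:
        return 0                   # nothing recognized yet: start with the first substream
    if last_state == 1:
        return current_index + 1   # confirmed: move on to the next substream
    return current_index           # unconfirmed: recognize the same substream again


# Replay a sequence of state identifiers returned by successive recognition passes.
states = [0, 0, 1, 1, 0, 1]
index: int = 0
last_state: Optional[int] = None
visited: List[int] = []
for state in states:
    index = next_target_index(index, last_state)
    visited.append(index)
    last_state = state
```

In the replay, the first substream is recognized three times before its result is confirmed, the second is confirmed at once, and the third needs two passes.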

In a conventional technical solution, when the recognition server cannot confirm the recognition result for the current audio information, the result is not fed back to the record-insertion terminal; only once the recognition server has confirmed the result is it fed back to the terminal for insertion. Insertion therefore takes longer in the conventional solution than in the present one, and the user experience suffers considerably. In the present technical solution, recognition information is inserted into the record document in real time whether or not it is confirmed, which improves the user experience. Therefore, in the present technical solution, the step of determining the target audio substream that currently needs to be recognized includes:

Step 304): identifying the target audio substream to obtain current text identification information; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

Step 305): sending the current text recognition information to the record-insertion terminal, so that the text recognition content in the current text recognition information is inserted into the record document.

In this technical solution, the text generated while the recognition server slices, compares, analyzes and computes the audio stream is inserted into the document in real time whether or not it is confirmed, which solves the problem of poor user experience caused by the slow return of recognition text. In addition, during voice recognition, when roles speak one after another, the recognition text returned by the recognition server in real time can be processed and inserted sequentially by role. During a court trial or a conference, however, multiple roles may speak at the same time. In that case the recognition server concurrently slices the audio streams of the roles speaking at the same time and, depending on the speed of concurrent processing, returns the real-time recognition text of the roles in an interleaved manner. If the insertion logic for roles speaking one after another were still used, the inserted text would be out of order and the roles would be confused, the dynamically added shading effect would in turn be disordered, and the resulting transcript or conference document would ultimately be meaningless and useless. For this reason, in this technical solution the recognition server attaches the unique role identifier every time it returns recognition text. Each time real-time recognition text is received, the current text recognition state identifier corresponding to that role is obtained dynamically through the returned role identifier, the text insertion position is calculated from these identifiers, and the roles are distinguished. This solves the problems of disordered text insertion positions, role distinction and shading when multiple roles speak simultaneously.

Fig. 4 is a functional block diagram of an apparatus for inserting a speech recognition text into a record document according to an embodiment of the present application. In practical application, the apparatus is the record-insertion terminal. The apparatus comprises:

a receiving unit 401, configured to receive current text identification information of a target audio substream; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

and an inserting record unit 402, configured to insert corresponding text recognition content into a corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information.

In this embodiment, the inserting record unit includes:

Fig. 5 is a second functional block diagram of an apparatus for inserting a speech recognition text into a record document according to an embodiment of the present application. The device is used for inserting a record terminal in practical application. The apparatus comprises:

a receiving unit 501 for receiving an audio stream;

a segmentation unit 502, configured to segment the audio stream to obtain an audio substream;

a target audio substream confirming unit 503, configured to determine a target audio substream that needs to be identified currently according to the text identification state identifier in the previous text identification information;

an identifying unit 504, configured to identify the target audio substream to obtain current text identification information; the current text information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

a sending unit 505, configured to send the current text recognition information to the record-insertion terminal, so that the text recognition content in the current text recognition information is inserted into the record document.

Fig. 6 is a schematic diagram of an electronic system according to an embodiment of the present application. The electronic device includes: a memory a and a processor b, wherein the memory a stores a computer program, and the computer program realizes the following functions when being executed by the processor b:

In this embodiment, in inserting the corresponding text recognition content into the corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information, the computer program, when executed by the processor b, implements the following functions:

In this embodiment, in inserting the text recognition content in the text recognition information of each role into the corresponding position, the computer program, when executed by the processor b, implements the following functions:

In this embodiment, in inserting the text recognition content of the current text recognition information into the corresponding position of the record document according to the text length and text recognition content of the previous text recognition information and the text length and text recognition content of the current text recognition information, the computer program, when executed by the processor b, implements the following functions:

comparing the portion of the current text recognition content that runs from its start position over the text length of the previous text recognition content with the previous text recognition content; if they are identical, removing that portion from the current text recognition content and inserting the remaining content after the previous text recognition content in the record document; and if they differ, inserting the current text recognition content at the position of the previous text recognition content in the record document and deleting the previous text recognition content.

An embodiment of the present application provides another electronic device, where the electronic device includes: a memory a and a processor b, wherein the memory a stores a computer program, and the computer program realizes the following functions when being executed by the processor b:

receiving an audio stream;

segmenting the audio stream to obtain audio substreams;

In this embodiment, a target audio substream currently to be identified is determined, and the computer program, when executed by the processor b, implements the following functions:

In this embodiment, the Memory includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card).

In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.

The specific functions implemented by the memory and the processor of the electronic device provided in the embodiments of this specification may be understood by reference to the foregoing embodiments of this specification and achieve the technical effects of those embodiments, so they are not described again here.

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor or switch) or an improvement in software (an improvement to a method flow). As technology has advanced, however, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be realized by a hardware physical module. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can readily be obtained simply by programming the method flow into an integrated circuit using one of the hardware description languages above.

Those skilled in the art will also appreciate that, in addition to implementing a client, server as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the client, server are in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a client and a server may be regarded as a hardware component, and a device included therein for implementing various functions may be regarded as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the client, reference may be made to the introduction of the embodiments of the method described above for a comparative explanation.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Although the present application has been described in terms of embodiments, those of ordinary skill in the art will recognize that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations without departing from the spirit of the application.

Claims

1. A method for inserting a voice recognition text into a script document is applied to a terminal for inserting scripts, and comprises the following steps:

receiving current text identification information of a target audio substream; the current text identification information comprises text identification content, a text identification state identifier, a role identifier and a text length;

inserting corresponding text recognition content into a corresponding position of the record document according to the text recognition state identifier and the role identifier of the current text recognition information, wherein the method comprises the following steps:

2. The method of claim 1, wherein the step of inserting the text recognition content in the text recognition information of each role into the corresponding position comprises:

for each role, if the text recognition state identifier in the current text recognition information is a non-confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the script document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

for each role, if the text recognition state identifier in the current text recognition information is a non-confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the script document;

for each role, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a non-confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the script document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information;

for each role, if the text recognition state identifier in the current text recognition information is a confirmation identifier and the text recognition state identifier in the previous text recognition information is a confirmation identifier, inserting the text recognition content of the current text recognition information into the corresponding position of the script document.
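By way of illustration only, the four cases enumerated above can be sketched in Python. The sketch assumes, although the claims as reproduced here do not spell this out, that when the previous result for a role was non-confirmed, its text is overwritten by the current result, with the previous text length used to locate the span to be replaced; when the previous result was confirmed, the current content is simply appended. The names `RecognitionInfo`, `ScriptDocument`, `insert` and `text` are hypothetical, and one text buffer per role stands in for the "corresponding position" of the script document.

```python
from dataclasses import dataclass


@dataclass
class RecognitionInfo:
    """One piece of text recognition information (hypothetical field names)."""
    content: str      # text recognition content
    confirmed: bool   # text recognition state identifier
    role: str         # role identifier
    length: int       # text length


@dataclass
class _RoleState:
    text: str = ""               # text inserted so far for this role
    prev_confirmed: bool = True  # no previous result is treated as confirmed
    prev_length: int = 0         # text length of the previous result


class ScriptDocument:
    """Keeps a separate insertion area per role (an illustrative assumption)."""

    def __init__(self) -> None:
        self._roles: dict[str, _RoleState] = {}

    def insert(self, info: RecognitionInfo) -> None:
        state = self._roles.setdefault(info.role, _RoleState())
        if not state.prev_confirmed:
            # Previous result was non-confirmed: remove its characters so the
            # current result (confirmed or not) takes its place.
            state.text = state.text[: len(state.text) - state.prev_length]
        # Previous result confirmed (or absent): nothing to overwrite.
        state.text += info.content
        state.prev_confirmed = info.confirmed
        state.prev_length = info.length

    def text(self, role: str) -> str:
        return self._roles[role].text


doc = ScriptDocument()
doc.insert(RecognitionInfo("hel", False, "A", 3))     # A: provisional
doc.insert(RecognitionInfo("yes", False, "B", 3))     # B interleaves with A
doc.insert(RecognitionInfo("hello", True, "A", 5))    # A: replaces "hel"
doc.insert(RecognitionInfo(" world", False, "A", 6))  # A: appended after confirmed text
```

Because state is tracked per role identifier, results for different roles may arrive interleaved without disturbing one another's text.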

3. The method of claim 2, wherein the step of inserting the text recognition content of the current text recognition information into the corresponding position of the script document according to the text length and the text recognition content in the previous text recognition information and the text length and the text recognition content in the current text recognition information comprises:

4. The method of claim 1, wherein the step of inserting the corresponding text recognition content into the corresponding position of the script document according to the text recognition state identifier and the role identifier of the current text recognition information further comprises:

5. The method of claim 1, further comprising the following steps, applied to a speech recognition server:

receiving an audio stream;

segmenting the audio stream to obtain audio substreams;

recognizing the target audio substream to obtain the current text recognition information; the current text recognition information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

6. The method of claim 5, wherein the step of determining the target audio substream currently in need of recognition comprises:

7. An apparatus for inserting voice recognition text into a script document, comprising a script insertion terminal, the script insertion terminal comprising:

a receiving unit for receiving current text recognition information of a target audio substream; the current text recognition information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;

a script insertion unit for inserting the corresponding text recognition content into the corresponding position of the script document according to the text recognition state identifier and the role identifier of the current text recognition information;

the script insertion unit comprises:

8. The apparatus of claim 7, further comprising a speech recognition server, the speech recognition server comprising:

a receiving unit for receiving an audio stream;

a recognition unit for recognizing the target audio substream to obtain the current text recognition information; the current text recognition information comprises text recognition content, a text recognition state identifier, a role identifier and a text length;