CN113076747A - Voice recognition recording method based on role recognition - Google Patents

Voice recognition recording method based on role recognition

Info

Publication number
CN113076747A
CN113076747A
Authority
CN
China
Prior art keywords
text
voice
talker
talked
speaking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110346865.3A
Other languages
Chinese (zh)
Inventor
黄星耀
熊倩
王宇骁
王枫
王学春
张志亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing Fengyun Jihui Intelligent Technology Co ltd
Original Assignee
Chongqing Fengyun Jihui Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing Fengyun Jihui Intelligent Technology Co ltd filed Critical Chongqing Fengyun Jihui Intelligent Technology Co ltd
Priority to CN202110346865.3A priority Critical patent/CN113076747A/en
Publication of CN113076747A publication Critical patent/CN113076747A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention relates to the technical field of voice recognition, and in particular to a voice recognition recording method based on role recognition, comprising the following steps: S1, collecting the speech of the interviewer and the interviewee in real time during the conversation; S2, converting the interviewer's speech into a first text and the interviewee's speech into a second text; S3, identifying erroneous entries in the first text and the second text, and replacing them according to preset keyword entries; S4, detecting the voice frequency of the interviewee's speech, and marking the positions in the second text where that frequency falls outside a preset range; and S5, playing back the speech of the interviewer and the interviewee, and proofreading the first text and the second text. The invention solves the technical problem that the prior art cannot recognize whether the interviewee exhibits psychological resistance.

Description

Voice recognition recording method based on role recognition
Technical Field
The invention relates to the technical field of voice recognition, and in particular to a voice recognition recording method based on role recognition.
Background
At present, in grassroots courts, conducting an inquiry while simultaneously taking notes is a heavy workload, and the recording, checking, and rechecking procedures are complex. With the development of speech recognition technology, during a court trial or meeting, speech can be converted into text and inserted into the written record in real time, separated by role. This reduces the workload of the recording personnel and avoids omissions and errors in the record.
For example, Chinese patent CN110751950A discloses a big-data-based police interview speech recognition method and system, in which the method comprises the steps of: configuring the identities of the interviewer and the interviewee and the parameters of the audio acquisition equipment, controlling the audio acquisition equipment, and automatically generating an interview file directory; displaying the recognized text of the interview speech in real time in an interview interface display window, and saving the interview audio file and the text file; analyzing the interview speech through big data, designing a psychological state rating table, analyzing the psychological state index of the interviewee, and generating a big data analysis report; and linking the interview audio file and the text file into the judicial system so that both can be shared.
That technical scheme reduces the workload of the officers involved and improves their efficiency. However, during an interview, the interviewee often exhibits psychological resistance. Psychological resistance means that, during the conversation, the interviewee overtly or covertly rejects the interviewer's analysis, stalls, and resists the interviewer's requests, which hinders the normal progress of the conversation and can even bring it to a standstill. When the interviewee exhibits psychological resistance, the credibility of the interview content drops sharply, and the clerk's attention should be drawn to it. With the prior art, it is impossible to recognize whether the interviewee exhibits psychological resistance.
Disclosure of Invention
The invention provides a voice recognition recording method based on role recognition, which solves the technical problem that the prior art cannot recognize whether the interviewee exhibits psychological resistance.
The basic scheme provided by the invention is as follows: a voice recognition recording method based on role recognition, comprising the following steps:
S1, collecting the speech of the interviewer and the interviewee in real time during the conversation;
S2, converting the interviewer's speech into a first text, and converting the interviewee's speech into a second text;
S3, identifying erroneous entries in the first text and the second text, and replacing them according to preset keyword entries;
S4, detecting the voice frequency of the interviewee's speech, and marking the positions in the second text where that frequency falls outside a preset range;
and S5, playing back the speech of the interviewer and the interviewee, and proofreading the first text and the second text.
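As a rough illustration (not part of the patent), steps S1-S5 above can be sketched as a small processing pipeline. Every name below is hypothetical: `transcribe` stands in for whatever existing speech recognition engine performs S2, and each toy segment carries a precomputed voice frequency rather than raw audio.

```python
# Minimal sketch of the S1-S5 pipeline; all helper names are hypothetical.

def transcribe(segments):
    """Placeholder for an existing speech recognition engine (S2)."""
    return [seg["text"] for seg in segments]

def replace_entries(text, keyword_entries):
    """S3: replace erroneous entries with preset keyword entries."""
    for wrong, right in keyword_entries.items():
        text = text.replace(wrong, right)
    return text

def mark_out_of_range(segments, low=50.0, high=500.0):
    """S4: mark segments whose voice frequency leaves [low, high] Hz."""
    return [dict(seg, marked=not (low <= seg["freq"] <= high)) for seg in segments]

def record_conversation(interviewer_segs, interviewee_segs, keyword_entries):
    first_text = " ".join(transcribe(interviewer_segs))        # S2
    second_text = " ".join(transcribe(interviewee_segs))       # S2
    first_text = replace_entries(first_text, keyword_entries)  # S3
    second_text = replace_entries(second_text, keyword_entries)
    marked = mark_out_of_range(interviewee_segs)               # S4
    return first_text, second_text, marked                     # handed to S5

# Toy data: the interviewee answers at an abnormally low 30 Hz
interviewer = [{"text": "please state your name", "freq": 120.0}]
interviewee = [{"text": "no comment", "freq": 30.0}]
first, second, marked = record_conversation(
    interviewer, interviewee, {"no coment": "no comment"})
```

Running the toy data marks the interviewee's 30 Hz segment, which is how S4 would flag a possible sign of psychological resistance for the clerk's playback in S5.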
The working principle and the advantages of the invention are as follows:
(1) During a conversation, if the interviewee exhibits psychological resistance, it typically manifests as silence (e.g., refusing to answer questions, or pausing for a long time), murmuring (e.g., answering only with short phrases, simple sentences, and stock phrases), or verbosity (e.g., rambling to minimize substantive answers, avoiding certain core questions, or diverting attention). Compared with normal speech, these correspond to a voice frequency that is too low or too high. In this way, the interviewee's psychological resistance can be identified in real time and the corresponding conversation content labeled, prompting the clerk that the credibility of that content may be questionable.
(2) During the proceedings, the speech of the interviewer and the interviewee can be converted in real time, by role, into the first text and the second text, and erroneous entries in the second text can be replaced according to the keyword entries, making it convenient for the clerk to play back the speech and proofread. In this way, the quality of the case record is improved and the workload of the case handlers is reduced.
In summary, the invention can identify the interviewee's psychological resistance in real time and label the corresponding conversation content, thereby alerting case handlers that the credibility of that content may be questionable.
Further, in S1, an array microphone is used to collect the speech of the interviewer and the interviewee during the conversation.
The beneficial effects are: because the microphone adopts an array design, a single microphone can distinguish the two roles, which is safe and stable; at the same time, the array design can separate indistinct speech from noise and extend the voice recognition distance.
Further, in S1, synchronized video of the interviewer and the interviewee during the conversation is also collected; in S5, the first text and the second text are proofread against this synchronized recording.
The beneficial effects are: recording the conversation with full, synchronized video improves judicial credibility; at the same time, proofreading the first text and the second text against the synchronized recording improves the accuracy of the case record.
Further, in S2, the interviewer's speech is converted into the first text and the first text is displayed synchronously; the interviewee's speech is converted into the second text and the second text is displayed synchronously.
The beneficial effects are: displaying the texts while they are being converted facilitates on-site verification and real-time supervision of the conversation.
Further, in S2, the interviewer's speech is converted into the first text and the first text is broadcast synchronously by voice; the interviewee's speech is converted into the second text and the second text is broadcast synchronously by voice.
The beneficial effects are: broadcasting the first text and the second text synchronously by voice makes it convenient to prompt the clerk on site to check them.
Further, in S3, dialect entries in the first text and the second text are identified, and the dialect entries are replaced according to preset Mandarin entries.
The beneficial effects are: replacing dialect with the corresponding Mandarin makes the case record easier for case handlers to read and understand.
Drawings
Fig. 1 is a flowchart of an embodiment of the voice recognition recording method based on role recognition according to the present invention.
Detailed Description
The following is described in further detail through specific embodiments:
Embodiment 1
This embodiment is substantially as shown in Fig. 1 and comprises:
S1, collecting the speech of the interviewer and the interviewee in real time during the conversation;
S2, converting the interviewer's speech into a first text, and converting the interviewee's speech into a second text;
S3, identifying erroneous entries in the first text and the second text, and replacing them according to preset keyword entries;
S4, detecting the voice frequency of the interviewee's speech, and marking the positions in the second text where that frequency falls outside a preset range;
and S5, playing back the speech of the interviewer and the interviewee, and proofreading the first text and the second text.
In this embodiment, a voice recognition server is used, with the following product parameters: the operating system is CentOS 6.7; the CPU is an Intel(R) Xeon(R) D-1521 running at 2.40 GHz, with 4 cores and 8 threads; the memory is 64 GB of DDR4; the hard disk is a 250 GB SSD; the network interface is one Gigabit Ethernet port; there is 1 HDMI output interface; and there is 1 power supply rated at 80 W.
The specific implementation process is as follows:
First, the speech of the interviewer and the interviewee during the conversation is collected in real time. In this embodiment, an array microphone is used to collect the speech of both parties. Specifically, a 4-MEMS array microphone may be used, with the following parameters: a frequency response range of 20 Hz-20 kHz, a signal-to-noise ratio greater than 70 dB, a highest pointing resolution angle of approximately 15 degrees, a USB or 3.5 mm headphone output interface, and a voice recognition range limited to 5 meters. The array microphone serves as both a pickup and a loudspeaker: it can collect audio and play back what it has collected. The array microphone is wireless and can be connected to the workstation, transmitting the collected speech and the text produced by speech recognition to the workstation. It also supports intelligent voice control; for example, a user may wake it by saying "hello, XX", start a recording job by saying "start recording", and end the job by saying "end recording". In addition, the main function of the array microphone is role recognition: during a recording job it can automatically separate the two roles, the interviewer and the interviewee. Through its array design, the microphone can separate indistinct speech from noise; its pickup distance is 5 meters and its effective voice recognition distance is 2 meters.
Then, the interviewer's speech is converted into the first text, and the interviewee's speech is converted into the second text. In this embodiment, existing speech recognition technology can be used for both conversions. In addition, while the conversion is performed, the first text and the second text are displayed synchronously on a display screen, which facilitates on-site verification and real-time supervision of the conversation; they are also broadcast synchronously by voice through a loudspeaker, which makes it convenient to prompt the case handlers on site.
Next, erroneous entries in the first text and the second text are identified and replaced according to preset keyword entries. In this embodiment, the keyword entries may be predefined by the clerk and include place names and person names; when an erroneous entry is identified in the first text or the second text, it is replaced with the corresponding keyword entry. At the same time, dialect entries in the first text and the second text are identified and replaced, in a similar manner, according to preset Mandarin entries, so that dialect is replaced with the corresponding Mandarin and the case record is easier for case handlers to read and understand.
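As an illustration only, the replacement of erroneous and dialect entries can be sketched as a longest-match-first dictionary substitution. The sample entries below are invented for the example; they are not taken from the patent.

```python
# Hypothetical replacement tables; a clerk would predefine these (S3).
keyword_entries = {            # erroneous entry -> preset keyword entry
    "Chong Qing": "Chongqing",     # place name
    "Wang Fen": "Wang Feng",       # person name
}
mandarin_entries = {           # dialect entry -> Mandarin entry
    "gonna": "going to",
}

def replace_entries(text, *tables):
    """Apply each table, longest entry first, so that a short entry
    cannot clobber part of a longer one."""
    for table in tables:
        for wrong in sorted(table, key=len, reverse=True):
            text = text.replace(wrong, table[wrong])
    return text

fixed = replace_entries("Wang Fen is gonna Chong Qing",
                        keyword_entries, mandarin_entries)
```

Here `fixed` becomes "Wang Feng is going to Chongqing"; a real system would match entries against recognized tokens rather than raw substrings.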
Then, the voice frequency of the interviewee's speech is detected, and the positions in the second text where that frequency falls outside the preset range are marked. For example, suppose the voice frequency of normal adult speech ranges from 50 to 500 Hz. If, during the conversation, the interviewee falls silent (refusing to answer questions, or pausing for a long time) or murmurs (answering only with short phrases, simple sentences, and stock phrases), the voice frequency falls below 50 Hz, i.e., too low compared with normal speech; if the interviewee becomes verbose (rambling to minimize substantive answers, avoiding certain core questions, or diverting attention), the voice frequency may exceed 500 Hz, i.e., too high compared with normal speech. In this way, the interviewee's psychological resistance can be identified in real time, and the words spoken during the period of resistance can be labeled, for example shown in bold or in red, to prompt the clerk that the credibility of that part of the conversation may be questionable.
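As a hedged sketch, not the patent's implementation, the frequency check of S4 can be illustrated with a naive autocorrelation pitch estimate plus the 50-500 Hz rule described above; a production system would use a more robust pitch tracker.

```python
import math

def estimate_pitch(samples, sample_rate, fmin=50.0, fmax=500.0):
    """Pick the autocorrelation peak between the lags that correspond
    to fmax (shortest period) and fmin (longest period)."""
    lo = int(sample_rate / fmax)
    hi = int(sample_rate / fmin)
    best_lag, best_score = lo, float("-inf")
    n = len(samples)
    for lag in range(lo, min(hi, n - 1) + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

def is_abnormal(freq, low=50.0, high=500.0):
    """S4 marking rule: flag a segment whose frequency leaves the range."""
    return not (low <= freq <= high)

# Synthetic check: a pure 200 Hz tone sampled at 8 kHz
sr = 8000
tone = [math.sin(2 * math.pi * 200 * t / sr) for t in range(800)]
f0 = estimate_pitch(tone, sr)   # close to 200.0
```

Segments for which `is_abnormal(f0)` is true would be the ones marked in the second text, e.g., rendered in bold or red.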
Finally, the speech of the interviewer and the interviewee is played back, and the first text and the second text are proofread. For example, the clerk reads the first text and the second text while the recorded speech is played, thereby completing the proofreading. In this way, during the proceedings, the speech of the interviewer and the interviewee can be converted in real time, by role, into the first text and the second text, and the case handlers can proofread them, improving the quality of the case record and reducing the workload of the case handlers.
Embodiment 2
The only difference from Embodiment 1 is that, while the speech of the interviewer and the interviewee is collected, synchronized video of both parties during the conversation is also collected; and when the first text and the second text are proofread, both the speech and the synchronized recording are used. For example, the clerk reads the first text and the second text while the recorded speech is played, completing the first round of proofreading; the clerk then reads the two texts again while the synchronized video is played, completing the second round. The two rounds of proofreading improve the accuracy of the case record.
Embodiment 3
The only difference from Embodiment 2 is that, in this embodiment, before the voice frequency of the interviewee's speech is detected, the speech is divided at cut points into a plurality of speech segments. First, it is determined whether a cut point lies in a blank region of the speech, that is, whether there is sound at the position of the cut point: if there is sound, the cut point does not lie in a blank region; if there is no sound, it does. If the cut point lies in a blank region, cutting there cannot lose any of the speaker's voice characteristics, so the speech is cut directly; otherwise, it is not cut directly. Then, if the cut point does not lie in a blank region, it is determined whether the number of speakers (the interviewer and the interviewee) has changed, that is, whether the number of voiceprint features in the speech has changed: if the number of voiceprint features has increased, the number of speakers has increased; if it has decreased, the number of speakers has decreased; in either case the cut point is moved to the position where the number of speakers changes. Conversely, if the number of voiceprint features has not changed, the number of speakers has not changed, and the cut point is not moved. In this way, the segmentation process can be suitably simplified without losing the speaker's voice characteristics.
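The cut-point rules of this embodiment can be sketched on toy data as follows. The frame representation and the silence threshold are assumptions made for illustration: each frame carries an energy value and an active-speaker count that a real system would derive from the signal and from voiceprint features.

```python
# Toy sketch of Embodiment 3's cut-point adjustment; thresholds assumed.
SILENCE = 0.01  # energy below this counts as a blank region

def adjust_cut_point(frames, cut):
    """frames: list of (energy, speaker_count); returns the index to cut at."""
    energy, base_count = frames[cut]
    if energy < SILENCE:
        return cut                    # blank region: cut directly
    # Not blank: move the cut to the nearest change in speaker count.
    for offset in range(1, len(frames)):
        for idx in (cut - offset, cut + offset):
            if 0 <= idx < len(frames) and frames[idx][1] != base_count:
                return idx
    return cut                        # speaker count never changes: keep cut

# One speaker for 10 frames, two speakers for 10, then silence
frames = [(0.5, 1)] * 10 + [(0.6, 2)] * 10 + [(0.0, 0)] * 5
```

On this toy data, a cut requested at frame 22 lands in silence and stays put, while a cut requested at frame 12 is moved next to frame 10, where the speaker count changes, so no speaker's voice characteristics are split mid-utterance.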
The foregoing is merely an embodiment of the present invention. Common general knowledge, such as well-known specific structures and characteristics, is not described here in detail; a person skilled in the art, knowing the state of the art as of the filing date or the priority date, can combine that knowledge with the teachings above, using routine experimentation, to implement the invention, and the known structures and methods pose no impediment to doing so. It should be noted that a person skilled in the art may make several changes and modifications without departing from the structure of the present invention; these shall also fall within the protection scope of the invention and do not affect the effect of its implementation or the practicability of the patent. The scope of protection of this application shall be determined by the contents of the claims, and the detailed description in the specification may be used to interpret the claims.

Claims (6)

1. A voice recognition recording method based on role recognition, characterized by comprising the following steps:
S1, collecting the speech of the interviewer and the interviewee in real time during the conversation;
S2, converting the interviewer's speech into a first text, and converting the interviewee's speech into a second text;
S3, identifying erroneous entries in the first text and the second text, and replacing them according to preset keyword entries;
S4, detecting the voice frequency of the interviewee's speech, and marking the positions in the second text where that frequency falls outside a preset range;
and S5, playing back the speech of the interviewer and the interviewee, and proofreading the first text and the second text.
2. The role recognition-based voice recognition recording method of claim 1, wherein in S1, an array microphone is used to collect the speech of the interviewer and the interviewee during the conversation.
3. The role recognition-based voice recognition recording method of claim 2, wherein in S1, synchronized video of the interviewer and the interviewee during the conversation is also collected; and in S5, the first text and the second text are proofread against the synchronized recording.
4. The role recognition-based voice recognition recording method of claim 3, wherein in S2, the interviewer's speech is converted into the first text and the first text is displayed synchronously; and the interviewee's speech is converted into the second text and the second text is displayed synchronously.
5. The role recognition-based voice recognition recording method of claim 4, wherein in S2, the interviewer's speech is converted into the first text and the first text is broadcast synchronously by voice; and the interviewee's speech is converted into the second text and the second text is broadcast synchronously by voice.
6. The role recognition-based voice recognition recording method of claim 5, wherein in S3, dialect entries in the first text and the second text are further identified and replaced according to preset Mandarin entries.
CN202110346865.3A 2021-03-31 2021-03-31 Voice recognition recording method based on role recognition Withdrawn CN113076747A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110346865.3A CN113076747A (en) 2021-03-31 2021-03-31 Voice recognition recording method based on role recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110346865.3A CN113076747A (en) 2021-03-31 2021-03-31 Voice recognition recording method based on role recognition

Publications (1)

Publication Number Publication Date
CN113076747A true CN113076747A (en) 2021-07-06

Family

ID=76614132

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110346865.3A Withdrawn CN113076747A (en) 2021-03-31 2021-03-31 Voice recognition recording method based on role recognition

Country Status (1)

Country Link
CN (1) CN113076747A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113096669A (en) * 2021-03-31 2021-07-09 重庆风云际会智慧科技有限公司 Voice recognition system based on role recognition
CN113096669B (en) * 2021-03-31 2022-05-27 重庆风云际会智慧科技有限公司 Speech recognition system based on role recognition
CN113542810A (en) * 2021-07-14 2021-10-22 上海眼控科技股份有限公司 Video processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110298252A (en) Meeting summary generation method, device, computer equipment and storage medium
CN207149252U (en) Speech processing system
Morgan et al. The meeting project at ICSI
US20100179811A1 (en) Identifying keyword occurrences in audio data
CN102903361A (en) Instant call translation system and instant call translation method
CN111128223A (en) Text information-based auxiliary speaker separation method and related device
US20210232776A1 (en) Method for recording and outputting conversion between multiple parties using speech recognition technology, and device therefor
CN113076747A (en) Voice recognition recording method based on role recognition
JP2006301223A (en) System and program for speech recognition
JP2020071675A (en) Speech summary generation apparatus, speech summary generation method, and program
CN114449105A (en) Voice-based electric power customer service telephone traffic quality inspection system
CN111402892A (en) Conference recording template generation method based on voice recognition
JP5099211B2 (en) Voice data question utterance extraction program, method and apparatus, and customer inquiry tendency estimation processing program, method and apparatus using voice data question utterance
EP2763136B1 (en) Method and system for obtaining relevant information from a voice communication
US20190121860A1 (en) Conference And Call Center Speech To Text Machine Translation Engine
JP3859612B2 (en) Conference recording and transcription system
CN110460798B (en) Video interview service processing method, device, terminal and storage medium
JP2020071676A (en) Speech summary generation apparatus, speech summary generation method, and program
CN111739536A (en) Audio processing method and device
CN110767233A (en) Voice conversion system and method
Stupakov et al. COSINE-a corpus of multi-party conversational speech in noisy environments
KR20190143116A (en) Talk auto-recording apparatus method
CN101419796A (en) Device and method for automatically splitting speech signal of single character
KR102407055B1 (en) Apparatus and method for measuring dialogue quality index through natural language processing after speech recognition
Wang et al. Fusion of MFCC and IMFCC for Whispered Speech Recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20210706
