CN113628623A - Intelligent voice recognition processing method and system - Google Patents

Info

Publication number
CN113628623A
CN113628623A (application CN202111178759.5A)
Authority
CN
China
Prior art keywords
information
sound
recording
groups
generating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111178759.5A
Other languages
Chinese (zh)
Other versions
CN113628623B (en)
Inventor
周柳阳
蒋林林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yihao Hulian Technology Co ltd
Original Assignee
Shenzhen Yihao Hulian Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Yihao Hulian Technology Co ltd filed Critical Shenzhen Yihao Hulian Technology Co ltd
Priority to CN202111178759.5A priority Critical patent/CN113628623B/en
Publication of CN113628623A publication Critical patent/CN113628623A/en
Application granted granted Critical
Publication of CN113628623B publication Critical patent/CN113628623B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G01S 11/14: Systems for determining distance or velocity not using reflection or reradiation, using ultrasonic, sonic or infrasonic waves
    • G10L 15/20: Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress-induced speech
    • G10L 2015/223: Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention discloses an intelligent voice recognition processing method and system in the field of information monitoring. The method locates nearby living beings using multiple groups of recording information and sensing information and obtains the position of the being that is making sound, so that the recording information can be focused on that position. The focused recording is clearer and more prominent, which improves the success rate and accuracy of speech recognition and conversion and solves the problem that the prior art cannot recognize speech effectively in a noisy environment.

Description

Intelligent voice recognition processing method and system
Technical Field
The invention relates to the field related to information monitoring, in particular to an intelligent voice recognition processing method and system.
Background
With the continuous development and rapid progress of technology, artificial intelligence has gradually matured. Its development has brought control modes quite different from traditional techniques: brain-wave control and eye-movement control, still under research and development, and mature voice control. These new technologies are changing how people live and work.
In the prior art, voice recognition mostly records audio and analyzes the sound track to recognize the speech content, converts the speech into text information, and then extracts control keywords from the text and responds to them, thereby realizing control.
However, this approach suffers from low recognition efficiency: when the recording environment is noisy and several people are speaking at the same time, speech is recognized and converted poorly, so voice control often fails in noisy environments.
Disclosure of Invention
The present invention is directed to an intelligent speech recognition processing method and system that solve the problems noted in the background art.
To achieve this object, the invention provides the following technical solution:
an intelligent speech recognition processing method, comprising:
collecting and generating a plurality of groups of recording information, wherein the recording information comprises sound intensity information, and the plurality of groups of recording information are generated by a plurality of recording devices arranged at intervals;
collecting and generating multiple groups of sensing information, wherein the sensing information comprises biological heat source position information, and the multiple groups of sensing information correspond to the plurality of recording devices arranged at intervals one by one;
focusing the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of groups of object voice contents, wherein the focusing is used for enhancing the sound intensity of a certain object in the recording;
and identifying and converting the multiple groups of object voice contents, responding to the object voice contents according to a preset execution voice instruction set, and generating and outputting response contents.
As a further scheme of the invention: the quantity more than or equal to three groups of recording equipment, the multiunit the recording equipment interval sets up, and the multiunit the recording equipment forms more than or equal to a finite plane.
As a further scheme of the invention: the step of performing focusing processing on the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of groups of object voice contents comprises the following steps:
determining a sound emitting point according to the sound intensity information and the biological heat source information;
and carrying out focusing processing on the recording information according to the sound emitting point, and generating a plurality of groups of object voice contents.
As a further scheme of the invention: the step of determining a sound emission point according to the sound intensity information and the biological heat source information includes:
reading a plurality of biological heat source information in each group of sensing information;
superimposing the biological heat source information from the multiple groups of sensing information to acquire biological position information;
reading the sound intensity information in the recording information and superimposing the multiple groups of sound intensity information to generate sound source range information, wherein the sound intensity information can also be used to generate sound direction information;
and intersecting the sound source range information with the biological position information to generate a plurality of sound emitting points.
As a further scheme of the invention: the step of performing focusing processing on the recording information according to the sound emitting point and generating a plurality of groups of object voice contents includes:
acquiring position information of a plurality of sound emitting points;
and sequentially carrying out focusing processing on the multiple groups of recording information according to the position information of the sound emitting point to generate multiple groups of object voice contents.
As a further scheme of the invention: the focusing processing includes performing overlap enhancement on sound information of a sound emission point and performing cancellation and attenuation on sound information other than the sound emission point.
As a further scheme of the invention: and the step of recognizing the sound information of the object voice content is used for recognizing the sender of the object voice content, responding the object voice content according to a preset execution voice instruction set, and generating and outputting response content.
An intelligent speech recognition processing system, comprising:
the recording acquisition module is used for acquiring and generating recording information, the recording information comprises sound intensity information, the number of the recording acquisition modules is multiple, the recording acquisition modules are arranged at intervals, and each recording acquisition module corresponds to one group of recording information;
the biological acquisition module is used for acquiring and generating sensing information, the sensing information comprises biological heat source position information, the biological acquisition module and the recording acquisition module are arranged in a one-to-one correspondence mode, and each biological acquisition module corresponds to one group of sensing information;
the object confirmation module is used for carrying out focusing processing on the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of groups of object voice contents, and the focusing processing is used for enhancing the sound intensity of a certain object in the recording;
and the voice processing module is used for identifying and converting the plurality of groups of object voice contents, responding to the object voice contents according to a preset execution voice instruction set, and generating and outputting response contents.
As a further scheme of the invention: the number of the recording acquisition modules is more than or equal to three groups, and a plurality of groups of the recording acquisition modules form more than or equal to one limited plane.
As a further scheme of the invention: the object confirmation module includes:
the sound source positioning unit is used for determining the sound emitting point according to the sound intensity information and the biological heat source information;
and the recording focusing unit is used for carrying out focusing processing on the recording information according to the sound emitting point and generating a plurality of groups of object voice contents.
Compared with the prior art, the invention is beneficial in that the described steps allow nearby living beings to be located from multiple groups of recording information and sensing information, yielding the positions of the beings that emit sound. The recording information is then focused on those positions, making it clearer and more prominent. This improves the success rate and accuracy of speech recognition and conversion and solves the prior art's inability to recognize speech effectively in noisy environments.
Drawings
Fig. 1 is an overall flow chart of an intelligent speech recognition processing method.
Fig. 2 is a block diagram of a flow of a step of generating object speech content in an intelligent speech recognition processing method.
Fig. 3 is a block diagram of a flow of steps for generating a voice issue point in an intelligent speech recognition processing method.
Fig. 4 is a block diagram of a flow chart of obtaining the speech content of a speech object according to location information in an intelligent speech recognition processing method.
Fig. 5 is a block diagram showing the structure of an intelligent speech recognition processing system.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Specific embodiments of the present invention are described in detail below.
As shown in fig. 1, an intelligent speech recognition processing method provided for an embodiment of the present invention includes the following steps:
in this embodiment, the present invention is directed to provide an intelligent speech recognition processing method, and compared with the conventional speech recognition, the method can assist in completing processing and highlighting of a target sound by positioning a person who generates the sound through a sound energy direction, so that the sound in a noisy environment can be more accurately and conveniently analyzed and responded, and a success rate of recognizing the sound in the noisy environment is increased.
S200, collecting and generating a plurality of groups of recording information, wherein the recording information comprises sound intensity information, and the plurality of groups of recording information are generated by a plurality of recording devices arranged at intervals.
In this embodiment, multiple groups of recording information are acquired by a plurality of recording devices arranged at intervals. Because each group includes sound intensity information, the range from which a sound suddenly emitted in the environment originates can be roughly located and responded to.
S400, collecting and generating multiple groups of sensing information, wherein the sensing information comprises biological heat source position information, and the multiple groups of sensing information correspond to the plurality of recording devices arranged at intervals one by one.
In this embodiment, multiple groups of sensing information are acquired through a plurality of sensing devices that correspond one-to-one with the recording devices. The sensing information assists in accurately locating the living beings around the devices and determining their specific positions. Once the positions of the surrounding beings are determined, the different sounds in the recording information acquired in step S200 can be attributed to those beings, so that the recording information can be processed further.
S600, performing focusing processing on the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of groups of object voice contents, wherein the focusing processing is used for enhancing the sound intensity of a certain object in the recording.
In this embodiment, the data collected in steps S200 and S400 are processed together. The data from step S400 yield the distribution of living beings around the acquisition devices (including direction and distance relative to each device). The recording information can then be focused according to this distribution: the sound emitted from a given point is superimposed and enhanced across the multiple groups of recordings, making it more prominent and easier to recognize.
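The superposition and enhancement described above amounts to delay-and-sum beamforming across the spaced recording devices. Below is a minimal sketch, assuming the device coordinates and the located emitting point are known in metres and all channels are sampled synchronously; the function name and array layout are illustrative, not taken from the patent.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C


def delay_and_sum(recordings, mic_positions, source_position, sample_rate):
    """Align every channel on the chosen emitting point and average.

    recordings:      (n_mics, n_samples) array, one row per recording device
    mic_positions:   (n_mics, 3) device coordinates in metres
    source_position: (3,) coordinates of the located sound-emitting point
    """
    recordings = np.asarray(recordings, dtype=float)
    mic_positions = np.asarray(mic_positions, dtype=float)
    # Extra propagation delay of each microphone relative to the nearest one.
    distances = np.linalg.norm(mic_positions - source_position, axis=1)
    delays = (distances - distances.min()) / SPEED_OF_SOUND
    shifts = np.round(delays * sample_rate).astype(int)
    n_mics, n_samples = recordings.shape
    aligned = np.zeros_like(recordings)
    for i, s in enumerate(shifts):
        # Advance each channel by its extra delay so copies of the target
        # signal line up; sounds from other points stay misaligned and
        # partially cancel in the average.
        aligned[i, : n_samples - s] = recordings[i, s:]
    return aligned.sum(axis=0) / n_mics
```

When the emitting point is equidistant from all devices the shifts are zero and the output is simply the channel average; for off-centre points the farther channels are advanced before summing.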
And S800, identifying and converting the multiple groups of object voice contents, responding to the object voice contents according to a preset execution voice instruction set, and generating and outputting response contents.
In this embodiment, this step processes the object voice content produced by the superposition and enhancement of step S600: the voice content is recognized against the recognition library, and the recognized and converted content is responded to according to the preset instruction set, response rules, and so on.
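The matching against a preset execution voice instruction set can be sketched as a keyword lookup over the recognized text. The command keywords and action names below are hypothetical placeholders; the patent does not specify the contents of the instruction set.

```python
# Illustrative instruction set: recognized keyword phrase -> response content.
INSTRUCTION_SET = {
    "turn on the light": "light_on",
    "turn off the light": "light_off",
    "play music": "music_play",
}


def respond(recognized_text):
    """Return the response content for the first instruction found in the
    recognized and converted voice content, or None if no instruction
    appears (in which case no response is output)."""
    text = recognized_text.strip().lower()
    for keywords, action in INSTRUCTION_SET.items():
        if keywords in text:
            return action
    return None
```

For example, an utterance recognized as "please turn on the light now" would map to the `light_on` response, while ordinary speech containing no preset keyword produces no response.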
As another preferred embodiment of the present invention, the number of the sound recording devices is greater than or equal to three, the recording devices are arranged at intervals, and together they define a finite plane.
In this embodiment, the number of recording devices, and hence the number of groups of recording information in step S200, is constrained: the recordings must be acquired by several devices spaced a certain distance apart, and the devices must not all lie on one straight line. Only then can a point in space be located and the recording information processed accordingly.
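The requirement that at least three devices define a finite plane is a non-collinearity condition, which can be checked with a cross product. This helper is a sketch for validating a device layout; the function name and tolerance are assumptions, not from the patent.

```python
import numpy as np


def devices_form_a_plane(positions, tol=1e-9):
    """Return True if three or more device positions are not all collinear,
    i.e. they span at least one finite plane as the embodiment requires."""
    pts = np.asarray(positions, dtype=float)
    if len(pts) < 3:
        return False
    base, direction = pts[0], pts[1] - pts[0]
    for p in pts[2:]:
        # A non-zero cross product means p lies off the line through the
        # first two devices, so the set spans a plane.
        if np.linalg.norm(np.cross(direction, p - base)) > tol:
            return True
    return False
```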
As shown in fig. 2, as another preferred embodiment of the present invention, the step of performing a focusing process on the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of sets of target voice contents includes:
s601, determining a sound emitting point according to the sound intensity information and the biological heat source information.
S602, performing focusing processing on the recording information according to the sound emitting point, and generating a plurality of groups of object voice contents.
In this embodiment, step S600 is divided into two parts: locating the sound source and focusing on it according to that location. The sound source is located by first generating a rough range from the collected sound and then refining it with the biosensors, which also filters out some non-biological sound (for example, pre-recorded speech played from a loudspeaker, although the method still cannot separate them when the loudspeaker and a living being overlap in position). The focusing relies on the multiple recording devices: once the position of the emitting point is known, the speech from that point can be processed using the spatially distributed devices so that it is picked up more clearly and prominently.
As shown in fig. 3, as another preferred embodiment of the present invention, the step of determining a sound emitting point according to the sound intensity information and the biological heat source information includes:
s6011, reading a plurality of biological heat source information in each set of the sensing information.
S6012, overlapping a plurality of sets of the sensing information to obtain biological position information.
S6013, reading the sound intensity information in the recording information, and generating sound source range information according to overlapping of multiple sets of the sound intensity information, where the sound intensity information may be used to generate sound direction information.
S6014, intersecting the sound source range information and the biological position information to generate a plurality of sound emitting points.
In this embodiment, the determination of the sound emitting point in step S601 is subdivided further. The spatially distributed biosensors sense the surrounding beings, yielding a set of point locations, each corresponding to one being. A rough range of sound emission is determined from the sound intensity information in the groups of recordings (a sudden change of intensity in some direction suggests that someone may be speaking there). Intersecting that range with the beings' point locations then gives the positions of the sound emitting points, that is, the sounds made by living beings, particularly people.
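The intersection in step S6014 can be sketched as filtering the heat-source point locations by the coarse sound-source range. Modelling the range as a sphere (centre plus radius) is an assumption for illustration; the patent only states that the two sets of information are intersected.

```python
import numpy as np


def sound_emitting_points(heat_sources, range_center, range_radius):
    """Keep only the living-being positions that fall inside the coarse
    sound-source range derived from the superimposed intensity data.

    heat_sources: (n, 3) point locations from the biosensors
    range_center, range_radius: assumed spherical sound-source range
    """
    pts = np.asarray(heat_sources, dtype=float)
    dists = np.linalg.norm(pts - np.asarray(range_center, dtype=float), axis=1)
    return pts[dists <= range_radius]
```

Each surviving point is then handed to the focusing step as one sound emitting point.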
As shown in fig. 4, as another preferred embodiment of the present invention, the step of performing a focusing process on the sound recording information according to the sound emission point and generating a plurality of sets of target voice contents includes:
s6021, acquiring position information of the plurality of sound emission points.
And S6022, sequentially carrying out focusing processing on the multiple groups of recording information according to the position information of the sound emission point to generate multiple groups of target voice contents.
As another preferred embodiment of the present invention, the focusing process includes performing overlap enhancement on sound information of a sound emission point and performing cancel attenuation on sound information other than the sound emission point.
In the present embodiment, step S602 is described simply: the voice content is focused according to the position of the specified sound emitting point, much as a camera focuses by measuring the distance to the object plane.
As another preferred embodiment of the present invention, the method further includes a step of recognizing the sound information of the object voice content, which identifies the sender of the object voice content. When the object voice content is responded to according to the preset execution voice instruction set and response content is generated and output, if several object voice contents require a response at the same time, the responses are executed in order of the senders' preset authority levels.
In this embodiment, the supplementary step takes effect in step S800: the method analyzes the sound to determine who produced it (provided that the sender's information is preset in the library), so that when several instructions arrive at the same time they can be executed in a definite order, that is, with an instruction execution priority.
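Ordering simultaneous instructions by the senders' preset authority levels can be sketched as a stable sort. The authority table below is hypothetical; the patent presets sender authority levels but gives no concrete values.

```python
# Illustrative authority table: lower number = higher priority.
AUTHORITY = {"admin": 0, "parent": 1, "child": 2}


def order_commands(commands):
    """Given (sender, command) pairs received at the same time, return them
    in execution order: highest-authority sender first. Senders not in the
    preset library are handled last, after all known senders."""
    return sorted(commands, key=lambda sc: AUTHORITY.get(sc[0], len(AUTHORITY)))
```

Because Python's sort is stable, two commands from the same sender keep their arrival order.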
As shown in fig. 5, the present invention is directed to an intelligent speech recognition processing system, comprising:
recording collection module 100 for gather and generate recording information, recording information includes sound intensity information, recording collection module 100's quantity is a plurality of, and a plurality of recording collection module 100 intervals set up, every recording collection module 100 all corresponds a set of recording information.
The biological acquisition module 300 is configured to acquire and generate sensing information, where the sensing information includes biological heat source position information, the biological acquisition module 300 and the recording acquisition module 100 are arranged in a one-to-one correspondence manner, and each biological acquisition module corresponds to a group of sensing information.
And the object confirmation module 500 is configured to perform focusing processing on the recording information according to the sound intensity information and the biological heat source position information to generate multiple sets of object voice contents, where the focusing processing is used to enhance the sound intensity of a certain object in the recording.
The voice processing module 700 is configured to perform recognition and conversion processing on multiple sets of the object voice content, respond to the object voice content according to a preset execution voice instruction set, and generate and output response content.
As another preferred embodiment of the present invention, the number of the recording acquisition modules 100 is greater than or equal to three, the multiple recording acquisition modules 100 together define a finite plane, and the biological acquisition modules 300 correspond one-to-one with the recording acquisition modules 100.
As another preferred embodiment of the present invention, the object confirmation module 500 includes:
and the sound source positioning unit is used for determining the sound emitting point according to the sound intensity information and the biological heat source information.
And the recording focusing unit is used for carrying out focusing processing on the recording information according to the sound emitting point and generating a plurality of groups of object voice contents.
It should be understood that, although the steps in the flowcharts of the embodiments of the present invention are shown in a sequence indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise, the order of execution is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of execution is not necessarily sequential; they may be performed in turn or alternately with other steps, or with at least part of the sub-steps or stages of other steps.
It will be understood by those skilled in the art that all or part of the processes of the methods in the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus DRAM (RDRAM), and direct Rambus DRAM (DRDRAM).
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. An intelligent speech recognition processing method, comprising:
collecting and generating a plurality of groups of recording information, wherein the recording information comprises sound intensity information, and the plurality of groups of recording information are generated by a plurality of recording devices arranged at intervals;
collecting and generating multiple groups of sensing information, wherein the sensing information comprises biological heat source position information, and the multiple groups of sensing information correspond to the plurality of recording devices arranged at intervals one by one;
focusing the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of groups of object voice contents, wherein the focusing is used for enhancing the sound intensity of a certain object in the recording;
and identifying and converting the multiple groups of object voice contents, responding to the object voice contents according to a preset execution voice instruction set, and generating and outputting response contents.
2. The intelligent speech recognition processing method of claim 1, wherein the number of the sound recording devices is greater than or equal to three, and the plurality of the sound recording devices define at least one finite plane.
3. The intelligent speech recognition processing method according to claim 1, wherein the step of performing the focus processing on the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of sets of object speech contents comprises:
determining a sound emitting point according to the sound intensity information and the biological heat source information;
and carrying out focusing processing on the recording information according to the sound emitting point, and generating a plurality of groups of object voice contents.
4. The intelligent speech recognition processing method of claim 3, wherein the step of determining a sound emission point based on the sound intensity information and the biological heat source information comprises:
reading a plurality of biological heat source information in each group of sensing information;
superimposing the biological heat source information from the plurality of groups of sensing information to acquire biological position information;
reading the sound intensity information in the recording information and superimposing the plurality of groups of sound intensity information to generate sound source range information, wherein the sound intensity information can also be used to generate sound direction information;
and intersecting the sound source range information with the biological position information to generate a plurality of sound emitting points.
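The intersection step of claim 4 can be sketched under strong simplifications: the superimposed sound intensity information is reduced to a circular sound source range, and any overlapped biological position falling inside that range becomes a sound emitting point. The circular model and all names are assumptions for illustration only:

```python
import math

def sound_emitting_points(bio_positions, source_center, source_radius):
    """Intersect a sound source range (modeled as a circle around an
    estimated center) with the biological heat-source positions; the
    positions inside the range are the candidate sound emitting points."""
    return [
        p for p in bio_positions
        if math.dist(p, source_center) <= source_radius
    ]
```

In practice the range would come from triangulating direction estimates across the microphone groups rather than from a single circle, but the final filtering step is the same intersection.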
5. The intelligent speech recognition processing method according to claim 4, wherein the step of performing focusing processing on the recording information according to the sound emitting points and generating a plurality of groups of object voice contents comprises:
acquiring position information of a plurality of sound emitting points;
and sequentially performing focusing processing on the multiple groups of recording information according to the position information of the sound emitting points to generate a plurality of object voice contents.
6. The intelligent speech recognition processing method according to claim 5, wherein the focusing processing includes performing superposition enhancement on the sound information at a sound emitting point and performing cancellation attenuation on the sound information outside the sound emitting point.
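The superposition enhancement described in claim 6 corresponds, in essence, to delay-and-sum beamforming: each channel is time-aligned on the propagation delay from the focus point, so sound from that point adds coherently while sound from elsewhere adds incoherently and is attenuated. A minimal sketch, assuming known microphone positions and a chosen focus point (function and constant names are illustrative, not from the patent):

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air

def delay_and_sum(signals, mic_positions, focus_point, fs):
    """Minimal delay-and-sum beamformer over equal-length channels.

    signals: list of 1-D arrays, one per microphone
    mic_positions / focus_point: coordinates in meters
    fs: sampling rate in Hz
    """
    focus = np.asarray(focus_point, dtype=float)
    # propagation delay from the focus point to each microphone
    delays = [np.linalg.norm(np.asarray(m, dtype=float) - focus) / SPEED_OF_SOUND
              for m in mic_positions]
    ref = min(delays)
    out = np.zeros_like(signals[0], dtype=float)
    for sig, d in zip(signals, delays):
        shift = int(round((d - ref) * fs))  # samples to advance this channel
        out += np.roll(sig, -shift)
    return out / len(signals)
```

A production beamformer would use fractional-delay filtering instead of `np.roll`, but the sample-shift version already shows the enhancement/cancellation mechanism the claim describes.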
7. The intelligent speech recognition processing method according to claim 1, further comprising a step of recognizing the sound information of the object voice contents to identify the speaker of each object voice content; and when the step of responding to the object voice contents according to the preset execution voice instruction set and generating and outputting response contents is performed, if a plurality of object voice contents requiring a response are present at the same time, the responses are performed sequentially according to a preset speaker authority level.
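The authority-ordered response of claim 7 amounts to sorting the pending contents by the speaker's preset level before answering. The authority table below is a hypothetical example; the claims do not define specific roles or levels:

```python
# Illustrative authority levels: lower number = responded to first.
AUTHORITY = {"admin": 0, "parent": 1, "child": 2}

def response_order(pending):
    """Sort (speaker, speech_content) pairs by preset speaker
    authority level; unknown speakers go last."""
    return sorted(pending, key=lambda item: AUTHORITY.get(item[0], 99))
```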
8. An intelligent speech recognition processing system, comprising:
the recording acquisition module is used for acquiring and generating recording information, the recording information comprises sound intensity information, the number of the recording acquisition modules is multiple, the recording acquisition modules are arranged at intervals, and each recording acquisition module corresponds to one group of recording information;
the biological acquisition module is used for acquiring and generating sensing information, the sensing information comprises biological heat source position information, the biological acquisition module and the recording acquisition module are arranged in a one-to-one correspondence mode, and each biological acquisition module corresponds to one group of sensing information;
the object confirmation module is used for carrying out focusing processing on the recording information according to the sound intensity information and the biological heat source position information to generate a plurality of groups of object voice contents, and the focusing processing is used for enhancing the sound intensity of a certain object in the recording;
and the voice processing module is used for identifying and converting the plurality of groups of object voice contents, responding to the object voice contents according to a preset execution voice instruction set, and generating and outputting response contents.
9. The intelligent speech recognition processing system of claim 8, wherein the recording acquisition modules number at least three groups, and the plurality of groups of recording acquisition modules define at least one bounded plane.
10. The intelligent speech recognition processing system of claim 9, wherein the object confirmation module comprises:
the sound source positioning unit is used for determining the sound emitting point according to the sound intensity information and the biological heat source information;
and the recording focusing unit is used for carrying out focusing processing on the recording information according to the sound emitting point and generating a plurality of groups of object voice contents.
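The four claimed modules can be wired together as a pipeline skeleton. Every class and method name below is an illustrative assumption, since the claims specify the modules' roles rather than an API:

```python
class RecordingAcquisition:
    """Recording acquisition module: produces one group of recording info."""
    def collect(self):
        return {"audio": [], "intensity": []}

class BiologicalAcquisition:
    """Biological acquisition module: produces one group of sensing info."""
    def sense(self):
        return {"heat_sources": []}

class ObjectConfirmation:
    """Object confirmation module: would intersect intensity and
    heat-source data, then focus (beamform) on each emitting point."""
    def focus(self, recordings, sensings):
        return ["object speech content"]  # placeholder result

class SpeechProcessing:
    """Speech processing module: recognizes contents and responds."""
    def respond(self, contents):
        return [f"response to: {c}" for c in contents]

def run_pipeline(recorders, sensors):
    recordings = [r.collect() for r in recorders]
    sensings = [s.sense() for s in sensors]
    contents = ObjectConfirmation().focus(recordings, sensings)
    return SpeechProcessing().respond(contents)
```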
CN202111178759.5A 2021-10-11 2021-10-11 Intelligent voice recognition processing method and system Active CN113628623B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111178759.5A CN113628623B (en) 2021-10-11 2021-10-11 Intelligent voice recognition processing method and system

Publications (2)

Publication Number Publication Date
CN113628623A true CN113628623A (en) 2021-11-09
CN113628623B CN113628623B (en) 2022-02-08

Family

ID=78390976

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111178759.5A Active CN113628623B (en) 2021-10-11 2021-10-11 Intelligent voice recognition processing method and system

Country Status (1)

Country Link
CN (1) CN113628623B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180040320A1 (en) * 2014-07-16 2018-02-08 Panasonic Intellectual Property Corporation Of America Method for controlling speech-recognition text-generation system and method for controlling mobile terminal
CN107871504A (en) * 2016-09-28 2018-04-03 奥的斯电梯公司 For positioning the system and method with acoustics speech interface
CN109637528A (en) * 2017-10-05 2019-04-16 哈曼贝克自动系统股份有限公司 Use the device and method of multiple voice command devices
CN110223686A (en) * 2019-05-31 2019-09-10 联想(北京)有限公司 Audio recognition method, speech recognition equipment and electronic equipment
CN111081234A (en) * 2018-10-18 2020-04-28 珠海格力电器股份有限公司 Voice acquisition method, device, equipment and storage medium
CN112509571A (en) * 2019-08-27 2021-03-16 富士通个人电脑株式会社 Information processing apparatus and recording medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant