Face Recognition with Machine Learning in OpenCV - Fusion of the Results with the Localization Data of an Acoustic Camera for Speaker Identification

Johannes Reschke; Armin Sehr
Department of Electrical Engineering and Information Technology, Ostbayerische Technische Hochschule Regensburg, 93049 Regensburg, Germany
johannesl.reschke@st.oth-regensburg.de; armin.sehr@oth-regensburg.de

Abstract — This contribution gives an overview of face recognition algorithms, their implementation and practical uses. First, a training set of different persons' faces has to be collected and used to train a face recognizer. The resulting face model can be utilized to classify people as specific individuals or unknowns. After tracking the recognized face and estimating the acoustic sound source's position, both results can be combined to give detailed information about possible speakers and whether they are talking or not. This leads to a precise real-time description of the situation, which can be used for further applications, e.g. multi-channel speech enhancement by adaptive beamformers.

Keywords — OpenCV, Sound Source Localization, Machine Learning, Sensor Fusion, Software Engineering and Software Technologies

I. Introduction

In recent times, the interest in and research on acoustic source localization and the enhancement of certain sound sources have increased dramatically due to the growing desire for hands-free interaction with various devices [18]. Combining the ability to locate sound sources and to recognize possible speakers with a camera potentially enables machines to identify speakers. This makes human-machine interaction considerably easier, more adaptive and more reliable. Comparable systems can be used for teleconferencing, smart rooms or ambient assisted living [5, 8].

Combining a microphone array's ability to locate sound sources with the intuitive way of extracting information from a webcam's image, acoustic cameras have become quite popular in many industrial segments. They compute a color-coded sound map and thus visualize the sound pressure levels of a user-defined field of view. This way, acoustic cameras can locate sound sources quite accurately, which is why they are often used to identify unwanted noise sources [12, 31].

Face recognition is a machine learning technique which ideally allows detecting and identifying all faces seen in a picture or a video frame. It can be used for criminal detection, image processing, human-computer interaction, etc. [33, 34]. In the early development of face recognition systems, geometric facial features, e.g. eyes, nose and mouth, were used explicitly. Properties of these features and the relations between them (e.g. positions, distances, angles) were used as descriptors for face recognition [15]. Today, holistic techniques, e.g. principal component analysis (see Eigenfaces) or linear discriminant analysis (see Fisherfaces), are used to identify individuals [3, 13].

II. Basics of an Acoustic Camera

The implementation of an acoustic camera requires a suitable microphone array as well as beamforming algorithms to locate sound sources precisely. Given both, a color-coded sound map of the measured sound pressure level can be computed and displayed as seen in Figure 3 [25, 26].

A. A suitable microphone array

In [26], an appropriate microphone array (see Figure 1) for speaker and sound source localization has been developed and verified.
It can be shown that double ring arrays with an odd number of microphones on each ring are desirable for locating speech sources [20]. In this project, the inner ring has a diameter of 0.2 m, while the outer ring is twice as large.

An important part of an acoustic camera is its sound analysis and visualization software [31], which can, for example, be installed on a personal computer. The microphones are connected to the computer through a microphone amplifier and a multi-channel sound card. In particular, two RME OCTAMIC XTC amplifiers and an RME MADIface USB are used. Each amplifier digitizes the analog signals of up to eight channels completely synchronously. By interconnecting two RME OCTAMICs in series, up to 16 analog signals can be converted to digital values. Thus, in order to read out all signals synchronously, it is necessary to activate a delay compensation in each amplifier [27].

Similar to temporal undersampling, which causes temporal aliasing effects, spatial undersampling can lead to spatial aliasing. This effect can be observed in an acoustic camera's color maps as incorrectly detected sound sources [30]. Multiple approaches are possible to minimize spatial aliasing. In [20] it is shown that an odd number of microphones on each ring of the array reduces redundancy, which results in more robustness against spatial aliasing. Ring arrays in general decrease the redundancy of microphone arrays because, at a given frequency, only a few microphone pairs are affected by spatial aliasing while others are not yet [9]. For building the sensor array, utilizing omnidirectional microphones, such as the selected condenser microphones AKG CK-92, has been shown to be advantageous [8, 25, 26].

Figure 1: Developed double ring array

Using more microphones and distributing them randomly on a plane are two possible improvements of an acoustic camera's microphone array.

B. Steered Response Power Beamforming Algorithm

Beamforming algorithms process signals in such a way that a desired direction is enhanced, while signals from all other directions are attenuated. This chosen direction is called the steering direction, with which a defined plane can be spatially sampled. The beamformer's output, when used in this way, is known as the steered response [12]. As seen in Figure 2, the Steered Response Power algorithm with Phase Transform weighting (SRP-PHAT) calculates a color map by summing certain values of the signal pairs' generalized cross correlations (GCC). It is broadly known that a correlation yields a signal's power; thus, the steered response outputs a power, which is why the described method is known as a steered response power beamformer [11, 12].

Figure 2: Schematic diagram of the SRP-PHAT algorithm, according to [17]

The mentioned GCC is similar to a regular cross correlation, with the only difference that weighted input signals are used. For the PHAT weighting, the Fourier-transformed signal $X_1(\omega)$ and the complex conjugate of $X_2(\omega)$ are used as seen in (1). TDOA stands for time difference of arrival and denotes the estimated time difference between two sensors' signals. The TDOAs of a single microphone pair differ with the steering direction [11, 25].

$$\Psi_{\mathrm{PHAT}}(\omega) = \frac{1}{\left| X_1(\omega)\, X_2^*(\omega) \right|} \quad (1)$$
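To make the PHAT weighting in (1) concrete, the following Python sketch computes the GCC-PHAT between two microphone signals and derives a TDOA estimate from it. This is a minimal illustration under stated assumptions (numpy, a single signal pair, an illustrative function name and sampling rate), not the implementation of the system described in this contribution.

import numpy as np

def gcc_phat(x1, x2, fs, max_tau=None):
    """GCC-PHAT of two microphone signals; returns (TDOA in s, correlation)."""
    n = len(x1) + len(x2)                # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)                 # cross power spectrum
    R /= np.abs(R) + 1e-12               # PHAT weighting, cf. (1); eps avoids /0
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs   # TDOA estimate
    return tau, cc

A steered response map as in Figure 2 would then be obtained by summing, for every steering direction, the GCC values at the TDOAs that this direction implies for each microphone pair.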
In [25] it has been shown that the best sound source localization results can be achieved by combining the SRP-PHAT algorithm with a constantly weighted SRP beamformer. This combination is beneficial because the SRP-PHAT algorithm, while very robust against reverberation and sensor self-noise, is unable to process narrowband signals. The SRP method is not as robust as the SRP-PHAT algorithm, but, in contrast, it is able to locate narrowband signals such as sine waves or spoken vowels. A combination of both is implemented by utilizing a threshold for the signal's bandwidth. In this application, the bandwidth threshold is set to 4 kHz, which is approximately an eighth of the chosen sampling rate. A typical output image of an acoustic camera can be seen in Figure 3.

Figure 3: Resulting output image of an acoustic camera

III. Face Detection

An ideal face detection system should be able to detect all faces shown in a picture or a video frame. For this task, it should neither matter in which position or orientation the faces are, nor which age, sex or ethnic origin the people to be classified belong to. Furthermore, an ideal face detection system should be insensitive to lighting changes or other external influences [16].

In OpenCV, a face library is implemented which provides pre-trained face detectors as well as the possibility to train one's own classifiers [2, 22]. Pre-trained classifiers for Haar-like and local binary pattern features support frontal face, facial landmark and whole-person detection. If a self-trained classifier is used, several thousand pictures of faces and non-faces should be collected. A good training set considers faces with differences in age, sex, ethnic origin, facial hair, lighting and hairstyle [6, 13, 21]. Because of the complexity of the training process, only pre-trained classifiers for frontal faces are utilized in this contribution.

When the face detection algorithms described below, which utilize Haar-like or local binary pattern features, are used, one facial image often yields multiple detections. If these detected faces are located close to each other in a specific area, they are averaged in size and position to merge them into one detection result. This avoids multiple detections for a single person and reduces false positive rates [2, 28].

A. Haar Cascade Classifier

The Haar cascade classifier is a quite simple face detection method and is therefore a very good basis for more complex algorithms. With a sufficiently large dataset, many different objects can be trained, e.g. faces, cars or whole persons. In order to classify images, Haar-like features (see Figure 5) are used, which can be calculated extremely efficiently with integral images. Thus, regional knowledge can be taken into account. As is very common, the face detector introduced in [35] can only handle grayscale images [6, 15, 35].

Figure 5: First two Haar-like features, according to [35]

In pictures with a resolution of 24x24 pixels, more than 180,000 different Haar-like features can be found. Using a machine learning (ML) algorithm, the 6,061 most important features can be chosen and organized in a cascade structure. Training a chosen feature $f_j$ results in a threshold $\theta_j$ and a parity $p_j$. With a feature's value being the sum of all pixel values in the black blocks subtracted from the sum of the pixel values in the white blocks, the weak classifier can be described as [35]

$$h_j(x) = \begin{cases} 1 & \text{if } p_j f_j(x) < p_j \theta_j \\ 0 & \text{otherwise} \end{cases} \quad (2)$$

Each stage of the cascade is trained with face and non-face images. The number of weak classifiers per stage is determined by a defined false positive rate, which has to be achieved in each stage. Thus, in the first few stages, only a few features are necessary to reach this rate, while at the very last stage, very many are needed. Stages are added as long as a total false positive rate is met [35]. Using local binary patterns, more sophisticated Haar-like features or additional non-frontal face detectors, higher detection accuracies can be achieved [13, 28, 35].

Both the Haar cascade and the local binary pattern classifier are implemented as cascaded classifiers to quickly reject non-faces but still keep a high accuracy for positive results (see Figure 4).

Figure 4: Schematic description of the detection cascade, according to [35]
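As a minimal illustration of how such a pre-trained cascade is applied in OpenCV, the following Python sketch detects frontal faces in a single frame. The file names and parameter values are illustrative assumptions rather than the settings of the system described here; the merging of overlapping detections mentioned above corresponds to the minNeighbors grouping step.

import cv2

# Pre-trained frontal-face Haar cascade shipped with OpenCV (path is an example).
face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

img = cv2.imread("frame.jpg")                  # example input frame
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)   # the detector expects grayscale

# detectMultiScale scans the image at several scales; minNeighbors groups
# overlapping raw detections into one result per face.
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1,
                                      minNeighbors=5, minSize=(30, 30))
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)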

B. Local Binary Pattern Classifier

Local binary patterns (LBPs) describe local relationships between neighboring pixels in a 3x3 environment. Starting in the top left corner and proceeding clockwise, the pixels' grayscale values are compared to the center pixel's. If the value of the center pixel at (n, m) is bigger than the neighbor's value, the result is 0, and 1 otherwise. These binary values can be concatenated and converted into a grayscale value of 0...255. Formally, with $i_p$ denoting the $p$-th neighbor's value and $s(x) = 1$ for $x \ge 0$ and $0$ otherwise, this can be written as [16, 28]

$$LBP(n, m) = \sum_{p=0}^{7} s\big(i_p - i(n, m)\big) \cdot 2^p \quad (3)$$

Figure 6: LBPs for points (A, B), lines (C), edges (D) and corners (E), according to [23]

Similar to the stage structure of a Haar cascade classifier, a weak classifier is formed by using a gray value histogram. With $H_n(x)$ being the classifier of the $n$-th stage and $h_{(n,m)}(x)$ being the histogram value for entry $x$, the weak classifier can be described as in (4).

$$H_n(x) = \sum_{(n,m) \in W_n} h_{(n,m)}(x) \quad (4)$$

The LBP classifier is not only faster than the Haar cascade classifier, but theoretically even more precise [3, 28].

IV. Face Recognition

Even though face recognition is a much more challenging task than face detection, today's face recognition systems are, at least under optimal conditions, very reliable [2, 14]. Thus, many of today's applications use these methods to identify people in images, e.g. Facebook's Gallery or Apple iPhoto [32]. When classifying, problems mostly occur due to variations in lighting, perspective or facial expression [14, 32]. Furthermore, similar-looking individuals, e.g. father and son or twins, can cause uncertainties when distinguishing them [15].

In order to achieve a higher face recognition accuracy, different approaches are conceivable. In general, it is recommended to use large datasets with many variations in pose, age and lighting conditions for training the model [3, 13, 32]. Another possibility to improve the recognition performance is to use infrared lighting to avoid shadows or other disruptions. Furthermore, additional features which only occur under invisible light, e.g. freckles and pigmentation, can be used to recognize faces [15].

In OpenCV, face recognizers using principal component analysis (Eigenfaces), linear discriminant analysis (Fisherfaces) and local binary pattern histograms are implemented [23]. Utilizing any of these methods, the face recognizer has to be trained with face images to differentiate between individuals. The classification is done by comparing the images' features in a high-dimensional feature space with a K-nearest-neighbor algorithm [6].
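A minimal sketch of how these three recognizers can be created and trained with OpenCV's contributed face module is given below. The file names, labels and parameter values are illustrative assumptions; in particular, num_components=30 anticipates the component count found suitable in Section V rather than an OpenCV default.

import cv2
import numpy as np

# Placeholder training data: equally sized grayscale face images and one
# integer label per person (a real set would hold many images per person).
paths  = ["person0_img0.png", "person0_img1.png", "person1_img0.png"]
images = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in paths]
labels = np.array([0, 0, 1], dtype=np.int32)

# The three recognizers discussed in this section (opencv-contrib required).
eigen  = cv2.face.EigenFaceRecognizer_create(num_components=30)
fisher = cv2.face.FisherFaceRecognizer_create()
lbph   = cv2.face.LBPHFaceRecognizer_create(radius=1, neighbors=8)

eigen.train(images, labels)
# predict() returns the nearest neighbor's label and a distance-based
# confidence value in the reduced feature space.
label, confidence = eigen.predict(images[0])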
A. Eigenface Classifier

The Eigenface classifier uses a principal component analysis (PCA) to reduce the dimensionality of the images. Utilizing a PCA, the E eigenvectors with the highest eigenvalues can be selected to describe the given dataset. These eigenvectors span a comparatively low-dimensional face space into which every image can be projected. Because the PCA's eigenvectors, after reshaping them into an image format, look very much like faces (see Figure 7), they are called Eigenfaces [4, 34].

Figure 7: First 20 Eigenfaces of the AT&T face dataset, according to [23]

Using the extracted Eigenfaces, unknown faces can be reconstructed (see Figure 8). This gives a good feeling for how many principal components are necessary to distinguish between individuals. Usually a number of 40 to 80 should be sufficient, but, depending on the dataset, sometimes up to 300 Eigenfaces should be used [13, 23, 36]. Figure 8 shows that the original face can be recognized starting at 20 Eigenfaces.

In order to calculate a dataset's ($\Gamma_1, \Gamma_2, \Gamma_3, \dots, \Gamma_R$) eigenvectors efficiently, its vectorized mean image $\Psi$ and the differences $\Phi_i = \Gamma_i - \Psi$ are needed. Given a face matrix $A = [\Phi_1\ \Phi_2\ \dots\ \Phi_R]$, the covariance matrix of all face images can be calculated as [34]

$$C = \frac{1}{R} \sum_{r=1}^{R} \Phi_r \Phi_r^T = A A^T \quad (5)$$

This matrix C has a dimension of $N^2 \times N^2$ (for face images with a resolution of $N \times N$), which means that $N^2$ eigenvectors $u_i$ would have to be determined. This requires high computational resources and is thus unsuitable for real-time applications. For the common case that $R \ll N^2$, it is more efficient to compute the eigenvectors $v_i$ of the much smaller $R \times R$ matrix

$$L = A^T A \quad (6)$$

from which the eigenvectors of C can be obtained as [34]

$$u_i = A v_i \quad (7)$$

Applying a Euclidean distance measure and the K-nearest-neighbor method, faces can be classified [19, 34].

B. Fisherface Classifier

Dimensionality reduction by linear discriminant analysis (LDA) can counter the PCA's disadvantage of not considering any class dependencies while projecting images into a feature space. Using Fisher's linear discriminant analysis (FLD), the classes stay linearly separable, which makes classification easier and more reliable, especially under changing lighting conditions. Thus, E orthonormal vectors describe a matrix W such that the between-class scatter (see (8)) is maximized while the within-class scatter (see (9)) is minimized. For both, the matrices $S_B$ and $S_W$ have to be defined:

$$S_B = \sum_{i=1}^{C} N_i\, (\nu_i - \nu)(\nu_i - \nu)^T \quad (8)$$

$$S_W = \sum_{i=1}^{C} \sum_{x_k \in X_i} (x_k - \nu_i)(x_k - \nu_i)^T \quad (9)$$

where C is the number of different classes, $N_i$ the number of images of class $X_i$, $\nu_i$ its mean image and $\nu$ the overall mean image. For face recognition tasks, the LDA projection $W_{opt}$ can be written as in (10). Using the PCA, the dimension is first reduced to $N - C$, while the FLD reduces it further to $C - 1$ [4].

$$W_{opt}^T = W_{fld}^T\, W_{pca}^T \quad (10)$$

with

$$W_{pca} = \arg\max_W \left| W^T S_T W \right| \quad (11)$$

$$W_{fld} = \arg\max_W \frac{\left| W^T W_{pca}^T S_B W_{pca} W \right|}{\left| W^T W_{pca}^T S_W W_{pca} W \right|} \quad (12)$$

where $S_T$ denotes the total scatter matrix. The Fisherface method provides better handling of background and lighting than Eigenfaces do [13]. Furthermore, Fisherfaces are much more reliable when using a small training set or faces differing heavily from the training data, e.g. due to glasses or facial expressions [4, 23].

C. Local Binary Pattern Histogram Classifier

The classification using local binary pattern histograms (LBPH) is quite similar to face detection with LBPs. The differences are that, in order to identify individuals, a K-nearest-neighbor method is utilized, and that the LBP operator can be extended to obtain more reliable results. For this, multiple approaches are possible. Instead of considering the eight direct neighbors, P neighboring pixels on a radius R can be used for the generalized $LBP_{P,R}$ operator [1, 15, 23]. Another option is to use a multi-block LBP, which compares the average grayscale values of neighboring pixel blocks with the average of a centered region [15].

The LBPH classifier's main disadvantage is that it is quite slow and therefore unsuitable for fluent video playback in real-time situations [1]. Thus, as described in the following section, it cannot be used in the implemented application.

V. Application of a Face Recognition System

In order to compensate for changes in lighting, face rotation, background and hairstyle, some preprocessing steps are taken before recognizing faces (see Figure 9). It can be assumed that this allows applying the classifiers not only in constrained environments, but in arbitrary ones [34].

As a suitable face detection algorithm, the Haar cascade classifier is chosen. Even though OpenCV's LBP classifier shows an approximately 61 % faster processing time in practice, it is less accurate and unable to detect faces reliably in an artificially lighted office room. In comparison to the LBP classifier, the Haar cascade classifier is a little slower, but still capable of consistently classifying pictures into faces and non-faces. In order to detect the left and right eye in an image, the corresponding Haar cascade eye detectors are utilized.
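A rough Python sketch of the preprocessing chain of Figure 9 is given below. The target resolution, cascade files and the use of the first detected face are illustrative assumptions; the detected eye positions would additionally be used to crop and align the approximate eye region, which is only hinted at here.

import cv2

face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
left_eye_cascade  = cv2.CascadeClassifier("haarcascade_lefteye_2splits.xml")
right_eye_cascade = cv2.CascadeClassifier("haarcascade_righteye_2splits.xml")

def preprocess(frame, size=(100, 100)):
    """Detect a face and normalize it for recognition (illustrative sketch)."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]                           # use the first detected face
    face = cv2.resize(gray[y:y + h, x:x + w], size)
    # The eye detections could drive rotation and cropping of the eye region.
    left_eyes = left_eye_cascade.detectMultiScale(face)
    right_eyes = right_eye_cascade.detectMultiScale(face)
    return cv2.equalizeHist(face)                   # compensate lighting changes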
Figure 9: Preprocessing steps for face recognition: face detection, resizing to a lower resolution, eye detection, cutting out the approximate eye region and histogram equalization

Figure 10: Recognition accuracy vs. number of training faces

To find the best face recognizer for the described application, several tests were conducted. In particular, they measured the accuracy, training duration, model size and recognition speed over the number of components (Eigenfaces/Fisherfaces) and classes in the training set, as well as the total recognition accuracy over the number of training images (see Figure 10). It can be seen that, after applying the preprocessing steps, the Eigenface method consistently outperforms the Fisherface method. For three and more faces, the LBPH is slightly more accurate than the Eigenface method. The tests also show that the training duration of the Eigenface method is slightly lower than that of the Fisherface method for relatively small numbers of components, while this reverses for larger numbers of components. The training duration of the LBPH is by far the longest, especially when using larger radii and more neighbors. The same picture emerges when comparing model sizes: the LBPH models are a lot larger than the PCA and LDA models, which are of the same order of magnitude, even though the LDA models are smaller. The Fisherfaces' recognition speeds are, starting at equally many classes and components, slightly higher than the Eigenfaces'. Before that point, both are almost identical and about five times higher than the LBPH's recognition speed.

As seen in Figure 11, the Fisherfaces' recognition accuracy reaches its maximum approximately where the number of components equals the number of classes. This can be explained by the number of classes in the dataset: as described in section IV.B, the number of components is limited to C-1, which means that adding further components does not increase the recognition accuracy. Similarly, the Eigenfaces' recognition accuracy reaches its maximum at R, the number of images in the training set (see section IV.A). Using five classes with ten images each (minus two for testing), this limit is reached at approximately 40 components.

Figure 11: Recognition accuracy vs. number of components

These performance measures suggest using an Eigenface classifier because of its higher recognition accuracy and otherwise similar properties. Further tests have shown that approximately 30 components yield the best recognition results, which matches the literature's suggestions [13, 23, 36]. [4] recommends discarding the first three Eigenfaces to achieve even higher accuracies; unfortunately, this option is not supported by OpenCV [24].

For the final application, a face dataset of four individuals with approximately 1400 images has been collected. The system implemented in OpenCV runs in real time, providing approximately 15-18 frames per second.

VI. Fusion of the Localization Results for Speaker Identification

A fusion of the results of sound source localization and face recognition can enhance the reliability of a speaker detection system and enables it to identify the speaker. This can be used in smart rooms, improved speaker tracking for videoconferencing or applications for ambient assisted living [7, 8, 29]. Furthermore, an extension towards gesture recognition for human-machine interaction is possible [7, 10]. Therefore, a speaker identification algorithm is developed and introduced in this contribution.

In order to track and identify speakers, reliable sound source localization and face recognition are necessary. The sound source, and therefore the potential speaker, is located by finding the color map's maximum. The face recognition provides a specific localization as well, but sometimes there are uncertainties which have to be eliminated, such as false-positive face detections or wrongly classified individuals. To overcome these problems, three consecutively recognized faces are compared. If all of them are classified as the same person, the result is shown at the face's new position. Another possibility is that the recognized face matches one of the two previously identified individuals and lies in approximately the same localization area as the currently detected one; this also results in a certain recognition at the currently classified face's position. If neither option applies, no recognition result is displayed and the detection is ignored as if there had never been a face in the image. Using both the speaker and the face localization results, the overall localization and identification can be achieved as seen in Figure 12. The estimated outcome differentiates between the following cases: no result, identified speaker, face only, unknown speaker or loudspeaker, and loudspeaker and known face at two different positions. Whenever possible, the localization position is chosen as the face's location, because optical tracking algorithms are well known to have a better spatial resolution than acoustic localization techniques [8].
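The decision logic just described could be sketched as follows in Python. The function names, the coordinate representation and the distance threshold are assumptions made for illustration only.

def distance(a, b):
    """Euclidean distance between two image positions (x, y)."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def fuse_recognition(prev, current_id, current_pos, max_dist=50.0):
    """Stabilize face recognition results (illustrative sketch).

    prev holds the two previous results as (person_id, position) tuples,
    current_id / current_pos the newest recognition result.
    Returns the accepted (person_id, position) or None if it is discarded.
    """
    # Case 1: three consecutive recognitions agree on the same person.
    if len(prev) == 2 and all(pid == current_id for pid, _ in prev):
        return current_id, current_pos
    # Case 2: the newest result matches one of the two previous identities
    # and lies in approximately the same localization area.
    for pid, pos in prev:
        if pid == current_id and distance(pos, current_pos) < max_dist:
            return current_id, current_pos
    return None  # otherwise the detection is ignored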
Figure 12: Decision tree for the localization result

Figure 13 shows a possible output image of the implemented speaker identification system. There, a speaker's face and sound source location are detected and merged for a more precise localization and identification of the sound source. The red circle marks the speaker's approximate position.

Figure 13: Result of the speaker identification system

VII. Conclusion

This contribution briefly explains the basics of an acoustic camera and shows why it makes sense to use a double ring array with an odd number of microphones. Additionally, it gives an overview of the implemented sound source localization methods. It is shown that a combination of the SRP and SRP-PHAT algorithms is desirable for speech localization. Furthermore, this contribution gives an introduction to face detection and recognition methods. It is shown that Haar cascade classifiers outperform local binary pattern classifiers when detecting faces. Similarly, it is pointed out that for a face recognition system, Eigenfaces should be preferred over Fisherfaces and local binary pattern histograms. Finally, an algorithm for the fusion of the localization results is introduced, which combines sound source localization and face detection to identify speakers reliably.

References

[1] Ahonen, T., Hadid, A. and Pietikäinen, M.: Face Recognition with Local Binary Patterns. In: European Conference on Computer Vision 2004, Springer-Verlag Berlin, pp. 469-481
[2] Arubas, E.: Face Detection and Recognition (Theory and Practice), 2013. http://eyalarubas.com/face-detection-and-recognition.html, accessed on: 16.02.2017
[3] Baggio, D. L.: Mastering OpenCV with practical computer vision projects.
Step-by-step tutorials to solve common real-world computer vision problems for desktop or mobile, from augmented reality and number plate recognition to face recognition and 3D head tracking. Birmingham: Packt Publishing 2012
[4] Belhumeur, P. N., Hespanha, J. P. and Kriegman, D. J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 7
[5] Bergh, T. F., Hafizovic, I. and Holm, S.: Multi-speaker voice activity detection using a camera-assisted microphone array. IWSSIP Bratislava 2016. The 23rd International Conference on Systems, Signals and Image Processing: Bratislava, Slovakia, 23-25 May 2016: proceedings. Piscataway, NJ: IEEE 2016
[6] Bradski, G. R. and Kaehler, A.: Learning OpenCV. Software that sees. Sebastopol, CA: O'Reilly 2008
[7] Busso, C., Georgiou, P. G. and Narayanan, S.: Real-Time Monitoring of Participants' Interaction in a Meeting Using Audio-Visual Sensors. In: Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference, pp. 685-688
[8] Busso, C., Hernanz, S., Chu, C.-W., Kwon, S.-i., Lee, S. U., Georgiou, P. G., Cohen, I. and Narayanan, S.: Smart Room: Participant and Speaker Localization and Identification. Proceedings of the 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, May 18-23, 2005, Philadelphia, Pennsylvania, USA. Piscataway, NJ: IEEE Operations Center 2005, pp. 1117-1120
[9] Clenet, B.: Circular Microphone Array Based Beamforming And Source Localization On Reconfigurable Hardware, Graz University of Technology, Master's thesis. Graz, Austria 2010
[10] Dai, J., Wu, J., Saghafi, B., Konrad, J. and Ishwar, P.: Towards privacy-preserving activity recognition using extremely low temporal and spatial resolution cameras. 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 7-12 June 2015, Boston, MA. Piscataway, NJ: IEEE 2015, pp. 68-76
[11] DiBiase, J., Silverman, H. and Brandstein, M.: Robust Localization in Reverberant Rooms. In: Brandstein, M. and Ward, D. (Eds.): Microphone arrays. Signal processing, techniques and applications. Engineering online library. Berlin: Springer 2001, pp. 157-180
[12] DiBiase, J. H.: A High-Accuracy, Low-Latency Technique for Talker Localization in Reverberant Environments Using Microphone Arrays, Brown University, PhD thesis. Providence, Rhode Island 2000
[13] Howse, J., Puttemans, S., Hua, Q. and Sinha, U.: OpenCV 3 Blueprints. Community experience distilled. s.l.: Packt Publishing 2015
[14] Huang, T., Xiong, Z. and Zhang, Z.: Face Recognition Applications. In: Li, S. Z. and Jain, A. K. (Eds.): Handbook of Face Recognition. London: Springer-Verlag London Limited 2011, pp. 617-638
[15] Li, S. Z. and Jain, A. K. (Eds.): Handbook of Face Recognition. London: Springer-Verlag London Limited 2011
[16] Li, S. Z. and Wu, J.: Face Detection. In: Li, S. Z. and Jain, A. K. (Eds.): Handbook of Face Recognition. London: Springer-Verlag London Limited 2011, pp. 277-303
[17] Lombard, A. J. V.: Localization of Multiple Independent Sound Sources in Adverse Environments. Lokalisierung mehrerer unabhängiger Schallquellen in widrigen Umgebungen, Universität Erlangen-Nürnberg, PhD thesis. Erlangen-Nürnberg 2012
[18] Mabande, E.: Robust Time-Invariant Broadband Beamforming as a Convex Optimization Problem.
Robuste zeitinvariante Breitband-Keulenformung als konvexes Optimierungsproblem, Friedrich-Alexander-Universität Erlangen-Nürnberg, PhD thesis. Erlangen-Nürnberg 2014
[19] Martinovsky, F. and Wagner, P.: Gesichtserkennung mit Eigenfaces. http://www.bytefish.de/pdf/eigenfaces.pdf, accessed on: 16.02.2016
[20] Möser, M. (Ed.): Messtechnik der Akustik. Berlin: Springer 2010
[21] OpenCV: Cascade Classifier Training — OpenCV 2.4.13.2 documentation. http://docs.opencv.org/2.4.13.2/doc/user_guide/ug_traincascade.html, accessed on: 31.03.2017
[22] OpenCV: Face Detection using Haar Cascades. http://docs.opencv.org/3.1.0/d7/d8b/tutorial_py_face_detection.html, accessed on: 16.02.2017
[23] OpenCV: Face Recognition with OpenCV. http://docs.opencv.org/3.1.0/da/d60/tutorial_face_main.html#tutorial_face_lbph, accessed on: 16.02.2017
[24] OpenCV Community, 2014. http://answers.opencv.org/question/26188/eigenface-algorith-can-be-improved/, accessed on: 01.03.2017
[25] Reschke, J.: Implementation of a Steered Response Power Acoustic Camera, Ostbayerische Technische Hochschule Regensburg, project report. Regensburg 2016
[26] Reschke, J.: Aufbau und Test eines mehrkanaligen Audioaufnahmesystems für eine akustische Kamera, Ostbayerische Technische Hochschule Regensburg, project report. Regensburg 2016
[27] RME: Bedienungsanleitung OctaMic XTC. The Professional's Multiformat Solution. http://www.rme-audio.de/download/octamicxtc_d.pdf, accessed on: 08.06.2016
[28] Rodriguez, Y.: Face Detection and Verification using Local Binary Patterns, École polytechnique fédérale de Lausanne, PhD thesis. Lausanne, Switzerland 2006
[29] Rozgic, V., Busso, C., Georgiou, P. G. and Narayanan, S.: Multimodal Meeting Monitoring: Improvements on Speaker Tracking and Segmentation through a Modified Mixture Particle Filter. In: Multimedia Signal Processing, 2007. MMSP 2007. IEEE, pp. 60-65
[30] Scholte, R., Roozen, B. and Lopez, I.: On Spatial Sampling and Aliasing in Acoustic Imaging. In: Twelfth International Congress on Sound and Vibration, 2005
[31] Sigl, J. and Scheucher, R.: Acoustic Imaging of Sound Sources with a Student-Designed Acoustic Camera. American Journal of Undergraduate Research 6 (2007) 2
[32] Stieler, W.: Die Schutzbrille. Technology Review 02 (2017), pp. 82-83
[33] Szeliski, R.: Computer vision. Algorithms and applications. Texts in computer science. London: Springer 2011
[34] Turk, M. and Pentland, A.: Eigenfaces for Recognition. Journal of Cognitive Neuroscience 3 (1991) 1
[35] Viola, P. and Jones, M.: Rapid Object Detection using a Boosted Cascade of Simple Features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2001
[36] Willow Garage: OpenCV Face Module. OpenCV 2015