M Tech Dissertations
Permanent URI for this collection: http://drsr.daiict.ac.in/handle/123456789/3
Search Results
Item Open Access: Semantic Segmentation Based Object Detection for Autonomous Driving (Dhirubhai Ambani Institute of Information and Communication Technology, 2023). Prajapati, Harsh; Maiti, Tapas Kumar

This research focuses on solving the autonomous driving problem, which is necessary to meet the increasing demand for autonomous systems in today's world. The key aspect in addressing this challenge is the real-time identification and recognition of objects within the driving environment. To accomplish this, we employ the semantic segmentation technique, integrating computer vision, machine learning, deep learning, the PyTorch framework, image processing, and the Robot Operating System (ROS). Our approach involves creating an experimental setup using an edge device, specifically a Raspberry Pi, in conjunction with the ROS framework. By deploying a deep learning model on the edge device, we aim to build a robust and efficient autonomous system that can accurately identify and recognize objects in real time.
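As a rough illustration of the kind of pipeline this abstract describes, here is a minimal PyTorch sketch of single-frame semantic segmentation inference. The model choice (torchvision's DeepLabV3 with a MobileNetV3 backbone, a common pick for edge hardware), the input file name, and the preprocessing constants are assumptions for illustration, not the dissertation's actual configuration:

```python
# Minimal sketch: semantic segmentation on a single camera frame.
# Model and preprocessing are illustrative assumptions only.
import torch
import torchvision
from torchvision import transforms
from PIL import Image

model = torchvision.models.segmentation.deeplabv3_mobilenet_v3_large(weights="DEFAULT")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((520, 520)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

frame = Image.open("camera_frame.jpg")      # hypothetical input image
batch = preprocess(frame).unsqueeze(0)      # shape: (1, 3, H, W)

with torch.no_grad():
    out = model(batch)["out"]               # (1, num_classes, H, W) logits
pred = out.argmax(dim=1).squeeze(0)         # per-pixel class labels
print(pred.shape, pred.unique())
```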
Item Open Access: Position Estimation of Intelligent Artificial Systems Using 3D Point Cloud (Dhirubhai Ambani Institute of Information and Communication Technology, 2023). Patel, Vraj; Maiti, Tapas Kumar

The three-dimensional reality captured by various sensors, such as LiDAR scanners, depth cameras, and stereo cameras, is represented by point cloud data. The capacity of point clouds to provide rich geometric information about the surroundings makes them essential in various applications: robotics, autonomous cars, augmented reality, virtual reality, and 3D reconstruction all use point clouds. They allow for object detection, localization, mapping, scene comprehension, and immersive visualization.

Working with point clouds, on the other hand, presents substantial complications. Some primary issues are managing a vast volume of data, dealing with noise and outliers, handling occlusions and missing data, and conducting efficient processing and analysis. Furthermore, point clouds frequently necessitate complicated registration, segmentation, feature extraction, and interpretation methods, requiring computationally costly processing. Addressing these issues is critical for realizing the full potential of point cloud data in a variety of real-world applications.

SLAM is a key technique in robotics and computer vision that addresses the challenge of estimating a robot's pose and constructing a map of its environment. It finds applications in driverless cars, drones, and augmented reality, enabling autonomous navigation without external infrastructure or GPS. Challenges include sensor noise, drift, and uncertainty, requiring robust sensor calibration, motion modeling, and data association. Real-time speed, computing constraints, and memory limitations are important considerations. Advanced techniques such as feature extraction, point cloud registration, loop closure detection, and Graph-SLAM optimization algorithms are used. Sensor fusion, map representation, and data association techniques are vital for reliable SLAM performance.

Lightweight LiDAR SLAM has significant implications for various fields, including robotics, autonomous navigation, and augmented reality. Developing compact and efficient LiDAR SLAM systems makes it possible to unlock the potential of lightweight platforms, enabling their deployment in a wide range of applications that require real-time position mapping and localization capabilities while ensuring practicality, portability, and cost-effectiveness. The aim is to create a compact and lightweight LiDAR-based SLAM that can be easily integrated into various platforms without compromising the accuracy and reliability of the SLAM algorithms. Hence, we implemented a lightweight SLAM algorithm on our dataset with various background situations, with a few modifications to the existing SLAM algorithm to improve the results. We performed SLAM using a LiDAR sensor alone, without an IMU or GPS sensor.
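Scan-to-scan registration is the core operation inside such a LiDAR-only SLAM pipeline. The following minimal point-to-point ICP in NumPy/SciPy is a hedged sketch of that step only; the iteration count and tolerance are arbitrary, and loop closure and graph optimization are not shown:

```python
# Minimal sketch of point-to-point ICP, the scan-registration step at the
# core of LiDAR SLAM. Illustrative only; not the dissertation's pipeline.
import numpy as np
from scipy.spatial import cKDTree

def best_fit_transform(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:        # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = cd - R @ cs
    return R, t

def icp(src, dst, iters=30, tol=1e-6):
    """Align source scan (N, 3) to destination scan (M, 3); returns 4x4 pose."""
    T = np.eye(4)
    cur = src.copy()
    prev_err = np.inf
    tree = cKDTree(dst)
    for _ in range(iters):
        dists, idx = tree.query(cur)             # nearest-neighbour matches
        R, t = best_fit_transform(cur, dst[idx])
        cur = cur @ R.T + t                      # apply incremental transform
        Ti = np.eye(4); Ti[:3, :3] = R; Ti[:3, 3] = t
        T = Ti @ T                               # accumulate total pose
        err = dists.mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return T
```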
Item Open Access: Common Object Segmentation in Dynamic Image Collection using Attention Mechanism (Dhirubhai Ambani Institute of Information and Communication Technology, 2022). Baid, Sana; Hati, Avik

Semantic segmentation of image groups is a crucial task in computer vision that aims to identify shared objects in multiple images. This work presents a deep neural network framework that exploits congruity between images, thereby cosegmenting common objects. The proposed network is an encoder-decoder network where the encoder extracts high-level semantic feature descriptors and the decoder generates segmentation masks. The task of cosegmentation between the images is boosted by an attention mechanism that leverages semantic similarity between feature descriptors. This attention mechanism is responsible for understanding the correspondence between the features, thereby determining the shared objects. The resultant masks localize the shared foreground objects while suppressing everything else as background. We have explored multiple attention mechanisms in a two-image input setup and have extended the model that outperforms the others to a dynamic image input setup. The term dynamic image connotes that a varying number of images can be input to the model simultaneously, and the result is the segmentation of the common object from all of the input images. The model is trained end to end on an image group dataset generated from the PASCAL VOC 2012 [7] dataset. The experiments are conducted on other benchmark datasets as well, and the results achieved show the superiority of our model. Moreover, an important advantage of the proposed model is that it runs in linear time, as opposed to the quadratic time complexity observed in most works.
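The correspondence idea can be sketched in a few lines: features of one image attend to semantically similar features of the other, so regions depicting the shared object receive coherent support while background attends diffusely. The dot-product similarity, the feature shapes, and the function below are illustrative assumptions, not the proposed network's actual attention design:

```python
# Minimal sketch of cross-image attention between two feature maps, the kind
# of correspondence mechanism the abstract describes. Illustrative only.
import torch
import torch.nn.functional as F

def cross_image_attention(fa, fb):
    """fa, fb: (C, H, W) encoder features of images A and B.
    Returns A's features re-expressed from B's best-matching locations."""
    C, H, W = fa.shape
    qa = fa.reshape(C, H * W).T                   # (HW, C) queries from A
    kb = fb.reshape(C, H * W)                     # (C, HW) keys from B
    attn = F.softmax(qa @ kb / C ** 0.5, dim=-1)  # (HW, HW) similarity map
    out = attn @ fb.reshape(C, H * W).T           # (HW, C) aggregated B features
    return out.T.reshape(C, H, W)

fa, fb = torch.randn(256, 32, 32), torch.randn(256, 32, 32)
print(cross_image_attention(fa, fb).shape)        # torch.Size([256, 32, 32])
```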
Item Open Access: Comparative Study: Neural Networks on MCUs at the Edge (2021). Anand, Harshita; Bhatt, Amit

Computer vision has evolved enormously over the years: processors and cameras have shrunk in size, grown in computational power, and become affordable enough to integrate into embedded systems. It has several critical applications that require high accuracy and real-time response in order to achieve a good user experience. Neural networks (NNs) are an attractive choice for embedded vision architectures due to their superior performance and better accuracy in comparison to traditional processing algorithms. Because security and latency issues make larger systems unattractive for certain time-dependent applications, an always-on system is required; such an application has a highly constrained power budget and typically needs to run on tiny microcontroller systems with limited memory and compute capability. The NN model design must take these constraints into account.

We have performed NN model explorations and evaluated embedded vision applications, including person detection, object detection, image classification, and facial recognition, on resource-constrained microcontrollers. We trained a variety of neural network architectures from the literature, comparing their accuracy and memory/compute requirements, and we show that NN architectures can be optimized to fit the computational and memory budgets of microcontroller systems without sacrificing accuracy. We also delve into depthwise separable convolutional neural networks (DS-CNNs) and convolutional neural networks (CNNs), both of which are utilized in the MobileNet architecture. This thesis presents a comparative analysis of the performance of edge devices in the field of embedded computer vision; the three parameters under major focus in this study are latency, accuracy, and millions of operations.
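The parameter savings that make DS-CNNs attractive on MCUs are easy to see in a short, hedged PyTorch sketch; the channel sizes below are arbitrary, not the thesis's models:

```python
# Minimal sketch contrasting a standard convolution with the depthwise
# separable convolution used in MobileNet-style MCU models.
import torch
import torch.nn as nn

def standard_conv(cin, cout):
    return nn.Conv2d(cin, cout, kernel_size=3, padding=1)

def ds_conv(cin, cout):
    # depthwise: one 3x3 filter per input channel (groups=cin), then
    # pointwise: 1x1 conv mixes channels — far fewer weights and MACs.
    return nn.Sequential(
        nn.Conv2d(cin, cin, kernel_size=3, padding=1, groups=cin),
        nn.Conv2d(cin, cout, kernel_size=1),
    )

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, 64, 32, 32)
std, ds = standard_conv(64, 128), ds_conv(64, 128)
assert std(x).shape == ds(x).shape           # same output shape
print(n_params(std), n_params(ds))           # 73856 vs. 8960 parameters
```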
Item Open Access: Image question and answering (Dhirubhai Ambani Institute of Information and Communication Technology, 2019). Mehta, Archan; Khare, Manish

In the past few years, question answering on images has received a lot of attention from researchers. It is a much-researched topic, since it covers both the domains of Computer Vision and Natural Language Processing; recent advances show that it also involves Knowledge Representation and Reasoning. Datasets for Image Question Answering have been created since 2014, but all have their own drawbacks, and implementing them with different methods yields different accuracies. I have studied and analyzed the different approaches proposed to solve the problem, and which techniques are more efficient on a given dataset. Different datasets contain different types of question pairs and images. We have also analyzed the different types of datasets used, assessed which gives better results, and considered what future work is possible and how both the datasets and the methods can be improved.

Item Open Access: Locality preserving projection: a study and applications (Dhirubhai Ambani Institute of Information and Communication Technology, 2012). Shikkenawis, Gitam; Mitra, Suman K

Locality Preserving Projection (LPP) is a recently proposed approach for dimensionality reduction that preserves neighbourhood information and obtains a subspace that best detects the essential data manifold structure. It is currently widely used for finding the intrinsic dimensionality of data that is usually of high dimension. This characteristic of LPP has made it popular among other available dimensionality reduction approaches such as Principal Component Analysis (PCA). A study of LPP reveals that it tries to preserve information about the nearest neighbours of data points, which may lead to misclassification in the overlapping regions of two or more classes during data analysis. It has also been observed that the dimension reducibility capacity of conventional LPP is much less than that of PCA. A new proposal called Extended LPP (ELPP), which amicably resolves the two issues mentioned above, is introduced. In particular, a new weighting scheme is designed that pays importance to data points at a moderate distance, in addition to the nearest points. This helps resolve the ambiguity occurring in the overlapping regions as well as increase the reducibility capacity.

LPP is used in a variety of dimensionality reduction applications, one of which is Face Recognition, among the most widely used biometric technologies for person identification. Face images are represented as high-dimensional pixel arrays and, due to the high correlation between neighbouring pixel values, often belong to an intrinsically low-dimensional manifold. The distribution of data in a high-dimensional space is non-uniform and is generally concentrated around some kind of low-dimensional structure. Hence, one way of performing Face Recognition is to reduce the dimensionality of the data and find the subspace of the manifold in which the face images reside. Both LPP and ELPP are used for Face and Expression Recognition tasks. As the aim is to separate the clusters in the embedded space, class membership information may add more discriminating power. With this in mind, the proposal is further extended to a supervised version of LPP (SLPP) that uses the known class labels of data points to enhance the discriminating power while inheriting the properties of ELPP.
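In standard LPP the projection directions come from the generalized eigenvalue problem X L X^T a = λ X D X^T a, where W is the neighbourhood affinity matrix, D its diagonal of row sums, and L = D − W the graph Laplacian; the directions with the smallest eigenvalues preserve locality. A minimal NumPy/SciPy sketch of this classical formulation follows; the parameter values are illustrative, and ELPP's moderate-distance weighting is not reproduced:

```python
# Minimal sketch of classical Locality Preserving Projection (LPP):
# heat-kernel affinity over k nearest neighbours, then the generalized
# eigenproblem X L X^T a = lambda X D X^T a, smallest eigenvalues first.
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist

def lpp(X, n_components=2, k=5, t=1.0):
    """X: (n_samples, n_features). Returns projection matrix (n_features, d)."""
    D2 = cdist(X, X, "sqeuclidean")
    W = np.exp(-D2 / t)                        # heat-kernel weights
    idx = np.argsort(D2, axis=1)[:, 1:k + 1]   # k nearest neighbours
    mask = np.zeros_like(W, dtype=bool)
    rows = np.repeat(np.arange(len(X)), k)
    mask[rows, idx.ravel()] = True
    W = np.where(mask | mask.T, W, 0.0)        # symmetrized kNN affinity
    Dg = np.diag(W.sum(axis=1))
    L = Dg - W                                 # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ Dg @ X + 1e-9 * np.eye(X.shape[1])  # regularize for stability
    vals, vecs = eigh(A, B)                    # ascending eigenvalues
    return vecs[:, :n_components]

X = np.random.rand(100, 20)
P = lpp(X)
print((X @ P).shape)    # (100, 2) embedded coordinates
```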
Item Open Access: Fingerprint image preprocessing for robust recognition (Dhirubhai Ambani Institute of Information and Communication Technology, 2012). Munshi, Paridhi; Mitra, Suman K

The fingerprint is the oldest and most widely used form of biometric identification. Since fingerprints are mainly used in forensic science, accuracy in fingerprint identification is highly important, and this accuracy depends on the quality of the image. Most fingerprint identification systems are based on minutiae matching, and a critical step in correct matching of fingerprint minutiae is to reliably extract minutiae from the fingerprint images. However, fingerprint images may not be of good quality: they may be degraded and corrupted by variations in skin, pressure, and impression conditions. Most feature extraction algorithms work on binary images instead of the grayscale image, and the result of feature extraction depends on the quality of the binary image used. Keeping these points in mind, image preprocessing comprising enhancement and binarization is proposed in this work. This preprocessing is employed prior to minutiae extraction to obtain a more reliable estimate of minutiae locations and hence a robust matching performance. In this dissertation, we give an introduction to the fingerprint structure and identification system, discuss the proposed methodology and implementation of a technique for fingerprint image enhancement, and then propose a rough-set based method for binarization, followed by a discussion of methods for minutiae extraction. Experiments are conducted on real fingerprint images to evaluate the performance of the implemented techniques.

Item Open Access: Back-view based visual hand gesture recognition system (Dhirubhai Ambani Institute of Information and Communication Technology, 2011). Sharma, Harish; Banerjee, Asim

Gesture recognition is a fascinating area of research due to its applications in HCI (human-computer interaction), entertainment, communication between deaf/mute people, etc. A gesture can be dynamic or static depending upon the application: static gestures can be called postures, and dynamic gestures are sequences of postures. Our method is an attempt to classify various postures in American Sign Language (ASL) for a wearable computing device like "Sixth Sense" (developed at the MIT Media Lab) [17]. We work with a new set of features, including the vertical-horizontal histogram of a posture shape, and use a Linear Discriminant Analysis (LDA) classifier for the purpose of classification. Our work also attempts to raise some issues regarding the kinds of problems that can arise during posture-shape recognition, and shows how a simple classification technique with a new feature set can give fairly good results.

Item Open Access: Human action recognition in video (Dhirubhai Ambani Institute of Information and Communication Technology, 2011). Kumari, Sonal; Mitra, Suman K.

Action recognition is a central problem in computer vision. An action is any meaningful movement of a human, used to convey information or to interact naturally without any mechanical devices; recognizing it is of utmost importance in designing an intelligent and efficient human-computer interface. The applications of action recognition are manifold, ranging from sign language through medical rehabilitation to virtual reality, and include video retrieval, human-robot interaction, and interaction with deaf and mute people. In an action recognition system, a video stream is captured by a fixed camera, which may be mounted on the computer or elsewhere. Preprocessing steps then remove the noise caused by illumination effects, blurring, false contours, etc. Background subtraction, also known as foreground/background segmentation or foreground extraction (these terms are used interchangeably in this thesis), removes the static or slowly varying background. In this thesis, multiple background subtraction algorithms are tested on the action database and one of them is selected for the action recognition system; good background segmentation provides a more robust basis for object class recognition. The following four methods for extracting the foreground are tested: (1) frame difference, (2) background subtraction, (3) the adaptive Gaussian mixture model (adaptive GMM) [25], and (4) the improved adaptive Gaussian mixture model (improved adaptive GMM) [26], of which the last gives the best result. The action region can then be extracted from the original video sequences with the help of the extracted foreground object.

The next step is feature extraction, which extracts important features (such as corner points, optical flow, shape, and motion vectors) from the image frame, to be used for tracking across the video frame sequence; feature reduction is an optional step that reduces the dimension of the feature vector. To recognize actions, any learning and classification algorithm can be employed: the system is trained on a training dataset, and a new video can then be classified according to the action occurring in it. The following three features are applied to the action recognition task: (1) the distance between the centroid and corner points, (2) optical flow motion estimation [28, 29], and (3) the discrete Fourier transform (DFT) of an image block. Among these, the proposed DFT feature plays a very important role in uniquely identifying a specific action in the database; the proposed novel action recognition model uses the DFT of small image blocks. For the experiments, MuHAVi data [33] and DA-IICT data are used, which include various kinds of actions by various actors. Two supervised recognition techniques are used: K-nearest neighbour (KNN) and a classifier using the Mahalanobis metric. KNN is a parameterized classification technique in which the parameter K must be optimized, whereas the Mahalanobis classifier is non-parametric, so no parameter optimization is needed. To check the accuracy of the proposed algorithm, sensitivity and false alarm rate tests are performed; their results show that the proposed algorithm is quite accurate at recognizing actions in video. To compare the recognition system with other recognition techniques, confusion matrices are created and compared. All experiments are performed in MATLAB®.
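The front end of such a system can be sketched with OpenCV, whose MOG2 subtractor implements an improved adaptive GMM of the kind cited in the abstract, followed by a DFT feature of the foreground block. The file name, block size, and coefficient count below are illustrative assumptions; the thesis's exact feature construction may differ:

```python
# Minimal sketch: improved adaptive GMM background subtraction (cv2 MOG2)
# followed by a DFT block feature on the cropped action region.
import cv2
import numpy as np

subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=False)

def dft_block_feature(gray_block, keep=8):
    """Magnitudes of the lowest keep x keep DFT coefficients of a block."""
    f = np.fft.fft2(gray_block.astype(np.float32))
    return np.abs(f[:keep, :keep]).ravel()

cap = cv2.VideoCapture("action_clip.avi")   # hypothetical input video
while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = subtractor.apply(frame)          # foreground mask (0/255)
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        continue
    # crop the action region around the moving actor
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    gray = cv2.cvtColor(frame[y0:y1 + 1, x0:x1 + 1], cv2.COLOR_BGR2GRAY)
    block = cv2.resize(gray, (64, 64))      # fixed-size block for the DFT
    feat = dft_block_feature(block)         # feed to KNN / Mahalanobis
cap.release()
```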
Item Open Access: Disparity estimation by stereo using particle swarm optimization and graph cuts (Dhirubhai Ambani Institute of Information and Communication Technology, 2010). Nahar, Sonam; Joshi, Manjunath V.

Stereo vision is based on obtaining the disparity between a left and a right view of a scene. From the disparity we find the distance (depth) of each object point from the camera, so that we can construct a 3-D form of the scene; a disparity map indicates the depth of the scene at various points. In this thesis we first discuss local window-based approaches, such as the correlation window and the adaptive window, for finding the disparity map. These local approaches perform well in highly textured regions and in non-repetitive and irregular patterns; however, they produce noisy disparities in textureless regions and fail to account for occluded areas. We then discuss particle swarm optimization and graph cuts, two global optimization techniques, as tools to obtain better estimates of the disparity map. These algorithms make the smoothness assumption explicit and solve the problem by minimizing a specified energy function. Particle swarm optimization, a bio-inspired optimization technique, is simple to implement but has high time complexity, whereas graph cuts converge very fast and yield better estimates. In this thesis we use rectified stereo pairs, which reduces the correspondence search to 1-D. To demonstrate the effectiveness of the algorithms, experimental results on stereo pairs, including ones with ground-truth values for quantitative comparison, are presented. Our results show that the disparity estimated using graph-cut minimization outperforms particle swarm optimization and the local window-based approaches in terms of quantitative measures, with fast convergence.
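As a baseline illustration of the local, window-based side of this comparison, a short OpenCV sketch computes disparity on a rectified pair by correlation-window block matching; the file names and matcher parameters are assumptions, and the thesis's PSO and graph-cut optimizers are not reproduced here:

```python
# Minimal sketch of a local correlation-window disparity baseline on a
# rectified stereo pair (the search reduces to 1-D along scanlines).
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # hypothetical pair
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# block matcher: 15x15 correlation window, disparities searched in [0, 64)
bm = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disp = bm.compute(left, right).astype("float32") / 16.0  # fixed-point -> px

# depth is inversely proportional to disparity: Z = f * B / d
# (f: focal length in pixels, B: baseline) — values are camera-specific
out = cv2.normalize(disp, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", out)
```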