Maryam Harakat (359826), Mohamed Taha Guelzim (355739), Muhammad Mikail Rais (402800)
COM-304 Final Project Report
This project implements a real-time ball detection and tracking system for a mobile robot using the OAK-D Lite stereo camera and the ROS 2 Humble middleware. The system leverages on-device YOLO inference provided by DepthAI, stereo depth data, and odometry feedback to estimate the relative position of a ball in 2D space. The main output of the system is a boolean flag indicating detection and a position vector consisting of the estimated angle and distance to the ball. This data can be used for downstream navigation or manipulation tasks.
This project aims to develop an autonomous robot capable of navigating through obstacle-filled environments to locate, push, and deliver a predefined object to a moving human recipient. Unlike traditional navigation tasks, this requires dynamic target tracking and human interaction. The key challenge is designing a solution that doesn’t require specialized hardware, such as robotic arms or custom pushing mechanisms. This raises important research questions:
Solving this problem is vital for advancing autonomous robots in human-centric applications, such as assisting elderly or disabled individuals, automating package delivery, and streamlining warehouse logistics. By eliminating the need for hardware modifications, our solution enables easy deployment on existing platforms, making it adaptable and scalable. This research addresses the broader question:
Ball detection and tracking is a fundamental problem in mobile robotics, particularly in applications involving object following, pick-and-place, or navigation toward targets. This project integrates real-time object detection, depth perception, and motion compensation to track a red sports ball using the OAK-D Lite camera and a TurtleBot 4 Lite running ROS 2 Humble. Hence, the primary objectives of this project were:
DepthAI is an embedded AI vision platform enabling stereo depth and neural inference on-device via the Myriad X processor. Our system uses DepthAI’s YoloSpatialDetectionNetwork to perform real-time object detection and depth estimation on-device.
The objective of that work—detecting and tracking objects in real time with a mobile robot—is similar to our project goal, although it relies on traditional vision methods (color segmentation, optical flow) and CPU-based processing. We improved robustness and speed by using on-device depth sensing and YOLO, leading to more accurate and lower-latency tracking.
Mur-Artal et al. (2015) presented ORB-SLAM, a feature-based visual SLAM system that builds an accurate spatial understanding of the environment over time from camera data.
To enable autonomous space exploration, our system integrates Simultaneous Localization and Mapping (SLAM) with custom frontier detection and navigation strategies. SLAM is a core technique that allows the robot to construct a map of an unknown environment while concurrently estimating its own position within that map. In our implementation, we utilized a SLAM algorithm in conjunction with ROS 2 and RViz, enabling the construction and real-time visualization of a 2D occupancy grid. This grid map is generated using LiDAR data and categorizes the environment into free space, obstacles, and unknown regions. The exploration process begins with the initial mapping phase, where SLAM builds a foundational representation of the environment. A custom algorithm is then applied to detect frontiers—boundaries between known and unexplored areas of the map.
Once identified, frontiers are evaluated based on specific selection criteria. The size of each frontier is assessed, with preference given to larger frontiers that potentially open access to broader unexplored areas. After a frontier is selected, the robot navigates to it using the Nav2 navigation stack. Navigation is supported by a costmap, which incorporates obstacle data to ensure safe and collision-free movement.
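As an illustration of this idea, the sketch below extracts frontier cells from a 2D occupancy grid and groups them into connected components, preferring the largest group; it uses NumPy and SciPy for brevity. The grid encoding (-1 unknown, 0 free, 100 occupied) follows the ROS 2 OccupancyGrid convention, and the function names and minimum-size threshold are illustrative rather than our exact implementation.

```python
import numpy as np
from scipy import ndimage

FREE, OCCUPIED, UNKNOWN = 0, 100, -1  # ROS 2 OccupancyGrid cell values

def find_frontiers(grid: np.ndarray) -> list[np.ndarray]:
    """Return groups of frontier cells: free cells adjacent to unknown cells."""
    free = grid == FREE
    unknown = grid == UNKNOWN
    # A free cell is a frontier cell if at least one 4-neighbour is unknown.
    neighbour_unknown = np.zeros_like(unknown)
    neighbour_unknown[1:, :] |= unknown[:-1, :]
    neighbour_unknown[:-1, :] |= unknown[1:, :]
    neighbour_unknown[:, 1:] |= unknown[:, :-1]
    neighbour_unknown[:, :-1] |= unknown[:, 1:]
    frontier_mask = free & neighbour_unknown
    # Group adjacent frontier cells into connected components.
    labels, n = ndimage.label(frontier_mask)
    return [np.argwhere(labels == i) for i in range(1, n + 1)]

def select_frontier(frontiers: list[np.ndarray], min_size: int = 5):
    """Prefer the largest frontier, ignoring tiny ones likely caused by noise."""
    candidates = [f for f in frontiers if len(f) >= min_size]
    if not candidates:
        return None
    largest = max(candidates, key=len)
    return largest.mean(axis=0)  # centroid in grid coordinates (row, col)
```

The returned centroid is in grid (row, column) coordinates; converting it into a map-frame goal requires the grid origin and resolution from the OccupancyGrid message.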
To aid debugging and enhance situational awareness, we created a secondary map that visualizes waypoints and distinguishes between accessible and non-accessible points. This secondary map provides critical insight into the robot’s decision-making process during exploration and navigation.
A notable achievement in our project is that we managed to run YOLO inference natively on the TurtleBot’s camera. This enabled the robot to perform real-time object detection while simultaneously running SLAM to map the environment. First, we installed the DepthAI library on the TurtleBot’s Raspberry Pi. DepthAI is an open-source framework developed by Luxonis for interfacing with DepthAI hardware, including the OAK-D Lite camera mounted on the TurtleBot. Because DepthAI devices use Intel’s Myriad X VPU, they only run models compiled into a special binary format called .blob (binary large object), which in the DepthAI ecosystem denotes a compiled model ready to run on-device. A blob is generated from standard model formats such as ONNX, TensorFlow, or OpenVINO IR using a tool called blobconverter. Luxonis maintains an open model zoo containing various object detection models, among them YOLOv8n and YOLOv7. We compared the performance and reliability of both models and decided to use YOLOv8n.
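A minimal conversion step could look like the following sketch; the exact zoo entry name and the number of SHAVE cores are assumptions for illustration and may differ from the configuration we actually used.

```python
import blobconverter

# Download a YOLOv8n model from the Luxonis model zoo and compile it into a
# .blob for the Myriad X VPU. The zoo name and SHAVE count are placeholders.
blob_path = blobconverter.from_zoo(
    name="yolov8n_coco_640x352",   # assumed zoo entry for YOLOv8n
    zoo_type="depthai",
    shaves=6,
)
print(f"Compiled blob saved to: {blob_path}")
```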
The DepthAI framework provides a convenient Python API to interact with the camera and run inference. We used the YoloSpatialDetectionNetwork class to create a pipeline that captures images, runs YOLO inference, and outputs detection results with spatial coordinates. The pipeline also provides depth information for each detected object, allowing us to estimate the 2D position of the ball in the robot’s coordinate frame.
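The following is a minimal sketch of such a pipeline using the DepthAI Python API, assuming a 416×416 YOLOv8n blob with 80 COCO classes; the blob path, input resolution, and thresholds are illustrative and not necessarily our exact values. The bearing to a detection is derived from the spatial X and Z coordinates (in millimetres) reported by the device.

```python
import math
import depthai as dai

pipeline = dai.Pipeline()

# RGB camera feeding the detector.
cam = pipeline.create(dai.node.ColorCamera)
cam.setPreviewSize(416, 416)           # must match the blob's input size
cam.setInterleaved(False)
cam.setColorOrder(dai.ColorCameraProperties.ColorOrder.BGR)

# Stereo pair for depth, aligned to the RGB camera.
mono_left = pipeline.create(dai.node.MonoCamera)
mono_left.setBoardSocket(dai.CameraBoardSocket.LEFT)
mono_right = pipeline.create(dai.node.MonoCamera)
mono_right.setBoardSocket(dai.CameraBoardSocket.RIGHT)
stereo = pipeline.create(dai.node.StereoDepth)
stereo.setDepthAlign(dai.CameraBoardSocket.RGB)
mono_left.out.link(stereo.left)
mono_right.out.link(stereo.right)

# On-device YOLO with spatial (XYZ) output per detection.
yolo = pipeline.create(dai.node.YoloSpatialDetectionNetwork)
yolo.setBlobPath("yolov8n.blob")       # path to the compiled blob (placeholder)
yolo.setNumClasses(80)
yolo.setCoordinateSize(4)
yolo.setConfidenceThreshold(0.5)
yolo.setIouThreshold(0.5)
cam.preview.link(yolo.input)
stereo.depth.link(yolo.inputDepth)

xout = pipeline.create(dai.node.XLinkOut)
xout.setStreamName("detections")
yolo.out.link(xout.input)

with dai.Device(pipeline) as device:
    queue = device.getOutputQueue("detections", maxSize=4, blocking=False)
    while True:
        for det in queue.get().detections:
            x = det.spatialCoordinates.x / 1000.0  # metres, right of the camera axis
            z = det.spatialCoordinates.z / 1000.0  # metres, forward
            bearing = math.atan2(x, z)             # angle relative to the camera frame
            distance = math.hypot(x, z)
            print(f"label={det.label} angle={math.degrees(bearing):.1f} deg "
                  f"dist={distance:.2f} m")
```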
The ball detection module runs continuously, capturing images and processing them in real-time.
1. Locating Ball Target: During navigation, the robot captures images at regular intervals as it moves through the environment. Once the robot reaches a selected frontier point (as given by the Nav2 navigation stack feedback), it performs a 360-degree rotation to further enhance its ability to detect any balls in its surroundings. This rotation ensures that the robot scans the entire environment, maximizing the likelihood of spotting a ball from any angle.
2. Acquiring Ball Target: Upon detecting the ball, we infer its position in the map. In the previous code snippet, we saw how we could get the angle of the ball’s center relative to the camera frame and the distance from the robot to the ball. Additionally, odometry correction is applied to account for the robot’s movement and ensure accurate positioning.
Once the ball’s coordinates are established, the robot uses the Nav2 navigation stack again to plot a direct path to the ball’s location. This ensures that the robot efficiently moves toward the ball, adjusting its trajectory as necessary based on its real-time localization and the environment.
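To make the hand-off from detection to navigation concrete, the sketch below projects a camera-frame bearing and distance into a map-frame goal using the robot's current pose, then sends it to Nav2. It uses the nav2_simple_commander API for brevity; our actual node used its own action clients and message plumbing, and the frame names, pose source, and example values are assumptions.

```python
import math
import rclpy
from geometry_msgs.msg import PoseStamped
from nav2_simple_commander.robot_navigator import BasicNavigator

def ball_to_map_goal(robot_x, robot_y, robot_yaw, bearing, distance, stamp):
    """Project a detection (bearing + range in the robot frame) into a map-frame goal."""
    heading = robot_yaw + bearing          # odometry correction: add the robot's heading
    goal = PoseStamped()
    goal.header.frame_id = "map"
    goal.header.stamp = stamp
    goal.pose.position.x = robot_x + distance * math.cos(heading)
    goal.pose.position.y = robot_y + distance * math.sin(heading)
    # Face the ball on arrival (yaw encoded as a quaternion rotation about Z).
    goal.pose.orientation.z = math.sin(heading / 2.0)
    goal.pose.orientation.w = math.cos(heading / 2.0)
    return goal

rclpy.init()
navigator = BasicNavigator()
navigator.waitUntilNav2Active()

# Example values standing in for the latest odometry pose and detection output.
goal = ball_to_map_goal(robot_x=1.2, robot_y=0.4, robot_yaw=0.3,
                        bearing=-0.15, distance=1.8,
                        stamp=navigator.get_clock().now().to_msg())
navigator.goToPose(goal)
while not navigator.isTaskComplete():
    pass  # Nav2 feedback could be used here to re-plan if the ball moves
rclpy.shutdown()
```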
As explained above, our project is composed of three main components: autonomous exploration of the environment, visual recognition of a colored ball, and navigation toward the detected ball. Each of these components addresses a fundamental robotics task, and each brings its own technical challenges. To develop a robust and reliable system, we adopted a modular and hypothesis-driven approach: we first implemented and tested each module independently, under controlled conditions, and then integrated them for full pipeline testing in diverse real-world environments. We deliberately chose to test each component in isolation before integration for two main reasons. First, this allowed us to identify and debug issues specific to each sub-system without the confounding influence of the others. Second, we hypothesized that performance bottlenecks and failure cases would be easier to interpret when isolated, making debugging and improvements more efficient. This hypothesis was largely confirmed in practice: by testing components separately, we were able to refine frontier exploration strategies, improve visual recognition robustness, and tune navigation parameters without cascading failures.
The goal of the exploration module was to enable the robot to autonomously build a map of an unknown environment using SLAM and then use frontier-based planning to explore the entire space. We hypothesized that frontier-based exploration would generalize well across a variety of environments with different spatial structures. To test this, we deployed the robot in a wide range of real-world settings, including large open areas (such as the basement of the CO building), small rooms, classrooms, amphitheaters, and living spaces. We evaluated the quality of the exploration by visualizing the SLAM-generated maps in RViz and comparing them to the actual layout of the environment.
Our criteria included completeness of the exploration, the relevance of frontier points, and the consistency of the map structure. In early trials, the robot explored small and medium-sized rooms reliably, but failed in large spaces like the CO building basement—remaining in a small area and re-scanning already visited zones. This refuted our initial hypothesis that the exploration algorithm would generalize to all indoor environments. Debugging revealed the issue: the waypoint list wasn’t updated correctly as the map expanded, causing the robot to revisit outdated frontiers. We fixed this by (1) dynamically maintaining the waypoint list to remove stale entries and add new ones as the map grew, and (2) scaling the waypoint map to cover the full known environment.
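A simplified version of this fix is sketched below: on each new map, waypoints that no longer correspond to a frontier (or are unreachable) are dropped, and centroids of newly discovered frontiers are appended. The `frontier_centroids` input, the `is_reachable` predicate, and the 0.5 m spacing are illustrative assumptions, not our exact code.

```python
import math

def update_waypoints(waypoints, frontier_centroids, is_reachable, min_spacing=0.5):
    """Keep the frontier waypoint list in sync with the growing SLAM map."""
    # Drop stale waypoints: those no longer near any current frontier, or unreachable.
    still_valid = [
        wp for wp in waypoints
        if any(math.dist(wp, c) < min_spacing for c in frontier_centroids)
        and is_reachable(wp)
    ]
    # Append genuinely new frontiers, avoiding near-duplicates of kept waypoints.
    for c in frontier_centroids:
        if all(math.dist(c, wp) >= min_spacing for wp in still_valid):
            still_valid.append(tuple(c))
    return still_valid
```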
After these changes, exploration succeeded even in large, complex areas, validating our original hypothesis—once the underlying bug was resolved.
For the recognition module, we chose to detect the ball using YOLO, as stated before. We hypothesized that this method would be sufficient for real-time detection in indoor settings, and that lighting variations or partial occlusions would be the primary challenges. We first tested ball detection with the robot’s camera detached and stationary, placing the ball in various positions: in plain sight, partially hidden, far from the camera, and close to cluttered objects. These initial tests confirmed that our recognition pipeline worked reliably under controlled lighting. However, when we reattached the camera to the moving robot and re-ran the same tests, we encountered new challenges: motion blur and angle variations introduced noise into the detection pipeline. In the first implementation of the 360° image capture, there was not enough stabilization time; photographs were taken while the camera was still partially in motion, and the resulting frames exhibited pronounced blur. The practical consequence was a drop in ball-detection recall during early scans. We hypothesized that the blur was caused not by sensor noise or lighting but by insufficient settling time before each shot. To test this, we inserted a short “settle-and-shoot” pause: the robot halts for 1 second, leaving time for frame capture, and only then advances to the next yaw increment. Subsequent experiments confirmed the hypothesis, and with this fix we could correctly compute the relative coordinates of the ball when detected.
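The sketch below illustrates the settle-and-shoot scan as an rclpy-style loop: the robot rotates by a fixed yaw increment, stops, waits one second so the frame is sharp, captures, and repeats until it has covered 360°. The node structure, topic name, number of increments, and angular speed are assumptions; `capture_frame` stands in for reading the DepthAI detection queue.

```python
import math
import time
import rclpy
from rclpy.node import Node
from geometry_msgs.msg import Twist

class SettleAndShootScan(Node):
    """Rotate in fixed yaw increments, pausing before each capture to avoid motion blur."""

    def __init__(self, increments=12, angular_speed=0.5):
        super().__init__("settle_and_shoot_scan")
        self.cmd_pub = self.create_publisher(Twist, "cmd_vel", 10)
        self.increments = increments          # 12 steps of 30 degrees = full turn (assumed)
        self.angular_speed = angular_speed    # rad/s (assumed)

    def scan(self, capture_frame):
        step = 2.0 * math.pi / self.increments
        for _ in range(self.increments):
            self._rotate(step)
            self._stop()
            time.sleep(1.0)        # settle for 1 s so the next frame is sharp
            capture_frame()        # e.g. read the DepthAI detection queue

    def _rotate(self, yaw):
        twist = Twist()
        twist.angular.z = self.angular_speed
        end = time.time() + yaw / self.angular_speed  # open-loop timing; odometry could refine this
        while time.time() < end:
            self.cmd_pub.publish(twist)
            time.sleep(0.05)

    def _stop(self):
        self.cmd_pub.publish(Twist())

if __name__ == "__main__":
    rclpy.init()
    node = SettleAndShootScan()
    node.scan(capture_frame=lambda: None)  # placeholder capture callback
    rclpy.shutdown()
```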
The navigation component is responsible for planning and executing a path from the robot’s current position to the detected ball. Our hypothesis here was that once the ball is successfully localized in space, the existing path-planning algorithm (based on the robot’s SLAM map and obstacle avoidance) would be sufficient to approach the target reliably. To test this, we placed the ball in various positions within the robot’s field of view and initiated the navigation routine. The ball was placed in open areas, near walls, partially occluded, and among obstacles (but visible to the robot). The navigation system performed well when the path to the ball was relatively clear and the SLAM map accurately reflected the environment. However, in cluttered settings or when the ball was placed very close to obstacles, the robot occasionally failed to compute a viable path or hesitated in its approach. By default, the costmap inflates obstacles—that is, it adds a safety buffer around them to prevent collisions.
However, in our case, the inflation radius was too large: objects appeared significantly bigger than they actually were, making narrow passages or cluttered environments effectively impassable in the planner’s view. As a result, when the ball was placed near obstacles, the robot often struggled to find a viable path or took unnecessarily long detours. In practice, this made navigation toward the ball less reliable in constrained spaces. Fortunately, we identified this issue and were able to reduce the inflation radius and tune additional parameters in the Nav2 configuration file.
After thoroughly validating each component on its own, we moved on to integrated testing of the complete pipeline. These end-to-end tests were performed in a variety of real-world environments, such as classrooms, amphitheaters, and corridors in the CO building. The aim was to assess the overall coherence of the system—specifically, whether the transition between exploration, detection, and navigation occurred smoothly and whether the modules could function together without interference. These integrated experiments also allowed us to observe cumulative system behavior, such as total execution time, responsiveness, and robustness under varying spatial and visual conditions.
A significant limitation of our project was persistent hardware and connectivity problems. We lost valuable development time due to issues such as missing wheels, malfunctioning LiDAR and camera sensors, and frequent failures in maintaining a stable connection with the robot. These setbacks delayed progress and prevented us from dedicating more time to advanced functionalities we had initially planned. If the hardware had functioned correctly from the outset, we could have implemented more robust features and carried out more extensive testing.
Our depth calculations were slightly inaccurate when images were captured while the robot was rotating. This was primarily due to motion blur and synchronization issues with odometry data. The robot's movement introduced inconsistencies between the captured image and the odometry feedback, leading to errors in the computed depth and position of the ball. To mitigate this, we should have averaged the computed positions from the last x images instead of relying solely on the first computed position. By aggregating data over multiple frames, we could have reduced noise and improved the reliability of the ball's estimated coordinates.
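As a concrete illustration of this proposed mitigation, the small sketch below keeps a rolling window of the most recent position estimates and reports their mean; the window size and the use of a plain mean (rather than an outlier-robust estimate) are assumptions.

```python
from collections import deque

class BallPositionFilter:
    """Average the last N (x, y) estimates to smooth out motion-blur and sync noise."""

    def __init__(self, window: int = 5):
        self.estimates = deque(maxlen=window)

    def add(self, x: float, y: float):
        """Record a new estimate and return the current smoothed position."""
        self.estimates.append((x, y))
        n = len(self.estimates)
        mean_x = sum(p[0] for p in self.estimates) / n
        mean_y = sum(p[1] for p in self.estimates) / n
        return mean_x, mean_y

# Usage: feed each per-frame estimate in and navigate to the smoothed result.
filt = BallPositionFilter(window=5)
smoothed = filt.add(1.82, 0.41)  # example values
```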
While the mapping algorithm performed well in terms of obstacle avoidance and constructing accurate maps, we did not observe substantial improvements in exploration behavior due to visual coverage mapping.
It seemed as though the algorithm was not fully leveraging the visual coverage data to optimize its exploration strategy. However, during our testing, the robot's 360-degree rotation behavior proved sufficient to cover the majority of the environment. This rotation allowed the robot to scan its surroundings effectively, ensuring that most areas were explored without requiring additional adjustments to the exploration algorithm.
Maryam was primarily responsible for designing and implementing the core exploration algorithm. She developed the initial logic allowing the robot to autonomously explore unknown environments and identify frontier regions for navigation. She also implemented the 360° camera sweep mechanism at exploration start-up, enabling the robot to rotate on the spot and capture panoramic visual data. Furthermore, she integrated the logic to periodically capture images every 3 meters during exploration, laying the foundation for coupling vision with mapping.
Mohamed Taha played a major role in debugging critical issues, including waypoint list maintenance, ensuring dynamic frontier updates as the SLAM map expanded, and resolving hardware connection problems between the robot and the controlling software stack. His contributions were essential to enabling robust and scalable exploration. He also worked on the development of the visual coverage map, a feature intended to track which areas had been seen by the camera, as distinct from those simply scanned by LiDAR.
Mikail was in charge of the ball recognition pipeline using a YOLO-based object detection model. He handled the integration of the visual detector with the robot’s camera feed, and implemented the logic to extract 2D image coordinates of the ball, which could later be projected into real-world navigation targets. His work was crucial to enabling the robot to visually detect and locate its goal object.
Maryam and Mohamed Taha jointly conducted extensive testing across diverse real-world environments, including classrooms, amphitheaters, hallways, and basements. Their tests helped validate the system’s robustness and uncover limitations under varying spatial and visual conditions.
While we were unable to implement the extension of delivering objects to a human recipient due to time and hardware constraints, we identified a practical and simplified approach to achieve this functionality within the limits of our robot's capabilities. The delivery process would be based on straight-path planning and simple object pushing, with continuous feedback correction.
This approach offers a lightweight and achievable path toward object delivery with limited hardware. Future work could refine the LiDAR-based correction strategy, explore partial human tracking before the push, and evaluate delivery reliability under varied terrain and object shapes.
In this project, we successfully developed a mobile robot capable of autonomously navigating through complex environments to locate a given target object. The system integrates real-time ball detection, depth estimation, and odometry correction to achieve accurate localization and navigation. Our findings indicate that the combination of YOLO-based object detection and stereo depth sensing provides a robust solution for real-time target tracking, enabling the robot to effectively navigate towards the detected ball.
Future work should focus on enhancing the robustness of the navigation system, particularly in cluttered environments. This could involve refining the cost map parameters, improving the ball detection algorithm to handle occlusions and lighting variations, and integrating more advanced path-planning techniques. Additionally, exploring alternative hardware configurations and sensors could further improve the system’s performance and adaptability.
Overall, this project demonstrates the potential of combining hardware-native object detection with real-time depth sensing and odometry correction to enable autonomous navigation and manipulation tasks in robotics. The successful integration of these components highlights the importance of modular design and hypothesis-driven testing in developing robust robotic systems.