Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Posts

Scalable influence and fact tracing for large language model pretraining

3 minute read

Published: November 07, 2025

Figure: Difference between classical lexical retrieval and influence-based retrieval for large language models.

Why Language Models Hallucinate: The Epidemic of Penalizing Uncertainty

3 minute read

Published: November 07, 2025

Figure: Binary grading makes “guess when unsure” the optimal strategy → more hallucinations.
Confidence-aware grading (penalize wrong answers; allow IDK) makes abstention rational → fewer hallucinations.
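
The incentive described in the figure caption can be checked with a little arithmetic. The sketch below is illustrative only: the function names and the penalty values are assumptions, not taken from the post. It scores a correct answer as 1, an "IDK" as 0, and a wrong answer as minus a chosen penalty, then compares the expected score of answering against abstaining.

```python
def expected_score(p_correct: float, wrong_penalty: float) -> float:
    """Expected score of answering when the answer is right with probability p_correct.

    A correct answer earns 1 point; a wrong answer costs `wrong_penalty` points.
    (Hypothetical scoring rule, used here for illustration only.)
    """
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty


def best_action(p_correct: float, wrong_penalty: float) -> str:
    """Pick the action with the higher expected score; abstaining always scores 0."""
    return "answer" if expected_score(p_correct, wrong_penalty) > 0.0 else "abstain"


# Binary grading: wrong answers cost nothing, so guessing never hurts.
binary = best_action(p_correct=0.3, wrong_penalty=0.0)

# Confidence-aware grading: a wrong answer costs 1 point, so a 30% guess loses on average.
penalized = best_action(p_correct=0.3, wrong_penalty=1.0)

print(binary, penalized)
```

With a penalty of 1, answering only pays off when the model is more than 50% confident; raising the penalty raises that threshold.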

Teaching Humanoids Without MoCap: Inside TWIST2’s Portable Data Collection System

2 minute read

Published: November 05, 2025

Motivation

How do we collect humanlike motion data for robots without a $100K motion-capture studio?

What I Learned from Hackathons (and Losing One!)

less than 1 minute read

Published: October 29, 2025

Hackathons have been among the best learning experiences of my career.

5 Books That Changed How I Think About Machine Learning and Research

less than 1 minute read

Published: October 22, 2025

Books have shaped how I approach ML — not just as a technical field, but as a way of thinking.

What is Data Shapley? Measuring the True Value of Data

less than 1 minute read

Published: October 15, 2025

We often focus on model architectures — but what if the most valuable part of your ML system is your data?
Data Shapley assigns a contribution score to each training point, measuring its impact on model performance.
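
As a rough illustration of what "a contribution score for each training point" means, here is a minimal sketch of a permutation-sampling estimate of Data Shapley values. Everything concrete in it is an assumption made for the example: the 1-D toy data, the nearest-centroid classifier used as the performance measure, and the function names. Each point's score is its average marginal gain in validation accuracy when added to a random ordering of the training set.

```python
import random


def utility(subset, train, val):
    """Validation accuracy of a nearest-centroid classifier fit on train[i] for i in subset.

    Each example is a (feature, label) pair with a single float feature.
    An empty training subset is given utility 0.0 by convention.
    """
    if not subset:
        return 0.0
    sums, counts = {}, {}
    for i in subset:
        x, label = train[i]
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    correct = 0
    for x, label in val:
        pred = min(centroids, key=lambda c: abs(x - centroids[c]))
        correct += pred == label
    return correct / len(val)


def monte_carlo_shapley(train, val, n_permutations=200, seed=0):
    """Estimate each training point's Shapley value by averaging marginal gains
    over random permutations of the training set."""
    rng = random.Random(seed)
    n = len(train)
    values = [0.0] * n
    for _ in range(n_permutations):
        perm = list(range(n))
        rng.shuffle(perm)
        prev = utility([], train, val)
        for k, i in enumerate(perm):
            cur = utility(perm[: k + 1], train, val)
            values[i] += cur - prev  # marginal contribution of point i
            prev = cur
    return [v / n_permutations for v in values]


# Toy data (hypothetical): the last training point carries a suspicious label.
train = [(-2.0, 0), (-1.5, 0), (1.5, 1), (2.0, 1), (-1.8, 1)]
val = [(-2.2, 0), (-1.0, 0), (1.0, 1), (2.5, 1)]
scores = monte_carlo_shapley(train, val)
for point, score in zip(train, scores):
    print(point, round(score, 3))
```

By construction the scores sum to the accuracy of the model trained on all points, so they can be read as a split of total performance across the training data. Exact computation requires every subset, which is why sampling is the usual shortcut.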

Enhancing Cybersecurity Risk Assessment using Temporal Knowledge Graphs

less than 1 minute read

Published: September 13, 2025

My recent publication in Decision Support Systems (Elsevier, 2025) presents a temporal knowledge graph-based explainable decision support system (DSS) for cybersecurity risk assessment.

Explaining SENE: Manifold Learning for Distracted Driving Analysis

less than 1 minute read

Published: April 15, 2023

My first research paper, published in Engineering Applications of Artificial Intelligence (2023), proposed SENE — a novel manifold learning technique for analyzing distracted driving.

hackathons

Data4Good 2025: Building Trust in Educational AI through Factuality Verification

Built an ensemble factuality-verification pipeline for educational AI responses, achieving 99.03% balanced accuracy on a held-out competition test set.
Aligned with UN SDG 4 through safer and more trustworthy AI-assisted learning.

Open IIT Data Analytics Sponsored by Brillio

Predicted the popularity of 4,000+ songs using ensemble models; secured 1st place out of 48 teams.

HackGT 12: Crypt of Data BackpackMate AI

Developed BackpackMate AI — a travel-planning phone agent built with the Mastra Framework and LLM-based retrieval pipelines.

portfolio

EV Charging Network Optimization Dashboard

Optimizing EV charging infrastructure across urban regions using spatial clustering and demand analysis.

publications

SENE: A novel manifold learning approach for distracted driving analysis with spatio-temporal and driver praxeological features

Published in Engineering Applications of Artificial Intelligence, 2023

Although many studies have been conducted on distracted driving, the growing number of accidents on roads demands further serious attention. Most real-world distracted driving data are unlabeled and high-dimensional, making analyses complex. There is a lack of proper indices to understand the perilousness of distracted driving, making it difficult to identify roads or neighborhoods with higher risk of accidents. Previous studies focused either on spatio-temporal or praxeological factors separately, but did not consider both together. Furthermore, crisp rule extraction and interpretation are largely missing in the literature.

Enhancing cybersecurity risk assessment using temporal knowledge graph-based explainable decision support system

Published in Decision Support Systems, 2025

Assessing cybersecurity policies is crucial for organisations to combat evolving cyber threats. The absence of comprehensive datasets has prevented prior studies from analysing cybersecurity policy risks. Past studies also neglected temporal information in policies, and attention-based analyses often lack automated determination of optimal attention units. Furthermore, the absence of interpretability in cybersecurity studies creates a barrier to understanding policy vulnerabilities and developing targeted solutions.

Subhajit Bag
