Data Engineering + AI Systems

Nithin Kotha

Architecting end-to-end data and AI infrastructure, from ingestion to intelligent decision-making: reliable data foundations and AI systems built through thoughtful engineering and design.

Nithin Kotha — Azure Data Engineer
Open to roles
Azure DP-203 Certified
San Antonio, Texas, USA
4+ Years in Cloud
5+ Data Projects
M.S. Business Analytics

Building Data Systems
That Think

I'm a Data & AI Engineer who builds the infrastructure that turns raw data into operational intelligence.

Over the past few years, I've designed parameterized ETL pipelines across 50+ enterprise sources, architected Qlik replication for 80+ databases, and built RAG-based AI systems that let teams query their own data in natural language, from refinery engineering docs to real-time Synapse diagnostics.

I care about systems that are not just functional, but reliable, scalable, and genuinely useful to the people who depend on them every day.

Professional Experience

Sep 2025 — Present
Kingston Technologies Inc
Texas, USA
Client: Marathon Petroleum

Azure Data Engineer + AI Engineer

AI-Powered Data Intelligence and Enterprise-Scale Data Ingestion for Oil & Gas Infrastructure

  • Architected a RAG-based AI Agent for IT operational intelligence, enabling semantic search over Synapse, QLIK, and database logs for real-time workload anomaly detection.
  • Built a ReAct-style RAG architecture over engineering documentation, orchestrating cross-domain collaboration with Azure AI Foundry, LangChain, and OpenAI.
  • Engineered parameterized ingestion frameworks in ADF and Synapse, orchestrating data movement from 50+ sources into Medallion Architecture (bronze/silver/gold) curation layers.
  • Defined end-to-end Qlik replication topology for 80+ databases into ADLS Gen2, implementing log-based CDC strategies across multiple business domains.
  • Executed PySpark transformations in Synapse and Databricks with Delta Lake optimization and ACID guarantees for large-scale energy datasets (the core upsert pattern is sketched below).
Azure AI Foundry · Azure AI Search · LangChain · ADF · Synapse · QLIK · Delta Lake
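
A minimal sketch of the Delta Lake upsert pattern behind the last bullet, assuming a Databricks or Synapse Spark session with Delta support; the table paths and key columns (sensor_id, ts) are hypothetical placeholders:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical bronze/silver paths; sensor_id + ts act as the merge key.
updates = spark.read.format("delta").load("/mnt/bronze/readings")
target = DeltaTable.forPath(spark, "/mnt/silver/readings")

# MERGE executes as one atomic commit, which is what gives the upsert
# its ACID guarantees under concurrent readers and writers.
(target.alias("t")
    .merge(updates.alias("u"), "t.sensor_id = u.sensor_id AND t.ts = u.ts")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```

File compaction and Z-ordering (OPTIMIZE) run as separate maintenance jobs and are omitted here.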
Jun 2025 — Aug 2025
Soft Cloud Tech Inc
Texas, USA
Client: Marathon Petroleum

Azure Data Engineer

Operational Intelligence Hub: Event-driven Architecture and Real-time Streaming.

  • Architected a sub-second data serving layer using Cosmos DB & Synapse Link to handle high-velocity logistics telemetry and HTAP analytical processing.
  • Built Spark Structured Streaming pipelines in Databricks, processing 1M+ sensor events per hour from Event Hub into optimized Delta Lake storage (pipeline skeleton sketched below).
  • Engineered Delta Live Tables (DLT) pipelines for Kafka stream processing, enabling near-real-time business alerts on operational KPI breaches.
  • Configured a high-availability observability layer in Azure Monitor, using KQL to track Cosmos DB throughput and latency alongside Event Hub utilization.
  • Deployed Logic Apps & Function Apps as event-driven handlers, triggering inventory validation and API integrations on arrival of high-priority telemetry.
Cosmos DB · Databricks · Kafka · DLT · Event Hub · KQL
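
A skeleton of the streaming path described above, reading Event Hub through its Kafka-compatible endpoint; the namespace, topic, schema, and storage paths are placeholders, and the SASL auth options Event Hub requires are omitted for brevity:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType, TimestampType

spark = SparkSession.builder.getOrCreate()

# Hypothetical telemetry schema.
schema = (StructType()
          .add("device_id", StringType())
          .add("reading", DoubleType())
          .add("ts", TimestampType()))

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "<namespace>.servicebus.windows.net:9093")
       .option("subscribe", "telemetry")          # Event Hub name
       .option("startingOffsets", "latest")
       .load())

events = (raw.select(from_json(col("value").cast("string"), schema).alias("e"))
             .select("e.*"))

# Checkpointing makes the stream restartable with exactly-once sink writes.
(events.writeStream.format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/telemetry")
    .outputMode("append")
    .start("/mnt/delta/telemetry"))
```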
Feb 2024 — May 2024
Soft Cloud Tech Inc
Texas, USA

Data Engineer Intern

Enterprise Migration Blueprint: Cloud Infrastructure, Security, and Foundation.

  • Orchestrated secure migration of legacy on-prem systems into ADLS Gen2 using ADF VNET-integrated private endpoints and dynamic template-led orchestration.
  • Provisioned Infrastructure-as-Code landing zones via Terraform for Synapse and Databricks with standardized RBAC security and audit logging.
  • Developed PySpark batch jobs in Synapse Spark Pools, implementing SCD Type 2 logic for the core enterprise data warehouse transition (pattern sketched below).
  • Established robust DevOps CI/CD to automate deployment of cloud artifacts across environments with mandatory security gate checks.
  • Leveraged Synapse Serverless SQL for cost-effective ad-hoc exploration of PB-scale data, accelerating insight delivery for Supply Chain teams.
ADF · VNET · Terraform · Azure DevOps · Synapse DW · PySpark
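
A hedged sketch of the SCD Type 2 pattern from the PySpark bullet, implemented here with Delta MERGE; the dimension path, customer_id key, and precomputed attr_hash change-detection column are all hypothetical:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
dim_path = "/mnt/dw/dim_customer"
incoming = spark.read.format("delta").load("/mnt/staging/customers")

# Rows that are brand new or whose attribute hash changed.
current = (spark.read.format("delta").load(dim_path)
           .filter("is_current = true")
           .select("customer_id", F.col("attr_hash").alias("old_hash")))
changed = (incoming.join(current, "customer_id", "left")
           .filter("old_hash IS NULL OR old_hash <> attr_hash")
           .drop("old_hash"))

# Step 1: close the open version of every changed key.
(DeltaTable.forPath(spark, dim_path).alias("d")
    .merge(changed.alias("s"),
           "d.customer_id = s.customer_id AND d.is_current = true")
    .whenMatchedUpdate(set={"is_current": "false",
                            "end_date": "current_date()"})
    .execute())

# Step 2: append the new versions as open (current) rows.
(changed.withColumn("is_current", F.lit(True))
        .withColumn("effective_date", F.current_date())
        .withColumn("end_date", F.lit(None).cast("date"))
        .write.format("delta").mode("append").save(dim_path))
```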
Feb 2021 — Nov 2021
Datacipher Solutions Pvt Ltd
Hyderabad, India

Data Analyst Intern

Self-Service Modernization: BI Automation and Data Accuracy Utility.

  • Developed Power BI dashboards for sales and finance, consolidating disparate data sources into a single source of truth using DAX and RLS.
  • Created Python automation toolkits for cleansing legacy SAP exports, eliminating 20+ hours of manual weekly reporting effort (cleanup pattern sketched below).
  • Optimized SQL scripts on Oracle and SQL Server, improving reporting performance by 30% through query tuning for internal auditing.
  • Onboarded critical on-prem datasets to Azure SQL Database, establishing the first cloud-based reporting utility for the business units.
Power BI · Python · Azure SQL · SQL Tuning · Automation
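
An illustrative sketch of the kind of cleanup those Python toolkits performed, assuming a CSV export; the file name and the net_value column are hypothetical:

```python
import pandas as pd

raw = pd.read_csv("sap_export.csv", dtype=str)

# Normalize headers, drop empty rows, and dedupe the export.
cleaned = (raw
    .rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    .dropna(how="all")
    .drop_duplicates())

# SAP exports often serialize numbers as text with thousands separators.
cleaned["net_value"] = (cleaned["net_value"]
                        .str.replace(",", "", regex=False)
                        .astype(float))

cleaned.to_csv("sap_export_clean.csv", index=False)
```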

Featured Work

AI · RAG

Document QA System

RAG-powered chatbot for querying documents

Built a question-answering system that lets users query document collections in natural language. The system processes PDFs and text files into 500-token chunks, generates embeddings with the OpenAI API, stores vectors in ChromaDB for fast similarity search, retrieves the relevant chunks and generates answers with GPT, and serves a simple Streamlit interface with source citations.
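
A minimal sketch of that ingest-and-query path; the embedding and chat model names are assumptions (the description only says OpenAI API and GPT), and the chunk text, collection name, and question are placeholders:

```python
import chromadb
from openai import OpenAI

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
store = chromadb.PersistentClient(path="./chroma_store")
docs = store.get_or_create_collection("docs")

def embed(texts):
    resp = oai.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

# Ingest: chunks come from an upstream 500-token splitter (not shown).
chunks = ["chunk one ...", "chunk two ..."]
docs.add(ids=[f"c{i}" for i in range(len(chunks))],
         documents=chunks,
         embeddings=embed(chunks))

# Query: retrieve the closest chunks, then let GPT answer from them.
question = "What does the maintenance procedure require?"
hits = docs.query(query_embeddings=embed([question]), n_results=3)
context = "\n\n".join(hits["documents"][0])
answer = oai.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Answer using only this context:\n{context}\n\nQ: {question}"}])
print(answer.choices[0].message.content)
```

The hybrid search mentioned under Key Learning would add a keyword scorer such as BM25 alongside the vector query and merge the two rankings; that layer is omitted here.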

Results: Sub-second response times for most queries. Tested on 200+ documents. Cost: ~$10/month in API calls.

Key Learning: Chunk size significantly impacts retrieval quality. Hybrid search (keyword + semantic) worked better than vector search alone.

Python · OpenAI API · LangChain · ChromaDB · Streamlit
Data Pipeline

Sales Analytics Pipeline

Batch data pipeline with Azure services

Built an ETL pipeline to transform raw sales data into analytics-ready tables for business reporting. Architecture includes Azure Data Factory for daily data ingestion, Databricks for processing 100K+ records using PySpark, Delta Lake for storing transformed data with versioning, dbt for data modeling (star schema), and Power BI for visualization.

Impact: Automated manual reporting process (saved ~5 hours/week). Reliable daily updates instead of ad-hoc exports. Handles incremental loads efficiently.

Challenges: The initial load took 45 minutes; partitioning brought it down to 12. Also had to handle late-arriving data and duplicates, as sketched below.
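
A sketch of the dedup-and-partitioning pattern behind those fixes; the column names (order_id, updated_at, order_date) are hypothetical:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
sales = spark.read.format("delta").load("/mnt/bronze/sales")

# Keep only the latest record per order: drops duplicates and lets
# late-arriving corrections supersede earlier rows.
w = Window.partitionBy("order_id").orderBy(F.col("updated_at").desc())
latest = (sales.withColumn("rn", F.row_number().over(w))
               .filter("rn = 1")
               .drop("rn"))

# Partitioning by date means incremental runs touch only recent folders.
(latest.write.format("delta")
       .mode("overwrite")
       .partitionBy("order_date")
       .save("/mnt/silver/sales"))
```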

Azure (ADF, Databricks) · PySpark · dbt · SQL · Delta Lake
Observability

Live Azure Data Pipeline Monitoring

Real-time ETL pipeline observability system

Built a monitoring solution to track Azure Data Factory and Synapse Analytics pipelines in real time, eliminating manual pipeline health checks. The system captures pipeline execution logs from Data Factory and Synapse and ships them to an Azure Log Analytics workspace; dashboards in the Azure Portal visualize pipeline runs and track success rates, execution times, and failure patterns, with proactive alerts on pipeline failures.
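
A sketch of pulling the same health metrics programmatically, assuming diagnostics flow into the standard ADFPipelineRun table and using the azure-monitor-query SDK; the workspace ID is a placeholder:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Hourly run counts and failure counts per pipeline over the last day.
kql = """
ADFPipelineRun
| where Status in ('Succeeded', 'Failed')
| summarize runs = count(), failures = countif(Status == 'Failed')
          by PipelineName, bin(TimeGenerated, 1h)
| order by TimeGenerated desc
"""

resp = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",
    query=kql,
    timespan=timedelta(hours=24))

for table in resp.tables:
    for row in table.rows:
        print(list(row))
```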

Impact: Reduced incident detection time by 80%. Eliminated daily manual health check reviews. Enabled data-driven optimization of pipeline performance.

Azure (Data Factory, Synapse, Log Analytics) · KQL · Azure Portal
Automation

Automated Report Generator

Scheduled analytics reports using Airflow

Built a workflow automation system that generates weekly analytics reports and emails them to stakeholders. The Airflow DAG runs every Monday morning: it extracts data from Azure SQL Database, runs aggregations and trend analysis, generates a PDF report with charts, and emails it to the distribution list automatically.
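
A minimal skeleton of that DAG, assuming Airflow 2.4+ (the `schedule` parameter); the task bodies are stubs for the extract/render/email steps described above:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def build_report():
    # Extract from Azure SQL, aggregate, and render the PDF (not shown).
    ...

def email_report():
    # Send the generated PDF to the distribution list (not shown).
    ...

with DAG(
    dag_id="weekly_analytics_report",
    start_date=datetime(2024, 1, 1),
    schedule="0 8 * * 1",  # every Monday morning
    catchup=False,
) as dag:
    build = PythonOperator(task_id="build_report", python_callable=build_report)
    send = PythonOperator(task_id="email_report", python_callable=email_report)
    build >> send
```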

Results: Eliminated 2 hours of manual work per week. Consistent report format and timing. Easy to add new metrics to template.

Learning: Airflow has a learning curve but automation is worth it. Started simple with BashOperator, then moved to PythonOperator for complex logic.

Airflow · Python · SQL · Pandas · Plotly

Technical Skills

Use Regularly

Azure Data Factory · Synapse Analytics · QLIK · Databricks · Logic Apps & Function Apps · Python · SQL · Delta Lake · PySpark · OpenAI API · CI/CD · Git & GitHub

Learning & Experimenting

Vector Databases (Cosmos DB, ChromaDB) · Azure AI Foundry · RAG Systems · Docker · Terraform · Agent Frameworks (AutoGen, LangGraph) · FastAPI · LangChain

Next Up

Multi-agent orchestration patterns · Production LLMOps · System design for AI applications · Vector search techniques (BM25, semantic hybrid) · LLM fine-tuning & evaluation

Engineering Mindset

01

Build for Intelligence

Data has no value sitting in a lake. Every pipeline I build is designed to make information accessible, queryable, and actionable — whether it feeds a Power BI dashboard or a RAG-powered AI agent.

02

Engineer for Trust

If a pipeline fails silently, the data can't be trusted. I build systems with self-healing logic, automated alerting, and observability baked in — so teams can rely on the data without second-guessing it.

03

Think in Systems

A good pipeline solves today's problem. A great architecture handles tomorrow's scale. I design with modularity, reusable patterns, and Infrastructure-as-Code so systems grow without breaking.

Interested in what I build?

Open to senior data engineering roles and technical collaboration.