Data Engineering + AI Systems

Nithin Kotha

Architecting end-to-end data and AI infrastructure, from ingestion to intelligent decision-making: reliable data foundations and AI systems built through thoughtful engineering and design.

Open to roles
Azure DP-203 Certified
San Antonio, Texas, USA
4+ Years in Cloud
5+ Data Projects
M.S. Business Analytics

From Data Foundation
to Intelligent Systems

I'm a data engineer focused on building data pipelines and intelligent systems.

My background includes internships in data engineering and analytics, where I built ETL pipelines, automated reporting workflows, and worked with Azure cloud services. My work experience spans designing data lake structures, building ADF and Synapse Analytics ETL pipelines, and orchestrating large-scale cloud migrations—from on-prem systems to modern cloud storage and analytics platforms.

Now I'm focused on AI applications—specifically RAG systems, intelligent agents, and production deployment patterns.

Technical Skills

Use Regularly

Azure Data Factory · Synapse Analytics · QLIK · Databricks · Logic Apps & Function Apps · Python · SQL · Delta Lake · PySpark · OpenAI API · CI/CD · Git & GitHub

Learning & Experimenting

Vector Databases (Cosmos DB, ChromaDB) · Azure AI Foundry · RAG Systems · Docker · Terraform · Agent Frameworks (AutoGen, LangGraph) · FastAPI · LangChain

Next Up

Multi-agent orchestration patterns · Production LLMOps · System design for AI applications · Hybrid search techniques (BM25 + semantic) · LLM fine-tuning & evaluation

Featured Work

AI · RAG

Document QA System

RAG-powered chatbot for querying documents

Built a question-answering system that lets users query document collections in natural language. The system processes PDFs and text files into 500-token chunks, generates embeddings with the OpenAI API, stores the vectors in ChromaDB for fast similarity search, retrieves the most relevant chunks to generate answers with GPT, and exposes a simple Streamlit interface with source citations.
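As a rough illustration of the chunking step (the function name and whitespace tokenization are my own simplifications; a production pipeline would count model tokens, e.g. with tiktoken):

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into roughly chunk_size-token chunks, with `overlap`
    tokens shared between neighbors so context isn't cut mid-thought.

    Tokens are approximated by whitespace splitting here; chunk_size
    must be larger than overlap for the loop to advance.
    """
    tokens = text.split()
    if not tokens:
        return []
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # final chunk reached the end of the document
    return chunks
```

Each chunk then gets embedded and stored; the overlap is a common default rather than anything the project mandates.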

Results: Sub-second response times for most queries. Tested on 200+ documents. Cost: ~$10/month in API calls.

Key Learning: Chunk size significantly impacts retrieval quality. Hybrid search (keyword + semantic) worked better than vector search alone.
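A minimal sketch of how a hybrid score can blend the two signals; simple keyword overlap stands in here for a proper BM25 scorer, and `alpha` and the function names are illustrative, not the project's actual implementation:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query terms that appear in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(query: str, doc: str,
                 q_vec: list[float], d_vec: list[float],
                 alpha: float = 0.5) -> float:
    # alpha blends semantic (vector) relevance with keyword relevance.
    return alpha * cosine(q_vec, d_vec) + (1 - alpha) * keyword_score(query, doc)
```

Tuning the blend weight per corpus is the usual knob when hybrid retrieval outperforms either signal alone.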

Python · OpenAI API · LangChain · ChromaDB · Streamlit

Data Pipeline

Sales Analytics Pipeline

Batch data pipeline with Azure services

Built an ETL pipeline to transform raw sales data into analytics-ready tables for business reporting. Architecture includes Azure Data Factory for daily data ingestion, Databricks for processing 100K+ records using PySpark, Delta Lake for storing transformed data with versioning, dbt for data modeling (star schema), and Power BI for visualization.

Impact: Automated a manual reporting process (saving ~5 hours/week). Reliable daily updates instead of ad-hoc exports. Handles incremental loads efficiently.

Challenges: The initial load took 45 minutes; partitioning cut it to 12. Also had to handle late-arriving data and duplicates.
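The duplicate and late-arrival handling can be sketched in plain Python (field names `order_id` and `updated_at` are hypothetical; in a Delta Lake pipeline this is the logic a `MERGE INTO` with a timestamp guard expresses):

```python
def upsert_latest(existing: dict, incoming: list[dict]) -> dict:
    """Merge incoming records into current state, keyed by order_id,
    keeping whichever record has the newest updated_at timestamp.
    One pass handles both exact duplicates and late-arriving corrections.
    """
    for rec in incoming:
        key = rec["order_id"]
        cur = existing.get(key)
        if cur is None or rec["updated_at"] > cur["updated_at"]:
            existing[key] = rec  # newer version wins; stale arrivals are ignored
    return existing
```

The strict `>` means an exact duplicate (same timestamp) is dropped rather than rewritten, which keeps incremental loads idempotent.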

Azure (ADF, Databricks) · PySpark · dbt · SQL · Delta Lake

Observability

Live Azure Data Pipeline Monitoring

Real-time ETL pipeline observability system

Built a monitoring solution to track Azure Data Factory and Synapse Analytics pipelines in real time, eliminating manual pipeline health checks. The system captures pipeline execution logs from Data Factory and Synapse, ships them to an Azure Log Analytics Workspace, surfaces dashboards in the Azure Portal for visualizing pipeline runs, tracks success rates, execution times, and failure patterns, and raises proactive alerts on pipeline failures.
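In Log Analytics these dashboard metrics come from a KQL `summarize` over the run logs; the equivalent aggregation, with hypothetical field names, looks like this in Python:

```python
from collections import defaultdict

def summarize_runs(runs: list[dict]) -> dict:
    """Aggregate pipeline run logs into per-pipeline dashboard metrics:
    run count, success rate (%), and average duration in seconds."""
    stats = defaultdict(lambda: {"runs": 0, "failed": 0, "total_secs": 0.0})
    for r in runs:
        s = stats[r["pipeline"]]
        s["runs"] += 1
        s["failed"] += r["status"] == "Failed"  # bool counts as 0 or 1
        s["total_secs"] += r["duration_secs"]
    return {
        name: {
            "runs": s["runs"],
            "success_rate": round(100.0 * (s["runs"] - s["failed"]) / s["runs"], 1),
            "avg_secs": round(s["total_secs"] / s["runs"], 1),
        }
        for name, s in stats.items()
    }
```

Alert rules then fire when a pipeline's success rate dips below a threshold, rather than waiting for a manual review.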

Impact: Reduced incident detection time by 80%. Eliminated daily manual health check reviews. Enabled data-driven optimization of pipeline performance.

Azure (Data Factory, Synapse, Log Analytics) · KQL · Azure Portal

Automation

Automated Report Generator

Scheduled analytics reports using Airflow

Built a workflow automation system that generates weekly analytics reports and emails them to stakeholders. The Airflow DAG runs every Monday morning, extracts data from an Azure SQL Database, runs aggregations and trend analysis, generates a PDF report with charts, and emails it to a distribution list automatically.
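The trend-analysis step that a PythonOperator task might run can be sketched as follows (the function signature and metric names are assumptions, not the project's actual code):

```python
def week_over_week(current: dict[str, float],
                   previous: dict[str, float]) -> dict[str, dict]:
    """Compare this week's metrics against last week's for the report.
    Returns value, absolute change, and percent change per metric;
    change fields are None when there is no prior baseline."""
    report = {}
    for metric, value in current.items():
        prev = previous.get(metric)
        change = None if prev is None else value - prev
        pct = None if not prev else round(100.0 * (value - prev) / prev, 1)
        report[metric] = {"value": value, "change": change, "pct_change": pct}
    return report
```

The DAG's downstream task would render these deltas into the PDF charts before the email step.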

Results: Eliminated 2 hours of manual work per week. Consistent report format and timing. Easy to add new metrics to template.

Learning: Airflow has a learning curve but automation is worth it. Started simple with BashOperator, then moved to PythonOperator for complex logic.

Airflow · Python · SQL · Pandas · Plotly

Engineering Mindset

01

Performance as a Culture

Efficiency isn't an afterthought. Every Spark partition, every ADF trigger, and every SQL query is optimized for the highest possible throughput and lowest cost per unit.

02

Resilience by Design

I engineer for failure so the system doesn't. My architectures prioritize self-healing pipelines, detailed observability, and high-availability failover patterns.

03

Architectural Scalability

A system isn't finished until it can scale 100x. I use Infrastructure-as-Code (Terraform) and modular pipeline design to handle enterprise-scale growth without manual overhead.

Professional Experience

Sep 2024 — Present
Kingston Technologies Inc
Texas, USA
Client: Marathon Petroleum

Azure Data Engineer

Building ETL/ELT data pipelines and ingestion frameworks for enterprise energy infrastructure, connecting 50+ data sources to cloud analytics.

  • Architected automated ADF and Databricks pipelines integrating 50+ heterogeneous sources including Oracle, SAP, Salesforce, and S3.
  • Built real-time replication frameworks using QLIK Replicate for high-throughput data movement across distributed cloud architectures.
  • Developed Python-native AI solutions for SharePoint ingestion, leveraging Document Intelligence for NLQ (Natural Language Querying).
  • Orchestrated live stream processing for IoT data using Event Hubs, Stream Analytics, and Serverless Logic Apps.
  • Managed large-scale cloud resources (Synapse, Cosmos DB, Function Apps) using Terraform (IaC) and automated DevOps CI/CD.
Databricks · ADF · QLIK Replicate · Terraform · Event Hubs
Jun 2024 — Aug 2024
Soft Cloud Tech Inc
Texas, USA
Client: Marathon Petroleum

Azure Data Engineer

Migrated legacy Oracle databases to Azure and built high-performance data pipelines for analytics.

  • Orchestrated the migration of legacy Oracle ecosystems to ADLS Gen2 and Synapse Analytics using automated ADF integration.
  • Mastered complex ADF Orchestration (Lookup, ForEach, Until) and automated pipeline migration between environments via ARM Templates.
  • Developed 15+ high-performance PL/SQL stored procedures for data extraction and transformation in the Data Warehouse.
  • Implemented Kafka-Confluent streaming pipelines for incremental loading of Postgres data into scalable cloud storage.
Synapse Analytics · Kafka · Oracle Migration · ARM Templates
Feb 2024 — May 2024
Soft Cloud Tech Inc
Texas, USA

Data Engineer Intern

Built data lake structures and Informatica ETL workflows for enterprise data integration.

  • Designed and implemented enterprise-scale Data Lakes to support rapid analytics and processing of big data.
  • Engineered Informatica PowerCenter workflows and dynamic mappings, leveraging IDQ (Informatica Data Quality) for data validation and quality checks.
  • Applied in-depth knowledge of Hadoop Architecture (HDFS, Resource Manager) to optimize transformations across hybrid platforms.
  • Developed PySpark and Bash scripts for seamless data loading across on-premises and cloud targets.
Informatica PowerCenter · PySpark · IDQ · Hadoop Ecosystem
Feb 2021 — Nov 2021
Datacipher Solutions Pvt Ltd
Hyderabad, India

Data Analyst Intern

Built SSIS ETL packages and data integrations between SAP, Oracle, and SQL Server.

  • Developed SSIS ETL packages reducing data loading latency by an average of 20% through structural optimizations.
  • Orchestrated data flows between SAP, SQL Server, and Oracle DBs using Oracle Data Integrator (ODI).
  • Implemented Slowly Changing Dimension (SCD) logic and comprehensive data lineage tracking for warehouse integrity.
  • Designed high-integrity Informatica mappings to seamlessly unify data from diverse enterprise sources.
SSIS · Oracle ODI · Informatica · SCD Logic

Interested in what I build?

Open to senior data engineering roles and technical collaboration.