# Atabak Mardan, PhD - Full Content Catalog

> This file contains the complete consolidated resume, projects list, publication details, and blog post content of Atabak Mardan, PhD, optimized for offline LLM parsing and context injection.

---

## 1. Resume & Experience

### Work Experience

#### Data Scientist III | Choice Hotels International (Feb 2023 - Present)

Deploying ML, GenAI, and analytics products on AWS utilizing customer databases, partnering cross-functionally with marketing and sales:

- **Sales Lead Scoring Platform:** Engineered an LLM/ML pipeline to rank B2B opportunities. Delivered via Tableau/Salesforce, cutting research time for 6,000 properties from hours to minutes.
- **Customer Lifetime Value Model:** Constructed an XGBoost CLV scoring model on 50M+ transactions, increasing segment accuracy by 25% to optimize $100M+ in fee decisions.
- **External Data Matching Pipeline:** Created a multi-stage entity-resolution engine utilizing fuzzy strings and LLM vector similarity on AWS Glue, boosting matching accuracy from 50% to 96%.
- **Climate Risk Dashboard:** Synthesized multi-source environmental data (charging, solar, climate) into property KPIs with LLM-generated summaries via AWS Bedrock.

#### Lead Transportation Engineer & Network Analyst | ICF International (Sep 2021 - Feb 2023)

Delivered advanced analytics and geospatial pipelines for federal and state DOT clients (FHWA, NYSDOT, VDOT):

- **National Transit Inventory:** Designed a Python/GIS quality-audit system for GTFS transit data across the FHWA national inventory.
- **Accessibility Modeling:** Built isochrone models to assess walking/biking access gaps for NYSDOT transit corridors in the Bronx.
- **Travel-Time Reliability:** Developed and calibrated public travel reliability metrics in R for Virginia's Office of Intermodal Planning.
#### Transportation Systems Modeler & Data Specialist | C&M Associates (Aug 2019 - Sep 2021)

Managed simulation pipelines, pricing forecasting, and demand analysis for toll roads:

- **Revenue Forecasting:** Built long-term demand and traffic forecasts for major infrastructure projects, including the I-495 extension.
- **Sensitivities & Calibration:** Modeled Dulles Toll Road 30-year pricing scenarios using calibrated Python/Excel algorithms.

#### Graduate Research Assistant | George Mason University (Aug 2017 - May 2020)

Cleaned raw sensor database feeds and developed a VDOT Tableau traffic visualization dashboard from scratch, implementing sensor reliability indicators.

### Education

- **PhD, Transportation Systems** (George Mason University, 2025). Dissertation topic: Modelling and optimization of dynamic pricing toll roads (Express Toll Lanes).
- **MSc, Data Analytics Engineering** (George Mason University, 2019).
- **MSc, Transportation Systems** (Middle East Technical University, 2015).

---

## 2. Selected Publications

1. **Expressed Toll Lane Pricing Policy**
   - **Authors:** A Mardan, Z Qi, Z Zhu, S Zhu
   - **Venue:** Transportmetrica A: Transport Science 22 (1), 2368003 (2026)
   - **Citations:** 2
   - **Google Scholar:** https://scholar.google.com/citations?view_op=view_citation&hl=en&user=6oyskuoAAAAJ&citation_for_view=6oyskuoAAAAJ:YsMSGLbcyi4C
   - **Summary:** This research models the complex relationship between route choice behavior and travel time reliability using a generalized Bayesian traffic model. It introduces a theoretical framework to account for uncertainty in commuters' perception of travel times and validates it using real-world congestion datasets.
2. **Dynamic Pricing Toll Roads in the US**
   - **Authors:** AV Farias, S Zhu, A Mardan
   - **Venue:** Case Studies on Transport Policy 17, 101226 (2024)
   - **Citations:** 12
   - **Google Scholar:** https://scholar.google.com/citations?view_op=view_citation&hl=en&user=6oyskuoAAAAJ&citation_for_view=6oyskuoAAAAJ:W7OEmFMy1HYC
   - **Summary:** Presents a comprehensive review and comparative analysis of Express Toll Lane (ETL) pricing algorithms and operational strategies in the United States, evaluating equity concerns, funding mechanisms, and pricing optimization under different demand profiles.
3. **Hidden Markov Modeling of Travel Behavior**
   - **Authors:** Z Zhu, S Zhu, L Sun, A Mardan
   - **Venue:** Transportmetrica A: Transport Science 20 (1), 2130731 (2024)
   - **Citations:** 14
   - **Google Scholar:** https://scholar.google.com/citations?view_op=view_citation&hl=en&user=6oyskuoAAAAJ&citation_for_view=6oyskuoAAAAJ:Tyk-4Ss8FVUC
   - **Summary:** Introduces a high-order hidden Markov model (HMM) framework to capture temporal changes in travel behavior. By analyzing sequence data of traveler behaviors, the model uncovers latent states of travel mode choices and transition probabilities.
4. **Generalized Bayesian Traffic Reliability**
   - **Authors:** Z Zhu, A Mardan, S Zhu, H Yang
   - **Venue:** Transportation Research Part B: Methodological 143, 48-64 (2021)
   - **Citations:** 49
   - **Google Scholar:** https://scholar.google.com/citations?view_op=view_citation&hl=en&user=6oyskuoAAAAJ&citation_for_view=6oyskuoAAAAJ:UeHWp8X0CEIC
   - **Summary:** An academic contribution exploring traffic engineering, logistics planning, or public infrastructure optimization. Focuses on data-driven methodologies, pricing algorithms, and empirical transport policy evaluations.

---

## 3. Shipped Data Products & Projects

### D.C. Crime Intelligence Portal

- **URL:** https://atabak.app/projects/dc-crime-intelligence-portal/index.html
- **Tech Stack:** Leaflet.js, Chart.js, Python, Open Data DC APIs
- **Description:** Real-time command dashboard processing Open Data DC APIs via parallelized statistical queries. Features dynamic density heatmaps, spatial clustering, and automated trend extraction for immediate tactical insights.

### Antigravity Hotel Competitors

- **URL:** https://atabak.app/projects/antigravity-hotel-competitors/index.html
- **Tech Stack:** Streamlit, Plotly, Google Maps API, Python
- **Description:** Algorithmic pricing matrix replacing traditional comp-sets. Integrates Google Places spatial coordinates with Booking.com temporal pricing APIs and drive-time routing to generate real-time market indices.

### AI Anchor Console (For Fun)

- **URL:** https://atabak.app/projects/ai-anchor-console/index.html
- **Tech Stack:** Gemini API, Web Speech API, Vite, JavaScript
- **Description:** An active broadcast desk agent that scrapes news articles, balances rundown timings, and synthesizes continuous TV scripts tailored dynamically to the user's interests.

### SchoolScore

- **URL:** https://atabak.app/projects/SchoolScore/
- **Tech Stack:** React 19, TypeScript, Tailwind v4, Leaflet, Recharts
- **Description:** A customizable school ranking and scorecard dashboard for Fairfax County. Allows parents to adjust weights for performance (math, reading, science, history pass rates) and equity metrics (disadvantaged, disability, ESOL pass rates) to calculate custom, localized scores for elementary, middle, and high schools.

### Traffic Signal Project

- **URL:** https://atabak.app/projects/traffic-signal-project-modern-ui/index.html
- **Tech Stack:** Leaflet.js, Chart.js, HTML5/CSS3
- **Description:** Interactive traffic corridor performance and signal delay dashboard, modernizing an older Tableau study into a high-contrast web app showcasing traffic volume and road reliability metrics.
### Phantom Purple Command Center

- **URL:** https://atabak.app/projects/phantom-purple-command-center/index.html
- **Tech Stack:** HTML5, CSS3, JavaScript, Leaflet.js
- **Description:** Real-time command center interface and tracking map for the Phantom Purple project.

---

## 4. Technical Insights (Articles)

### Article 1: Advanced Entity Resolution: Combining Fuzzy Matching and Vector Similarity

Entity resolution, the task of identifying and linking records that refer to the same real-world entity across disparate datasets, is a classic data engineering challenge. This article outlines the architecture of a production pipeline built on AWS Glue to de-duplicate and match external vendor databases against a leading hospitality chain's inventory of over 60,000 global properties.

#### The Problem

The hospitality chain receives lodging records from multiple third-party vendors and partners. Linking these records to our internal Master Property Database is plagued by inconsistencies: mismatched abbreviations (e.g., "St." vs "Street"), phonetic typos, minor differences in geolocation coordinates, or completely different brand naming conventions (e.g., "Clarion Inn Rockville" vs "Clarion Rockville"). A simple SQL join on name or coordinates yielded a sub-par 50% match rate, leaving thousands of orphan records.

#### The Multi-Stage Matching Architecture

To solve this, I designed and implemented a multi-stage funnel algorithm executed inside a PySpark cluster. Rather than comparing every possible pair of records (which is computationally infeasible at scale), we pass records through a cascading series of filters:

> "By routing matches through a funnel of increasing computational complexity, we isolate difficult edge-cases for deep-learning vector models, while resolving 80% of matches using lightning-fast string distance comparisons."

The funnel consists of four distinct layers:

1. **Stage 1: Deterministic Geohash Blocking:** We group hotels within a tight geospatial geohash boundary. This avoids comparing properties that are thousands of miles apart.
2. **Stage 2: Token-Based TF-IDF String Matching:** Within each geohash block, we compare normalized string properties (Name, Address) using cosine similarity between TF-IDF token vectors. Easy matches are linked and removed from the pool.
3. **Stage 3: Fuzzy Distance Scoring (Jaro-Winkler & Levenshtein):** For remaining records, we run character-level distance metrics on addresses and names to resolve phonetic discrepancies and typos.
4. **Stage 4: LLM Semantic Embeddings (The Final Funnel):** For the remaining 5% of highly ambiguous records (where names and addresses look completely different but represent the same venue), we compute dense vector embeddings using an LLM via AWS Bedrock (Titan/Claude). We then perform a cosine similarity lookup on the resulting vectors.

#### Spark UDF Implementation Example

```python
# Spark UDF example computing fuzzy token set ratio
from fuzzywuzzy import fuzz
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType


@udf(returnType=DoubleType())
def compute_fuzzy_ratio(name_a, name_b):
    # Missing or empty names cannot be compared, so score them 0.0
    if not name_a or not name_b:
        return 0.0
    # token_set_ratio returns an integer in 0-100; rescale to 0.0-1.0
    return float(fuzz.token_set_ratio(name_a, name_b)) / 100.0
```

#### Business Impact

This multi-stage framework raised our match accuracy from 50% to 96%. The pipeline runs weekly on AWS Glue, with scheduling and monitoring handled by Apache Airflow. By resolving these records, we unlocked millions of dollars in previously untracked market analytics and competitor intelligence insights.

---

### Article 2: Scaling CLV Models: Machine Learning over 50 Million Records

Customer Lifetime Value (CLV) is the cornerstone metric for modern marketing strategy. Knowing how much a customer is expected to spend over their lifecycle allows businesses to optimize customer acquisition costs (CAC) and retention campaigns.
Here, I break down the implementation details of an XGBoost-based CLV model trained on 50M+ transactional records at a leading hospitality chain.

#### The Modeling Framework

To predict customer value, we chose a hybrid approach. While classical probabilistic models (like the BG/NBD and Gamma-Gamma models) are highly effective for simple purchase frequencies, they struggle to capture rich features like search behaviors, loyalty status tiers, seasonal booking coefficients, and regional discounts.

We designed an extreme gradient boosting model (XGBoost Regressor) optimized for long-tail distributions. The modeling pipeline involves three core steps:

1. **Feature Engineering:** We aggregated historical lodging transactions into Recency, Frequency, and Monetary (RFM) components, layered with dynamic features like hotel brand diversity, average booking lead-time, and mobile app usage metrics.
2. **Deduplication Pipeline:** Standard loyalty profiles suffer from profile duplication. We built a customer profile deduplication pipeline using geospatial and name matching to clean transaction histories before feeding them to the training set.
3. **Hyperparameter Tuning:** Given the size of the dataset (50M+ transactions), we utilized a distributed Spark-based hyperparameter search (using Hyperopt) to optimize tree depth, learning rate, and regularization terms, preventing over-fitting on extreme spenders.

#### The Model Equation

The CLV target is defined as the net present value of expected future cash flows over a 3-year horizon:

> CLV_3Yr = Sum_{t=1..3} [ (Expected Transactions_t * Average Order Value_t) / (1 + Discount Rate)^t ]

#### Key Results

Our model outperformed historical heuristic models, improving customer segment prediction accuracy by 25%. This high-precision classification system was immediately integrated into the hospitality chain's marketing campaign manager, directly informing targeting decisions on $100M+ in annual fee budgets and ensuring promotional discounts are sent only to high-churn-risk, high-value travelers.
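
As a worked illustration of the 3-year CLV definition given under "The Model Equation", the sketch below discounts each year's expected spend back to present value. It is a minimal example under stated assumptions, not the production target pipeline: the function name `clv_3yr`, the per-year input lists, and the 10% discount rate are all hypothetical values chosen for illustration.

```python
from typing import Sequence


def clv_3yr(
    expected_transactions: Sequence[float],
    average_order_values: Sequence[float],
    discount_rate: float,
) -> float:
    """Net present value of expected yearly spend (hypothetical helper).

    Year t (starting at 1) contributes
    expected_transactions[t-1] * average_order_values[t-1] / (1 + discount_rate) ** t.
    """
    if len(expected_transactions) != len(average_order_values):
        raise ValueError("Both inputs must cover the same number of years")
    return sum(
        (n_txn * aov) / (1.0 + discount_rate) ** year
        for year, (n_txn, aov) in enumerate(
            zip(expected_transactions, average_order_values), start=1
        )
    )


# Illustrative customer: 2 expected stays per year at $100 each, 10% discount rate
example = clv_3yr([2.0, 2.0, 2.0], [100.0, 100.0, 100.0], 0.10)
print(round(example, 2))  # 497.37
```

With a zero discount rate the function reduces to the plain sum of expected yearly spend, which is a quick sanity check when validating target values before training.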