Zillow Group Machine Learning Use Cases & Architectural Patterns


In this presentation, Jasjeet Thind, Vice President of Data Science & Engineering at Zillow Group, surveys machine learning use cases including personalization, social and content, ad targeting, and deep learning, and describes architectural patterns for real-time scoring APIs, data collection systems, and more.

  • Zillow Group
  • Machine Learning
  • Data Science
  • Architectural Patterns
  • Deep Learning

Uploaded on Mar 17, 2025 | 0 Views



Presentation Transcript


  1. Machine Learning & Data @ Zillow Group. Jasjeet Thind, Vice President, Data Science & Engineering (@JasjeetThind)

  2. Agenda: Zillow Group; Machine Learning Use Cases; Architectural Patterns; Models; Machine Learning Pipeline; Data Quality; Free Zillow Data

  3. Zillow Group: Build the world's largest, most trusted and vibrant home-related marketplace.

  4. Machine Learning Use Cases: Personalization; Social & Content; Ad Targeting; Demographics & Community; Zestimate (AVMs); Business Analytics; Premier Agent (B2B); Forecasting home price trends; Mortgages; Deep Learning

  5. Architecture
     • Real-Time Scoring APIs (Python, Flask)
     • Data Collection Systems (Java/Python/SQL)
     • Ranking (Spark)
     • Zillow Group Data Lake (AWS - S3 / Kinesis)
     • Featurization (Spark)
     • User Profiles (Spark / HBase)
     • Wedge Counting Collaborative Filtering (Spark)
     • Aggregate Features (Spark)

  6. Architectural Patterns
     Stages: Application (Backstage) [Process]; Transport [Collect]; Data Lake (AWS) [Store]; Serving System / Analytics [Answer]
     Diagram components: Kinesis Firehose; Stream; ZG Data Lake (S3); Application (batch); Application (near real-time); Database (Serving); Analytics; Real-Time Scoring
     Diagram operations: Put object; Get object; Get records; Put NRT records

  7. Machine Learning Models: Random Forest; K-means clustering; Gradient Boosted Machines; K-nearest neighbors; CNN (Deep Learning); Wedge Counting; NLP / TF-IDF / Word2vec / Bag of Words; Linear Regression

  8. Like vs. Dislike: predict homes per user using the behavior of similar users.
     Features:
     • uid: unique id of user
     • pid: property id
     • first_visit: timestamp or 0
     • num_views: sigmoid(#views)
     • time_spent: time on page
     • num_contacts: # leads sent
     • num_saves: # saves on zpid
     • num_shares: # shares on zpid
     • num_photos: # photos viewed
     Labels:
     • Like = user actively engaged with property
     • Dislike = user viewed property but weak engagement
     Diagram: users Spencer and Stan connected to homes priced $19M, $22M and $664K by like (+) and dislike (-) edges, with one edge marked ? as the one to predict.
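
The feature list on slide 8 could be assembled roughly as follows. This is a minimal sketch, not Zillow's code: the `Interaction` record, the `features` and `label` helpers, and in particular the labeling rule (any contact, save, or share counts as a like) are assumptions made for illustration, since the slide only says "actively engaged" versus "weak engagement".

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Interaction:
    """One (user, property) pair with raw engagement counts (hypothetical schema)."""
    uid: str
    pid: str
    first_visit: int = 0      # timestamp, or 0 as on the slide
    views: int = 0
    time_spent: float = 0.0   # time on page
    num_contacts: int = 0     # leads sent
    num_saves: int = 0
    num_shares: int = 0
    num_photos: int = 0


def sigmoid(x: float) -> float:
    """Squash a raw count into (0, 1), as in num_views = sigmoid(#views)."""
    return 1.0 / (1.0 + math.exp(-x))


def features(it: Interaction) -> dict:
    """Build the per-pair feature row named on the slide."""
    return {
        "uid": it.uid,
        "pid": it.pid,
        "first_visit": it.first_visit,
        "num_views": sigmoid(it.views),
        "time_spent": it.time_spent,
        "num_contacts": it.num_contacts,
        "num_saves": it.num_saves,
        "num_shares": it.num_shares,
        "num_photos": it.num_photos,
    }


def label(it: Interaction) -> Optional[int]:
    """Assumed rule: 1 = like, 0 = dislike, None = never viewed (no label)."""
    if it.num_contacts or it.num_saves or it.num_shares:
        return 1
    if it.views > 0:
        return 0
    return None
```

A pair with no views gets no label at all, which keeps unseen properties out of the training set rather than treating them as dislikes.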

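  9. Wedge Count: to form a prediction for every user and property pair, perform a wedge count - http://www.jmlr.org/proceedings/papers/v18/kong12a/kong12a.pdf
     Example: does Stan like $19M?
     Diagram: two example wedges linking Stan to the $19M home through Spencer, with each edge signed + or - and the Stan to $19M edge marked ?. Wedge #3 (wedge03_cnt) passes through the $22M home; wedge #5 (wedge05_cnt) passes through the $664k home.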

  10. Gradient Boosting Classifier: normalize wedge counts for popular users / properties.
     Features: wedge00_cnt, wedge01_cnt, wedge02_cnt, wedge03_cnt, wedge04_cnt, wedge05_cnt, wedge06_cnt, wedge07_cnt, wedge00_norm_cnt, wedge01_norm_cnt, wedge02_norm_cnt, wedge03_norm_cnt, wedge04_norm_cnt, wedge05_norm_cnt, wedge06_norm_cnt, wedge07_norm_cnt
     Prediction for all user / property pairs, e.g. does Stan like the $19M home? Input: the features above for (uid: Stan, pid: $19M).
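
The slide does not say how the counts are normalized. As one possible sketch, the row below divides each raw count by the square root of the product of the user's and the property's interaction counts, so very active users and very popular homes do not dominate. That formula, the `wedge_feature_row` name, and its parameters are assumptions; only the 16 feature names come from the slide. The classifier itself is not shown.

```python
import math


def wedge_feature_row(counts, user_degree, property_degree):
    """Build the 16-column feature row for one (uid, pid) pair.

    counts: the 8 raw wedge counts, index 0..7.
    user_degree / property_degree: number of interactions the user or the
    property has (assumed popularity measure).
    """
    # Assumed normalization: damp counts by the popularity of both endpoints.
    scale = math.sqrt(max(user_degree, 1) * max(property_degree, 1))
    row = {}
    for i, count in enumerate(counts):
        row[f"wedge{i:02d}_cnt"] = count
    for i, count in enumerate(counts):
        row[f"wedge{i:02d}_norm_cnt"] = count / scale
    return row
```

Rows in this shape can be fed to any gradient boosting implementation together with the like/dislike label.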

  11. User Profile
     Signals: website, mobile app, and search queries
     Features (categorical variables):
     • Bath: 0_bath, 0.5_bath, 1_Bath, 1.5_bath, 2_bath, 2.5_bath, 3_bath
     • Bed: 0_bed, 1_bed, 2_bed, 3_bed, 4_bed, 5_bed
     • Price: 100_125_price, 125_150_price, 150_175_price
     • Use Code: condo, single_family, farm_land
     • Zipcode: zip_98109
     Binary classification: labels (like/dislike) same as wedge count model. Training rows have columns pid, uid, features (each 0 or 1, from the list above) and label (0 or 1).
     User profile model determines preference scores, e.g. 0_bed: 0, 1_bed: 0.01, 2_bed: 0.8, 3_bed: 0.6
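
The categorical features on slide 11 can be produced by turning each property attribute into a token and one-hot encoding the tokens. The sketch below assumes price buckets are $25K wide and named in thousands (inferred from names such as 125_150_price); `property_tokens` and `one_hot` are hypothetical helpers, not part of any Zillow API.

```python
def property_tokens(beds, baths, price, use_code, zipcode):
    """Map raw property attributes to categorical tokens like those on the slide."""
    low = int(price // 25_000) * 25          # lower bucket edge, in $K (assumed width)
    return [
        f"{baths:g}_bath",                   # 2.5 -> "2.5_bath", 2 -> "2_bath"
        f"{beds}_bed",
        f"{low}_{low + 25}_price",
        use_code,
        f"zip_{zipcode}",
    ]


def one_hot(tokens, vocabulary):
    """Encode tokens as a 0/1 vector over a fixed, ordered vocabulary."""
    present = set(tokens)
    return [1 if name in present else 0 for name in vocabulary]


vocabulary = ["2_bed", "3_bed", "2.5_bath", "125_150_price", "condo", "zip_98109"]
tokens = property_tokens(beds=2, baths=2.5, price=130_000,
                         use_code="condo", zipcode="98109")
vector = one_hot(tokens, vocabulary)
```

Each 0/1 vector, paired with the like/dislike label for a given user, forms one training row for the binary classifier.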

  12. Ranking
     • Property matrix feature space same as user profile
     • Dot product of property matrix with user profile vector
     • Linear regression with additional features (e.g. age decay)
     Example for uid_0, whose profile over (0_bed, 1_bed, 2_bed, 3_bed) is (0, 0.01, 0.8, 0.6):
       pid     0_bed  1_bed  2_bed  3_bed  score
       pid_0   1      0      0      0      0
       pid_1   0      0      1      0      0.8
       pid_2   1      0      0      0      0
       pid_3   0      0      0      1      0.6
     Output format: (uid, pid) score = {"uId":"10307499", "pId":"1044183744"} 0.3364
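
The dot-product step on slide 12 can be checked with the slide's own numbers. This is a minimal sketch in plain Python rather than Spark; the `score` and `rank` helpers are hypothetical, and the later linear regression with age decay is left out.

```python
FEATURES = ["0_bed", "1_bed", "2_bed", "3_bed"]

# Preference scores for one user, as shown on the slide.
user_profile = {"0_bed": 0.0, "1_bed": 0.01, "2_bed": 0.8, "3_bed": 0.6}

# One-hot property rows in the same feature order.
properties = {
    "pid_0": [1, 0, 0, 0],
    "pid_1": [0, 0, 1, 0],
    "pid_2": [1, 0, 0, 0],
    "pid_3": [0, 0, 0, 1],
}


def score(row, profile_vector):
    """Dot product of one property row with the user profile vector."""
    return sum(x * w for x, w in zip(row, profile_vector))


def rank(properties, profile):
    """Score every property for one user and sort best first."""
    vector = [profile[name] for name in FEATURES]
    scored = [(pid, score(row, vector)) for pid, row in properties.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank(properties, user_profile)
```

The two-bedroom home (pid_1) ranks first and the three-bedroom home (pid_3) second, in line with the profile's preference scores.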

  13. Machine Learning Pipeline: collect user behavior and real-estate data, train the various models, generate the candidate set, and make predictions.
     Diagram components:
     • Recommendations Hashmap (Redis)
     • Spark job creates Hive table with user events (uid, pid) partitioned by date
     • Wedge Counting / User Profile Models
     • pid -> uid reverse index
     • Past and current user events
     • User Behavior (Kinesis / S3)
     • User Store
     • Score (Spark)
     • Models (Python)
     • Event API (Java)
     • Filter (Spark)
     • Wedge features or property features (user profile)
     • Public Record (Kinesis / S3), Producer (Python)
     • Train Models (Spark), Training Data (Spark), Training Set (S3)
     • Property Data
     • Active Listings (Kinesis / S3), Producer (Python)
     • Scoring Data (Spark), Scoring Set (S3)
     • Listing Data
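
The "pid -> uid reverse index" on slide 13 can be illustrated in a few lines. The sketch below builds the index from (date, uid, pid) events and uses it to generate a candidate set for one user. The candidate rule (properties seen by users who share at least one property with the target user) is an assumption, as are the function names and the sample events; the real pipeline runs in Spark.

```python
from collections import defaultdict


def build_indexes(events):
    """events: iterable of (date, uid, pid) rows, e.g. read from a date partition."""
    pid_to_uids = defaultdict(set)   # reverse index: property -> users
    uid_to_pids = defaultdict(set)   # forward index: user -> properties
    for _date, uid, pid in events:
        pid_to_uids[pid].add(uid)
        uid_to_pids[uid].add(pid)
    return uid_to_pids, pid_to_uids


def candidates(uid, uid_to_pids, pid_to_uids):
    """Properties seen by similar users that this user has not seen yet."""
    seen = uid_to_pids.get(uid, set())
    found = set()
    for pid in seen:
        for other in pid_to_uids[pid]:
            if other != uid:
                found |= uid_to_pids[other]
    return sorted(found - seen)


# Hypothetical events for illustration.
events = [
    ("2017-01-01", "stan", "p1"),
    ("2017-01-01", "spencer", "p1"),
    ("2017-01-02", "spencer", "p2"),
]
uid_to_pids, pid_to_uids = build_indexes(events)
```

Only the candidate pairs produced this way need to be featurized and scored, which keeps the scoring set far smaller than all user and property combinations.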

  14. Data Quality: analytical pipelines that measure
     • Data integrity
     • Attributes / outlier detection
     • Missing data
     • Expected # of records
     • Latency
     • Models - expected data
     Build reports / alerts that drive action.
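
Several of the checks on slide 14 (expected number of records, missing data, out-of-range attributes) reduce to simple counts over a batch. The sketch below is an illustration under assumed names and thresholds; `quality_report`, the `price` field, and the range bounds are hypothetical.

```python
def quality_report(rows, required_fields, expected_rows, price_range):
    """Summarize basic data quality for one batch of dict records."""
    low, high = price_range
    report = {
        # Expected # of records: did the batch arrive at roughly full size?
        "row_count_ok": len(rows) >= expected_rows,
        # Missing data: how many records lack each required field?
        "missing": {
            field: sum(1 for row in rows if row.get(field) is None)
            for field in required_fields
        },
        # Outlier detection (simple range check on one attribute).
        "outliers": [
            row for row in rows
            if row.get("price") is not None and not (low <= row["price"] <= high)
        ],
    }
    # A single flag that a report or alert can key on.
    report["alert"] = (
        not report["row_count_ok"]
        or any(report["missing"].values())
        or bool(report["outliers"])
    )
    return report


rows = [
    {"pid": "p1", "price": 350_000},
    {"pid": "p2", "price": None},
    {"pid": "p3", "price": 19},
]
report = quality_report(rows, ["pid", "price"], expected_rows=3,
                        price_range=(10_000, 50_000_000))
```

The `alert` flag is the piece that "drives action": it can gate a downstream training job or trigger a notification.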

  15. Free Zillow Data: Zillow.com/data
     Time Series: national, state, metro, county, city and ZIP code levels
     • Zillow Home Value Index (ZHVI): Top / Middle / Bottom Thirds; Single Family / Condo / Co-op; Median Home Value Per Sq Ft
     • Zillow Rent Index (ZRI): Multi-family / SFR / Condo / Co-op; Median ZRI Per Sq ft; Median Rent List Price
     • Other Metrics: Median List Price; Price-to-Rent ratio; Homes Foreclosed; For-sale Inventory / Age Inventory; Negative Equity; and many more
     ZTRAX: Zillow Transaction and Assessment Dataset
     • Previously inaccessible or prohibitively expensive housing data for academic and institutional researchers FOR FREE.
     • More than 100 gigabytes
     • 374 million detailed public records across more than 2,750 U.S. counties
     • 20+ years of deed transfers, mortgages, foreclosures, auctions, property tax delinquencies and more for residential and commercial properties.
     • Assessor data including property characteristics, geographic information, and prior valuations on approximately 200 million parcels in more than 3,100 counties.
     • Email ZTRAX@zillow.com for more information

  16. Thank you!
     Related Blogs: Zillow.com/data-science; Trulia.com/blog/tech/
     Hiring: Machine Learning Engineer; Data Scientist; Product Manager; Data Engineer
