Take Home Exercise 3

Author

Cheryl Jeanne Chiew

Published

February 5, 2023

Modified

March 24, 2023

1. Introduction

Take-home Exercise 3 will seek to uncover the salient patterns of the resale prices of public housing property by residential towns and estates in Singapore by using appropriate analytical visualisation techniques learned in Lesson 4: Fundamentals of Visual Analytics. The dataset selected for this exercise is sourced from Data.gov.sg and contains data on the Resale Flat Prices based on Registration Date from Jan 2017 onwards.

For the purpose of this study, only 3-ROOM, 4-ROOM and 5-ROOM flat types will be used for analysis. The study period is conducted based on transactions recorded in 2022.

2. Data Visualization

Installing and Loading R Packages

pacman::p_load(readr, dplyr, tidyverse, ggstatsplot, performance, parameters, ggthemes) 

Importing Data

resale_data <- read_csv("data/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")

Data Preparation

#<< Data Wrangling of Dataset
resale_filtered <- resale_data %>%                                                #<< filter only for 3 / 4 / 5 Room Flat Types
  filter(flat_type == "3 ROOM" | 
           flat_type == "4 ROOM" | 
           flat_type == "5 ROOM") %>%
  
  filter(month < 2023 & month >= 2022) %>%                                        #<< filter only for transactions in year 2022
  
  separate(flat_type,c("flat_type","NA1"), sep=" ") %>%                           #<< Transform flat_type: only retain first character for flat_type
  select(-c(NA1)) %>%
  
  separate(storey_range,c("storey_low","storey_high"), sep=" TO ") %>%            #<< Get median floor based on storey range
  mutate(storey_mean = ((as.numeric(storey_high)-as.numeric(storey_low))/2) +
           as.numeric(storey_low)) %>% 
  select(-c(storey_low,storey_high)) %>%
  
  separate(remaining_lease,                                                       #<< split remaining_lease column with 'space' as a delimiter
           c("remaining_lease","NA1","remaining_months","NA2"), sep = " ") %>% 
  
  mutate(remaining_months = ifelse(is.na(remaining_months), 0, remaining_months), #<< Converting delimited fields for remaining lease into numeric attribute
         remaining_months = as.numeric(remaining_months)/12,
         remaining_lease = as.numeric(remaining_lease)+remaining_months) %>%
 
  select(-c(NA1,NA2,remaining_months)) %>%                                        #<< drop NA1 / NA2 / remaining_months columns
 
  rename(txn_month = month) %>%                                                   #<< rename column for clarity

  separate(txn_month, c("txn_month","NA1"), sep = "-") %>%                        #<< split txn_month column with '-' as a delimiter
  
  mutate(NA1 = as.numeric(NA1)/12,                                                #<< Converting delimited fields for remaining lease into numeric
         txn_month = as.numeric(txn_month)+NA1) %>%
  select(-c(NA1))                                                                 #<< drop NA1 columns

#<< Derive Summarized Values: SD / Mean / Count
resale_grouped <- resale_filtered %>%  
  select(town, flat_type, resale_price) %>%              
  group_by(town, flat_type)  %>%                
  summarise(SD_Resale_Px=sd(resale_price),
            Mean_Resale_Px=mean(resale_price),
            No_of_txn=n(),
            .groups = 'drop')

#<< Left join back to main dataset to get full data
resale_filtered <- left_join(resale_filtered, 
                              resale_grouped,
                              by = c("town" = "town", "flat_type" = "flat_type"))

Analytical Visualization #1 - Box Violin Plot

Selection and Design Considerations

This visualization uses the ggbetweenstats() method in the ggstatsplot package to conduct a One-way ANOVA test on Resale Prices by Flat Type.

• A combination of box and violin plots along with jittered data points for between-flat type designs is used – side-by-side display of plots is able to give the user an immediate visual comparison of the differences in resale prices across flat types.

• Statistical details included in the plot as a subtitle gives the user quantitative validation of the One-way ANOVA test outcomes.

• Y-axis Resale prices has been rounded to thousands to facilitate the user’s ease of reading.

• Outlier label is turned on to identify the towns where outlier prices have been recorded.

Reproducible Description Data Viz Prep

The code chunk below contains the codes required to reproduce this visualization, along with the corresponding annotations which describes the procedures used.

set.seed(123)

resale_filtered1 <- resale_filtered %>% 
  select(resale_price, town, flat_type) %>%
  mutate(resale_price = resale_price/1000)

ggbetweenstats(
  data                 = resale_filtered1,                                              #<< dataframe
  x                    = flat_type,                                                     #<< grouping / independent variable
  y                    = resale_price,                                                  #<< dependent variables
  xlab                 = "Resale Price (in S$'000)",                                    #<< label for the x-axis
  ylab                 = "Flat Type",                                                   #<< label for the y-axis
  title                = "Comparison of Resale Price (in S$'000) by Flat Type (2022)",  #<< Plot Title
  caption              = "Source: Data.gov.sg",                                         #<< Caption
  type                 = "p",                                                           #<< type of statistical test: parametric
  mean.plotting        = TRUE,                                                          #<< whether the mean is to be displayed
  outlier.tagging      = TRUE,                                                          #<< whether outliers should be flagged
  outlier.coef         = 3,                                                             #<< outlier coeff - configured to display extreme outliers
  outlier.label        = town,                                                          #<< label to attach to outlier values
  outlier.label.color  = "red",                                                         #<< outlier point label color
  messages             = FALSE                                                          #<< turn off messages
) 

Insights Revealed by Data Viz

Resale prices are currently over-valued across all flat types.

From the plot, we are able to see that the distribution is positively skewed across all 3 flat types, where the mean > median. This means that the data constitutes a higher frequency of high valued resale transaction prices.

4-Room and 5-Room Flats are more over-valued

The mean, identified by the red dot in the plot, appears to be further away for 4-Room and 5-Room flats vs 3-Room flat, which implies the higher price premium imposed on the demand for bigger houses.

Resale prices across flat types is statistically different

Given that the p-value is smaller than the alpha level, the null hypothesis is rejected and we can statistically validate that there are indeed differences in the resale prices across flat types.

Pricing in of location premium

Looking that the extreme outlier labels displayed in the plot, we can also see how the location premium effect is priced into the resale transaction prices. For example, a 4 Room HDB flat in the central area can cost more than most 5 Room HDB flats.

Analytical Visualization #2 - Dot Plot

Selection and Design Considerations

This visualization uses the ggdotplotstats() method in the ggstatsplot package to conduct a Bootstrap-t method for one-sample test. The output is visualized in a dot plot using the Cleveland dot plot.

• Displaying in a dot plot makes the labels easier to read, as the towns are sorted in percentile terms, thereby reducing graph clutter.

• The mean labels are clearly defined, allowing the user a basis to easily identify overvalued/undervalued towns relative to the mean.

• Plots displayed side-by-side allows the user a visual comparison of the transacted prices relative to the transaction count across towns.

• Statistical details included in the plot as a subtitle gives the user quantitative validation of the one-sample test outcomes.

Reproducible Description Data Viz Prep

The code chunk below contains the codes required to reproduce this visualization, along with the corresponding annotations which describes the procedures used.

#<< selecting relevant fields for plot and rounding resale price to thousands
resale_filtered2 <- resale_filtered %>% 
  select(resale_price, town, flat_type, No_of_txn) %>% 
  mutate(resale_price = resale_price/1000) %>% 
  distinct()

#<< derive average txn prices and txn count by towns
resale_filtered3 <- resale_filtered %>%
  select(resale_price, town, flat_type, No_of_txn) %>%             
  group_by(town)  %>%                
  summarise(avg_Px=mean(resale_price),
            avg_Txn_Count=mean(No_of_txn),
            .groups = 'drop')  %>%
  mutate(avg_Px = avg_Px/1000) 


set.seed(123)

p1 <- ggdotplotstats(
            data         = resale_filtered2,                                              #<< transformed dataset
            y            = town,                                                          #<< selecting y-axis (i.e., town)
            x            = resale_price,                                                  #<< selecting x-axis (i.e., resale price)
            type         = "robust",                                                      #<< selecting statistical approach (Bootstrap-t method for one-sample test)
            title        = "Mean Resale Prices of 3/4/5 Rm HDB by Towns",                 #<< Plot title
            xlab         = "Resale Price (in S$'000)",                                    #<< x-axis label
            caption      = "Source: Data.gov.sg",                                         #<< Caption
            ggtheme      = theme_economist()                                              #<< setting theme of plot

)

p2 <- ggdotplotstats(
            data         = resale_filtered2,                                              #<< transformed dataset
            y            = town,                                                          #<< selecting y-axis (i.e., town)
            x            = No_of_txn,                                                     #<< selecting x-axis (i.e., no. of txn)
            type         = "robust",                                                      #<< selecting statistical approach (Bootstrap-t method for one-sample test)
            title        = "No. of 3/4/5 Rm HDB Transactions by Towns",                   #<< Plot title
            xlab         = "Transaction Count",                                           #<< x-axis label
            caption      = "Source: Data.gov.sg",                                         #<< Caption
            ggtheme      = theme_economist()                                              #<< setting theme of plot
) 

p3 <- ggscatterstats(
            data         = resale_filtered3,                                              #<< transformed dataset
            y            = avg_Px,                                                        #<< selecting y-axis (i.e., avg_Px)
            x            = avg_Txn_Count,                                                 #<< selecting x-axis (i.e., avg_Txn_Count)
            xlab         = "Average Transaction Count",                                   #<< x-axis label
            ylab         = "Resale Price (in S$'000)",                                    #<< y-axis label
            marginal = FALSE,
            ggtheme      = theme_economist()                                              #<< setting theme of plot
)


(p1+p2)/p3

Insights Revealed by Data Viz

Towns with the highest location premium

The plot is able to clearly display towns with the highest location premium – Central Area, Bishan, Queenstown, Bukit Merah, etc.

Towns with higher demand

The plot is able to clearly display towns with higher demands (proxied by the number of transactions recorded within the towns in 2022) – Seng Kang, Punggol, Yishun, Woodlands, etc.

Expected Resale Prices to Pay

From the confidence internal, we are 95% confident that a given resale transaction price will fall between the range of $538k and $610k.

Medium Correlation between resale prices and transaction count

The degree of correlation Pearson’s Correlation Coefficient is -0.43, which indicates the correlation between resale prices and transaction count to be at a medium level. This is also observed in the plot where the reference line appears to be slightly tilted downwards.

Analytical Visualization #3 - Scatter Plot

Selection and Design Considerations

This visualization uses the ggscatterstats() method in the ggstatsplot package to evaluate the association between storey level (i.e., floor level of HDB unit) and resale prices.

• A scatterplot is used to depict the relationship between the 2 selected variables.

• A reference line, along with 95% CI cone has been included, which allows the user to visually identify the linear direction of the relationship.

• Statistical details have also been included in the plot itself.

• Distribution plots that depict the distribution of the individual attributes are also available to complement the main scatterplot.

Reproducible Description Data Viz Prep

The code chunk below contains the codes required to reproduce this visualization, along with the corresponding annotations which describes the procedures used.

#<< selecting relevant fields for plot and deriving average txn prices by towns
resale_filtered4 <- resale_filtered %>%
  select(resale_price, storey_mean, flat_type) %>%             
  group_by(storey_mean, flat_type)  %>%                
  summarise(avg_Px=mean(resale_price),
            .groups = 'drop')  %>%
  mutate(avg_Px = avg_Px/1000) 

set.seed(123)

ggscatterstats(
          data             = resale_filtered4,                                              #<< transformed dataset
          x                = avg_Px,                                                        #<< selecting x-axis (i.e., avg_Px)
          y                = storey_mean,                                                   #<< selecting y-axis (i.e., storey mean)
          ggtheme          = theme_grey(),                                                  #<< setting theme of plot
          xlab             = "Resale Price (in S$'000)",                                    #<< x-axis label
          ylab             = "Storey Level",                                                #<< x-axis label
          title            = "Mean Resale Prices of 3 / 4 / 5 Room HDB by Storey Level"     #<< Plot title
)

Insights Revealed by Data Viz

High Positive Association between Storey Level and Resale Prices

From the plot, it can be visually noted that the reference line appears to be clearly directed upwards. Complementing this observation with statistical metrics generated from Pearson’s correlation test (r = 0.82) revealed that, across all towns and flat types, the storey height was positively correlated with resale prices. The p-value (p < 0.05) was also confirmed the statistical details to be statistically significant.

Storey Level Premium Priced In

This is indicative that a storey level premium is also another factor priced into the HDB resale transaction prices (i.e., the higher the storey level of the HDB, the higher the resale transaction price is likely to be).

Analytical Visualization #4 - Multiple Linear Regression

Selection and Design Considerations

The Multiple Linear Regression model has been selected because of its simplicity and its performance, allowing us to easily understand how each variable affects the resale price. Several model diagnostic checks from the performance package have also been conducted and visualized. Finally, ggcoefstats() of ggstatsplot package is used to visualise the parameters of the regression model.

• Visualizing collinearity – plotting highly correlated variables using the check_collinearity() function allow the user to easily identify variables which may suffer from multi-collinearity, for the purposes of calibrating the model.

• The other model diagnostic checks have also been ran as a form of sanity check prior to building the finalized regression model.

• The regression model output is visualised using a dot-and-whisker plot, which allows the user to visually identify how each variable impacts the resale prices, along with other statistical details to complement.

Reproducible Description Data Viz Prep

The code chunk below contains the codes required to reproduce this visualization, along with the corresponding annotations which describes the procedures used.

Model Diagnostic: Multi-Collinearity Check

#<< lm() function used from Base Stats of R
#<< all attributes selected in first model version

model <- lm(resale_price ~ flat_type + floor_area_sqm + lease_commence_date + 
              remaining_lease + storey_mean + SD_Resale_Px + Mean_Resale_Px + No_of_txn,
            data = resale_filtered)

#<< running of collinearity check

check_collinearity(model) %>% plot()

New Model Diagnostic: Complete check

#<<dropped flat_type and lease_commence_date attributes
model1 <- lm(resale_price ~ #<<flat_type + 
              floor_area_sqm + #<<lease_commence_date + 
               remaining_lease + storey_mean + SD_Resale_Px + Mean_Resale_Px + No_of_txn,
            data = resale_filtered)

#<< running of model diagnostic checks
check_model(model1)

Visualizing Regression Model

ggcoefstats(model1, 
            output = "plot")

Insights Revealed by Data Viz

Removal of variables with high collinearity

Flat type and lease commencement date were identified to be variables with high collinearity, which may distort the model performance. As such, these were removed in the re-calibrated model.

Significant Attributes Contributing to the model

The remaining attributes are validated to be statistically significant (p-value < 0.05): floor_area_sqm, remaining_lease, storey_mean, SD_Resale_Px, Mean_Resale_Px, No_of_txn. The remaining lease and floor level of HDB unit appears to be top variables in determining the HDB resale prices. For instance, for every 1 year increase in remaining lease of the property, the resale price is expected to increase by ~$3698, while for every 1 level increase in floor level, the resale price is expected to increase by ~$4967.

Low AIC and BIC

The low AIC and BIC values are indicative of a good overall model fit.