Tuesday, April 28, 2015

Regression Analysis

Part 1:


Question: A study on crime rates and poverty was conducted for Town X. The local news station got hold of some data and claimed that as the number of kids who get free lunches increases, so does crime. You feel this might be a bit silly, so you run a regression with the given data. Using SPSS to do your regression, determine whether the news is correct. What percentage of persons will get free lunch with a crime rate of 79.7? How confident are you in your results? Provide the data and a response addressing these questions.

In this particular scenario, the null hypothesis states that there is no linear relationship between free lunches and crime rate.
On the other hand, the alternate hypothesis states that there is a linear relationship between the number of free lunches given out at schools and the crime rate. 


Figure 1 shows the results of the linear regression performed in SPSS. Using the Unstandardized Coefficient values we are able to determine the regression equation. In this case the equation constant is equal to 21.819 and the regression coefficient is equal to 1.685. Therefore, based on the data provided, there does appear to be a positive linear relationship between the number of free lunches and crime rate.

In order to determine the percentage of people who will be getting a free lunch if there is a crime rate of 79.7, we need to solve the regression equation for x as shown below.
y=a+bx
where,
y=79.7
a=21.819
b=1.685
So,
79.7=21.819+1.685x
and,
x=(79.7-21.819)/1.685=34.351


Therefore, if there is a crime rate of 79.7, the expected percentage of people getting free lunches will be about 34.351.
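The same arithmetic can be checked with a few lines of Python. This is only a sketch: the function name solve_for_x is made up for illustration, and the constant and coefficient are the values read from the SPSS output described above.

```python
def solve_for_x(y, intercept, slope):
    """Invert y = intercept + slope * x to recover x for a given y."""
    if slope == 0:
        raise ValueError("slope must be non-zero to solve for x")
    return (y - intercept) / slope


# Coefficients as reported in the SPSS output discussed in the text.
a = 21.819  # equation constant
b = 1.685   # regression coefficient

x = solve_for_x(79.7, a, b)
print(round(x, 3))
```

Rounded to three decimals, the result agrees with the hand calculation.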


Figure 1  Results of the linear regression performed in SPSS. 



    The coefficient of determination (R-squared) is extremely low in this case, which signifies that the model is not a good predictor of what is actually occurring. However, when plotting the data on a scatterplot (Figure 2) one can see that the majority of the points have small residuals. There is one exception that causes the coefficient of determination to be so small. If that particular sample had not been collected, or if more samples had been collected in total, the effect of that one outlier would not have played such a role in reducing the R-squared value.
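To illustrate how much a single outlier can pull down R-squared, here is a small sketch. The data points are invented for the example and are not the Town X data; for a one-variable linear regression, R-squared equals the square of Pearson's r, which is what the function computes.

```python
def r_squared(xs, ys):
    """R-squared of a simple linear regression of ys on xs (equals r squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical data: eight points close to a straight line...
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.1]
r2_clean = r_squared(xs, ys)

# ...plus one made-up outlier far below the trend.
r2_with_outlier = r_squared(xs + [9], ys + [1.0])

print(f"without outlier: {r2_clean:.3f}")
print(f"with outlier:    {r2_with_outlier:.3f}")
```

With so few samples, the one extra point is enough to change the fit from nearly perfect to very weak, which is the situation described above.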

Figure 2  Scatterplot showing the residuals, strength of the model, and the regression equation of the trend line. 


Part 2:

Introduction

    It is important to determine where students are attending school and the factors that aid in their decision. This can be used by each university to try to increase attendance in certain areas. This exercise is aimed at determining significant variables that affect attendance at UW-Eau Claire and UW-Platteville. The main variables that are used include population/distance from each university, percent of each county with a BS degree, and median household income. One would think that all three of these variables play a large role in determining where students will choose to go to college. Significance of each variable is determined by running linear regression of the variables for the two schools. The residuals for each significant variable will then be mapped to show which areas have higher than expected attendance.


Methods

    Regression analysis will be used to determine which variables play a significant role in determining where students go for their undergraduate education. In this exercise several variables are provided. The variables provided include the population attending each University of Wisconsin branch from each county, the distance of each county from the individual universities, the total population of each county, the percent of the population with a BS degree for each county, the number of residents between the ages of 18-24 per county, and the median household income for each county.

    The first step in the regression analysis is to normalize the population data. This is done by dividing the total number of residents per county by the distance away from the individual university. This process is intended to remove any bias for counties with large populations. Next, the regression analysis will be performed in SPSS. For two different schools (UW-Eau Claire and UW-Platteville), the population attending the university will be set as the dependent variable and regression will be run for the population/distance, percent BS degree, and median household income variables (six regressions in total). This regression will provide details about which variables are significant at the 95% confidence level (Figures 1-6).
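The normalization and the regression fit can be sketched in Python as well. This is not the SPSS procedure itself: the county figures are invented, the function names are only for illustration, and the sketch returns just the constant and the coefficient (SPSS also reports the significance value, which is not computed here).

```python
def normalize_by_distance(population, distance):
    """Return population divided by distance; distance must be positive."""
    if distance <= 0:
        raise ValueError("distance must be positive")
    return population / distance


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical counties: (total population, distance to campus, students enrolled)
counties = [
    (98000, 10.0, 950),
    (45000, 60.0, 120),
    (130000, 90.0, 260),
    (20000, 150.0, 15),
]

pop_dist = [normalize_by_distance(pop, dist) for pop, dist, _ in counties]
students = [enrolled for _, _, enrolled in counties]

intercept, slope = fit_line(pop_dist, students)
print(f"students = {intercept:.2f} + {slope:.4f} * (population/distance)")
```

The guard on distance matters in practice, since a county that contains the campus could otherwise end up with a distance of zero.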

   After determining which variables play a significant role in university enrollment, another regression analysis will be performed. This time, the values of the residuals will be added to the tabular data. The table will then be exported into ArcMap. Finally, the residuals for the four significant variables will be mapped (Figures 7-10).
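A residual is simply the observed value minus the value predicted by the regression equation, so a positive residual marks a county that sends more students than expected. The sketch below shows the calculation; the county names, values, constant, and coefficient are all placeholders rather than results from this exercise.

```python
def residuals(xs, ys, intercept, slope):
    """Observed minus predicted value for each observation."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical values; the intercept and slope would come from the fitted model.
names = ["County A", "County B", "County C"]
pop_dist = [100.0, 400.0, 900.0]
students = [30, 95, 160]
intercept, slope = 20.0, 0.17

for name, res in zip(names, residuals(pop_dist, students, intercept, slope)):
    label = "above expected" if res > 0 else "at or below expected"
    print(f"{name}: residual {res:+.1f} ({label})")
```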


Results

   After the regression analysis was performed on the three variables for each school, the variables that play a significant role are clear. For the University of Wisconsin-Eau Claire, median household income has a significance value of .104 and is therefore not statistically significant at the 95% confidence level (Figure 1). The percent population of each county with a BS degree does play a significant role in attendance at UWEC with a significance value of .003 (Figure 2). Finally, population/distance is highly significant in determining the enrollment at UWEC (Figure 3).


Figure 1  Regression analysis results based on the number of students who attend UWEC from each county based on the median household income of each county.


Figure 2  Regression results for the number of students attending UWEC from each county based on the percent population of each county with a BS degree.


Figure 3  Regression results for the number of students who attend UWEC from each county based on the population/distance of each county away from UWEC. 



   UW-Platteville has similar results to UWEC. For UW-Platteville, median household income does not play a significant role in determining where students attend college, based on the 95% confidence level. If the results were based on a 90% confidence level, median household income would be a significant factor (Figure 4). The other two variables examined do, however, play a significant role at the 95% confidence level. The percent of population with a BS degree has a significance value of .019, which means that the percent of population with a BS degree in each county does play a role in attendance at UW-Platteville (Figure 5). Also, population/distance has a significance value of .000, which means that it is highly significant in determining whether or not students will attend UW-Platteville (Figure 6).

Figure 4  Regression results for the number of students who attend UW-Platteville from each county based on that county's median household income.


Figure 5  Regression results for the number of students who attend UW-Platteville from each county based on the percent of each county's population with a BS degree.


Figure 6  Regression results for the number of students who attend UW-Platteville from each county based on the population/distance of each county from the university.



   After determining which variables play a significant role in whether students will attend a specific university, it is important to create a graphic representation of how each county compares to what the regression predicts. This is done by looking at the residuals for each county based on the regression equation for each variable. In Figure 7, Marathon County, Brown County, Dane County, and Waukesha County all have higher attendance at UWEC compared to what is expected based on population/distance. On the other hand, there are a lot of counties that have a smaller representation at UWEC than expected based on population/distance. Overall, this map shows areas that have high representation at UWEC even though they are located farther away from UWEC.

Figure 7  Attendance at UWEC shown as residuals based on the regression equation for population/distance from UWEC.


   Figure 8 represents the residuals associated with attendance at UWEC per county based on the percent of that county with a BS degree. Eau Claire County has a much higher representation at UWEC compared to what would be expected for the percent of population with a BS degree. Similarly, counties surrounding Eau Claire County also have a higher number of attendees at UWEC compared to what is expected. This is possibly caused by a lack of willingness among people in small towns with few BS degrees to travel farther distances to go to different schools.

Figure 8 Representation of residuals for attendance at UWEC based on expected number of attendees and percent of each county with BS degrees. 

While UWEC has a large number of attendees from counties far away from Eau Claire County, UW-Platteville has very few people travelling long distances to attend. As you can see from Figure 9, there are only three counties that have more attendees than expected based on the regression equation. In addition, most of the counties far away from UW-Platteville have lower attendance per county than would be expected. This suggests that people aren't willing to travel long distances to attend UW-Platteville.


Figure 9  Mapping residuals associated with expected attendance based on population/distance from UW-Platteville. 

Similarly to population/distance, UW-Platteville has a higher attendance than expected in counties adjacent to UW-Platteville based on the percent of each county with a BS degree. Areas farther away from UW-Platteville have a lower representation at the university compared to what is expected based on the regression equation. From this information, counties that have a higher percent with a BS degree are not likely to have high attendance at UW-Platteville unless they are located really close to the university.

Figure 10  Mapping residuals for expected attendance at UW-Platteville for each county based on the percent with BS degree.


Conclusion

   Overall, it appears that the quality of a school may be an indicator of how far people will travel to attend, although school quality was not one of the variables in the regression. One possible explanation for the residual maps is that UW-Eau Claire is seen as a better school than UW-Platteville, so people are willing to travel from all over the state to attend it, while fewer people are willing to travel as far for UW-Platteville. When looking at the percent of population of each county with a BS degree, both UWEC and UW-Platteville have higher than expected attendance from counties adjacent to the universities. This suggests that people who come from areas with a low BS degree percentage are less likely to travel farther distances to go to college.









Thursday, April 2, 2015

Correlation and Spatial Autocorrelation

Part 1: Correlation

1.

In this portion of the exercise we are looking at the correlation between distance and sound level. From the given table (Figure 1) we are able to create a scatterplot to better visualize the connection between distance and sound level (Figure 2). After the scatterplot is created we are going to import the table into SPSS and create a Pearson's Correlation matrix (Figure 3).



Figure 1  Excel table with the distance compared to sound level.


Figure 2  Scatterplot showing the relationship between distance and sound level.






Correlations

                                     distance ft     sound level dB
distance ft     Pearson Correlation  1               -.896**
                Sig. (2-tailed)                      .000
                N                    10              10
sound level dB  Pearson Correlation  -.896**         1
                Sig. (2-tailed)      .000
                N                    10              10

**. Correlation is significant at the 0.01 level (2-tailed).

 
Figure 3  Pearson's correlation matrix showing the correlation
between distance and sound level. As you can see from the Pearson
Correlation value of -.896 there is a strong negative correlation.


 
In this example the null hypothesis states that there is no linear relationship between distance and sound level. The alternative hypothesis states that there is a linear relationship between distance and sound level.
As you can see from the correlation matrix, there is a very high negative correlation between distance and sound level of -.896, meaning that the farther away from the source, the lower the sound level will be. This r value of -.896 is significant at the .01 level. This means that the null hypothesis is rejected, and that there is a linear relationship between distance and sound level.
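The significance of r can also be checked by hand with the usual t-test for a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)). The sketch below applies it to the reported r of -.896 with n = 10. The function names are illustrative, and the critical value of 3.355 is the standard two-tailed t value for the .01 level with 8 degrees of freedom, taken from a t-table rather than from the SPSS output.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for testing whether a correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Values reported in the correlation matrix above.
r, n = -0.896, 10
t = t_statistic(r, n)

T_CRITICAL = 3.355  # two-tailed, alpha = .01, df = n - 2 = 8 (from a t-table)
print(f"t = {t:.3f}, reject null: {abs(t) > T_CRITICAL}")
```

pearson_r is included so the same check can be run from raw distance and sound level values instead of the reported r.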



2.

In this section we are given census tract data for Milwaukee County, WI. This data includes percent white, percent black, percent Hispanic, percent with no high school diploma, percent with a Bachelor's Degree, percent below the poverty line, and percent that walk to work.



Correlations

                                 PerWhite    PerBlack    PerHis      NO_HS       BS          BELOW_POVE  Walk
PerWhite    Pearson Correlation  1           -.887**     -.218**     -.532**     .650**      -.767**     .028
            Sig. (2-tailed)                  .000        .000        .000        .000        .000        .630
            N                    307         307         307         307         307         307         306
PerBlack    Pearson Correlation  -.887**     1           -.246**     .171**      -.503**     .668**      -.050
            Sig. (2-tailed)      .000                    .000        .003        .000        .000        .386
            N                    307         307         307         307         307         307         306
PerHis      Pearson Correlation  -.218**     -.246**     1           .759**      -.320**     .182**      .029
            Sig. (2-tailed)      .000        .000                    .000        .000        .001        .616
            N                    307         307         307         307         307         307         306
NO_HS       Pearson Correlation  -.532**     .171**      .759**      1           -.559**     .501**      .050
            Sig. (2-tailed)      .000        .003        .000                    .000        .000        .384
            N                    307         307         307         307         307         307         306
BS          Pearson Correlation  .650**      -.503**     -.320**     -.559**     1           -.521**     .081
            Sig. (2-tailed)      .000        .000        .000        .000                    .000        .157
            N                    307         307         307         307         307         307         306
BELOW_POVE  Pearson Correlation  -.767**     .668**      .182**      .501**      -.521**     1           .354**
            Sig. (2-tailed)      .000        .000        .001        .000        .000                    .000
            N                    307         307         307         307         307         307         306
Walk        Pearson Correlation  .028        -.050       .029        .050        .081        .354**      1
            Sig. (2-tailed)      .630        .386        .616        .384        .157        .000
            N                    306         306         306         306         306         306         306

**. Correlation is significant at the 0.01 level (2-tailed).


Figure 4  Correlation matrix created based on the Milwaukee County data.

 

As you can see from the correlation matrix above, there are several strong correlations in the data (Figure 4). First off, as the percent of people with no high school diploma increases, the percent of people with a Bachelor's Degree decreases (-.559). There is also an increase in the percent of people below the poverty line (.501). The percent of people who walk to work, however, is not significantly correlated with the percent with no high school diploma (.050); the only significant correlation for walking to work is with percent below the poverty line (.354). Unfortunately, there is a really strong relationship between percent Hispanic and percent with no high school diploma (.759). There is also a weak but significant positive correlation between percent black and percent with no high school diploma (.171). On the other hand, there is a strong positive correlation between percent white and percent with a Bachelor's Degree (.650).


Part 2: Spatial Autocorrelation



Introduction

   In order to determine clustering patterns of voting habits throughout the state of Texas, it is important to look at the spatial autocorrelation of percent democratic vote and the percent of voter turnout. This exercise aims to determine where clustering of the two variables exists and how it has changed between 1980 and 2008.


Methods

    While most of the data is provided by the Texas Election Commission, the percent Hispanic population per county must be downloaded from the U.S. Census Bureau. The voting pattern table must then be joined to the Texas county shapefile. After the table is joined it must be exported as a .dbf file and imported into GeoDa.
    The first step in determining the clustering of Texas voting patterns is to create a Moran's I scatterplot for each variable provided. The Moran's I scatterplot, determined by Equation 1, is a visual representation of the value of a variable in one county compared to the values in surrounding counties.

    Eq. 1

The Moran's I scatterplot also provides a value that is a good indicator of spatial autocorrelation. If the Moran's I value is close to 0, there is no spatial autocorrelation in that variable, meaning the variable is randomly distributed. Positive values indicate that similar values cluster together, and negative values indicate that neighboring values tend to differ. Although the Moran's I value tells us how much clustering there is, a map must be created to show where the spatial autocorrelation is occurring. This can be done using the Univariate LISA tool in GeoDa.
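For reference, the global Moran's I statistic is commonly written as the number of areas divided by the sum of all spatial weights, multiplied by the weighted cross-products of deviations from the mean divided by the sum of squared deviations. The sketch below implements that common form on a toy example. It is not GeoDa's code: the four-unit "chain" of neighbors and the values are invented, and the simple 0/1 weights used here may differ from the weights file used in GeoDa.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weights matrix."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]                     # deviations from the mean
    s0 = sum(sum(row) for row in weights)              # sum of all weights
    cross = sum(weights[i][j] * z[i] * z[j]
                for i in range(n) for j in range(n))   # weighted cross-products
    denom = sum(d * d for d in z)                      # sum of squared deviations
    return (n / s0) * (cross / denom)


# Hypothetical layout: four areas in a row, each neighboring the next one.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

clustered = morans_i([1, 2, 3, 4], chain)    # similar values sit next to each other
alternating = morans_i([1, 4, 1, 4], chain)  # neighbors always differ

print(f"clustered:   {clustered:.3f}")
print(f"alternating: {alternating:.3f}")
```

The first arrangement gives a positive value and the second a negative one, matching the interpretation of the statistic given above.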

    Finally, in order to determine the strength of relationships between variables a correlation matrix must be created using SPSS. The correlation matrix compares each variable together and provides a Pearson's Correlation value. This value is important for determining the strength and direction in which variables interact.
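A correlation matrix is just Pearson's r computed for every pair of variables. The following sketch builds one in plain Python. The column names and numbers are made up for illustration and are not the Texas data, and unlike SPSS it does not report significance values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Map each pair of column names to the Pearson r of the two columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical county-level values.
data = {
    "pct_hispanic": [10.0, 35.0, 60.0, 85.0, 20.0],
    "pct_dem_vote": [30.0, 42.0, 55.0, 70.0, 33.0],
    "turnout": [62.0, 55.0, 47.0, 40.0, 60.0],
}

matrix = correlation_matrix(data)
for name, row in matrix.items():
    print(name, [round(value, 3) for value in row.values()])
```

The matrix is symmetric with 1 on the diagonal, which is why the SPSS output repeats each value on both sides of the diagonal.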

Results

As you can see from the scatterplot below, a Moran's I value of .6957 shows that there is significant clustering of Hispanic populations in the state of Texas. This means that there are clusters of counties with a high percent Hispanic population, as well as clusters of counties with a low percent Hispanic population. Figure X shows where the clustering of Hispanic populations is located. As one would expect, the counties in southern Texas contain a large percentage of Hispanic people and counties in northern Texas contain a smaller percentage of Hispanic people.



 
 


 
 
 
    Based on the Moran's I value of .5752, there is a large amount of clustering of democratic vote in Texas in 1980. In southern Texas there are a lot of counties with a high percentage of democratic vote. There is also a large cluster of high democratic vote in eastern Texas. In contrast, there is clustering of counties with low democratic vote in northern and western Texas.


 


 
    Compared to 1980, there is more clustering of democratic vote in 2008. There is clustering of high democratic vote in southern and western Texas, while there is clustering of low democratic vote in northern and north central Texas. The clustering of high democratic vote also seems to coincide with areas that have high Hispanic clustering.

 
 


   There is also clustering of voter turnout in 1980, indicated by the Moran's I value of .4681. It appears that areas that had clustering of high democratic vote also had clustering of low voter turnout. In contrast, areas with clustering of low democratic vote have clustering of high voter turnout.



    Compared to the voter turnout in 1980, there is less clustering of voter turnout in 2008, as indicated by the Moran's I value of .3634. Similarly, there is a correlation between clustering of high democratic vote and clustering of low voter turnout and a correlation between clustering of low democratic vote and clustering of high voter turnout.




    Although the local indicators of spatial autocorrelation show areas that have high spatial autocorrelation, they do not show the strength of the correlation between two variables. In order to determine the strength and direction of correlation between the variables above, it is necessary to create a correlation matrix (Figure ). In this correlation matrix there are some obvious trends that affect voting results. First, the correlation coefficient between percent Hispanic and percent democratic vote increased from .093 (not significant) in 1980 to .669 in 2008. This means that in 2008, areas with high Hispanic populations were more likely to vote democratic. However, there is a negative correlation between percent Hispanic and voter turnout in both 1980 (-.407) and 2008 (-.668), meaning that the counties with high Hispanic populations have lower voter turnout.
 
    Another strong negative correlation exists between voter turnout and percent of democratic vote, meaning that as voter turnout increases the percent of democratic vote decreases in both 1980 (-.612) and 2008 (-.604).



Correlations

                                 hd02_s114   Pres80D     Pres08D     vtp80       vtp08
hd02_s114   Pearson Correlation  1           .093        .669**      -.407**     -.668**
            Sig. (2-tailed)                  .139        .000        .000        .000
            N                    254         254         254         254         254
Pres80D     Pearson Correlation  .093        1           .540**      -.612**     -.484**
            Sig. (2-tailed)      .139                    .000        .000        .000
            N                    254         254         254         254         254
Pres08D     Pearson Correlation  .669**      .540**      1           -.600**     -.604**
            Sig. (2-tailed)      .000        .000                    .000        .000
            N                    254         254         254         254         254
vtp80       Pearson Correlation  -.407**     -.612**     -.600**     1           .664**
            Sig. (2-tailed)      .000        .000        .000                    .000
            N                    254         254         254         254         254
vtp08       Pearson Correlation  -.668**     -.484**     -.604**     .664**      1
            Sig. (2-tailed)      .000        .000        .000        .000
            N                    254         254         254         254         254

**. Correlation is significant at the 0.01 level (2-tailed).



 


Conclusion

    In conclusion, there is definite clustering of voting results and voter turnout throughout the state of Texas. There have also been some slight changes in the clustering patterns of both results and voter turnout between 1980 and 2008. The only area that had constant clustering of democratic vote and voter turnout between 1980 and 2008 is the southern tip of Texas. This area has had a large percent of democratic vote and a low voter turnout in both 1980 and 2008.