Thursday, March 12, 2015

Significance Testing


Part 1:

1.

2. 
Asian Long-Horned Beetle
     The null hypothesis states that the number of Asian Long-Horned Beetles in Bucks County is the same as the average for the entire state of Pennsylvania. The alternative hypothesis states that there is a difference between the number of beetles found in Bucks County and the average number of beetles for the entire state of Pennsylvania. Since the sample covers 50 fields, which is more than 30 observations, we use a z-test. For a two-tailed test at the 95% confidence level, the critical values are ±1.96.
   Using the z-test equation
\[ z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}} \]
where the sample mean \bar{x} is 3.2, the hypothesized mean \mu is 4, the standard deviation \sigma is .73, and the number of observations n is 50, the z-statistic is approximately -8. This is far below the critical value of -1.96, which allows us to reject the null hypothesis that there is no difference between the number of beetles found in Bucks County and the entire state of Pennsylvania. Because the z-statistic is negative, there are fewer Asian Long-Horned Beetles in Bucks County than you would expect to see.

Emerald Ash Borer Beetle
    The null hypothesis states that the number of Emerald Ash Borer Beetles found in Bucks County is the same as the statewide average per county. The alternative hypothesis states that the number of beetles found in Bucks County differs from the statewide average per county. For a two-tailed test at the 95% confidence level, the critical values are ±1.96. Using the same equation as above, the sample mean=11.7, the hypothesized mean=10, the standard deviation=1.3, and the number of observations=50. This gives a z-statistic of 9.4, which is well beyond the critical value and causes us to reject the null hypothesis. Therefore, there is a difference between the number of Emerald Ash Borer Beetles found in Bucks County and the number estimated for each county in Pennsylvania. Since the z-statistic is positive and greater than the critical value, we can determine that there are more Emerald Ash Borer Beetles in Bucks County.

Golden Nematode
     The null hypothesis states that the number of Golden Nematodes found in Bucks County is no different from the average number of Golden Nematodes found in any other county in Pennsylvania. The alternative hypothesis states that the number of Golden Nematodes found in Bucks County is different from the average number found in any other county in Pennsylvania. For a two-tailed test at the 95% confidence level, the critical values are ±1.96. Using the z-test equation above, the sample mean is 77, the hypothesized mean is 75, the standard deviation is 5.71, and the number of observations is 50. This gives a z-statistic of 2.47, which means we can reject the null hypothesis. There is therefore a difference between the number of Golden Nematodes found in Bucks County and the average per county in Pennsylvania. Since the z-statistic is positive and greater than the critical value, we can determine that there are more Golden Nematodes in Bucks County.
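
The three z-tests above can be checked with a short Python sketch. It plugs the rounded figures quoted in the text into the one-sample z formula; the function and variable names are my own and are not part of the original assignment. With these rounded inputs the statistics come out near -7.75, 9.25, and 2.48, close to but not identical to the values reported above, which were presumably computed from unrounded data. The decisions are the same in all three cases.

```python
import math


def z_statistic(sample_mean, hypothesized_mean, std_dev, n):
    """One-sample z statistic: (x_bar - mu) / (sigma / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (std_dev / math.sqrt(n))


CRITICAL_Z = 1.96  # two-tailed critical value at the 95% confidence level

# (sample mean, hypothesized mean, standard deviation, n) as quoted in the text
species = {
    "Asian Long-Horned Beetle": (3.2, 4, 0.73, 50),
    "Emerald Ash Borer Beetle": (11.7, 10, 1.3, 50),
    "Golden Nematode": (77, 75, 5.71, 50),
}

results = {}
for name, args in species.items():
    z = z_statistic(*args)
    rejected = abs(z) > CRITICAL_Z  # two-tailed: compare |z| to the critical value
    results[name] = (z, rejected)
    decision = "reject" if rejected else "fail to reject"
    print(f"{name}: z = {z:.2f} -> {decision} the null hypothesis")
```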




3.
     The null hypothesis states that there is no difference between the number of people per party in the 1965 survey and the number of people per party in the sample taken in 1985. Because we are using a one-tailed test, the alternative hypothesis states that the number of people per party in the 1985 sample is greater than in the 1965 survey. Since the number of observations is less than 30, we use a t-test rather than a z-test. For a one-tailed test at the 95% confidence level with 24 degrees of freedom (n - 1), the critical value is 1.711. Based on the t-test equation
\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]
the sample mean \bar{x} is 3.4, the hypothesized mean \mu is 2.1, the standard deviation s is 1.32, and the number of observations n is 25. Therefore, the t-statistic equals 4.92, which is much higher than the critical value of 1.711. This allows us to reject the null hypothesis and conclude that the number of people per party in the 1985 sample is higher than in the 1965 survey.
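
As a quick check of the arithmetic, here is a minimal sketch of the same one-sample t calculation. The critical value of 1.711 is hard-coded from a standard t-table (one-tailed, 95% confidence, 24 degrees of freedom); the names are illustrative only.

```python
import math


def t_statistic(sample_mean, hypothesized_mean, sample_sd, n):
    """One-sample t statistic: (x_bar - mu) / (s / sqrt(n))."""
    return (sample_mean - hypothesized_mean) / (sample_sd / math.sqrt(n))


n = 25
t = t_statistic(3.4, 2.1, 1.32, n)
df = n - 1  # degrees of freedom for a one-sample t-test

# Assumption: one-tailed critical value for alpha = 0.05 and 24 df, from a t-table
T_CRITICAL = 1.711
reject_null = t > T_CRITICAL

print(f"t = {t:.2f}, df = {df}, reject null: {reject_null}")
```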





Part 2:

Introduction

     There is a common debate among Wisconsin residents about what makes "Up North" different from areas in southern WI. Northern Wisconsin is commonly defined as the area north of Highway 29, which spans from Elk Mound all the way to Green Bay (Figure 1). It is known for its large forests, small population, and fun outdoor recreation. In this exercise we will be exploring different variables to determine whether they are good indicators of what makes up the great Wisconsin Northwoods. In order to compare different variables we will be using SPSS to run a Chi-squared test on three variables that are commonly thought to differentiate northern WI from southern WI. The three variables I will be using are total population, number of hotel beds, and miles of funded snowmobile trails.



Figure 1  Northern WI counties and southern WI counties determined by their location compared to Highway 29 that runs east/west across the state.

Methods

As previously stated, the main goal of this exercise is to determine if certain variables differentiate northern WI and southern WI. We start by assigning a value to each county based on whether it is located in northern WI or southern WI. We then join an Excel table to the counties shapefile in ArcMap. Attribute fields are then added to give each county a rank. The ranks are based on equal interval classifications for the range of each variable. The ranks for total population are as follows: 4709-236873 is given a 1, 236874-457278 is given a 2, 457279-701211 is given a 3, and 701212-933380 is given a 4. The population rank and the location value (1 or 2) are then input into the Crosstabs function in SPSS. SPSS then creates 2 tables that are useful in determining whether there is a difference between northern and southern counties. This process is repeated for the number of hotel beds and the total length of snowmobile trails per county.
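
The ranking step can be sketched in a few lines of Python. This is only an illustration of how a value maps to a class given the breaks listed above, not the ArcMap workflow itself. The second break is taken as 457278, consistent with the third class starting at 457279, and the example populations are hypothetical round numbers.

```python
from bisect import bisect_left

# Upper bound of each total-population class, in ascending order
POPULATION_BREAKS = [236873, 457278, 701211, 933380]


def rank_for(value, upper_bounds):
    """Return the 1-based class rank of value, given ascending class upper bounds."""
    if value > upper_bounds[-1]:
        raise ValueError("value exceeds the highest class")
    # bisect_left finds the first class whose upper bound is >= value
    return bisect_left(upper_bounds, value) + 1


# Hypothetical example populations (not taken from the county table)
examples = {"small county": 14000, "mid-sized county": 70000, "largest county": 933380}
ranks = {name: rank_for(pop, POPULATION_BREAKS) for name, pop in examples.items()}
print(ranks)
```

With equal interval breaks this wide, a county of 14,000 and a county of 70,000 land in the same class, which matters for the results below.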

Results


     Based on the equal interval classification scheme for total population, only four counties in the entire state had a population above the first class break of 236,873 people. These counties are Milwaukee County, Waukesha County, Dane County, and Brown County, which are all located in southern WI (Figure 2). After the population variable was entered into SPSS, two tables were produced. Table 1 shows the expected counts and observed counts for each population classification based on location. Table 2 shows the Chi-squared value associated with total population. Since the Chi-squared value is only 2.541, which is below the critical value of 5.99 for 2 degrees of freedom at 95% confidence, we fail to reject the null hypothesis that there is no difference between the population of northern WI and southern WI. The asymptotic significance leads to the same conclusion. This value is .281, which is much larger than the .05 threshold it would need to fall below to reject the null hypothesis at 95% confidence.




Figure 2  Total population of counties based on an equal interval classification scheme. Only four of the 72 counties in WI contain larger populations than the first classification break.





nvs * tot_poop Crosstabulation

                               tot_poop
                               1       2       4       Total
nvs    1    Count              27      0       0       27
            Expected Count     25.5    1.1     .4      27.0
       2    Count              41      3       1       45
            Expected Count     42.5    1.9     .6      45.0
Total       Count              68      3       1       72
            Expected Count     68.0    3.0     1.0     72.0

Table 1  Expected and observed counts for the population classifications of counties in northern WI (nvs = 1) versus southern WI (nvs = 2). No county falls in class 3, so the table has two rows and three columns, which gives the (2 - 1) x (3 - 1) = 2 degrees of freedom shown in Table 2.




Chi-Square Tests

                                Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square              2.541a     2     .281
Likelihood Ratio                3.900      2     .142
Linear-by-Linear Association    1.852      1     .174
N of Valid Cases                72

a. 4 cells (66.7%) have expected count less than 5. The minimum expected count is .38.

Table 2  Chi-squared test used to determine whether the null hypothesis should be rejected. In this case, the Chi-squared value of 2.541 is lower than the critical value of 5.99 for 2 degrees of freedom at 95% confidence, which means we fail to reject the null hypothesis.


     Although all of the major population centers are located in southern WI, southern WI also has many counties with very few people (41 of its 45 counties fall in the lowest class). Since the counties with small populations greatly outnumber the 4 counties with large populations, they have a bigger impact on the Chi-squared value.


     As with total population per county, there is no significant difference between the number of hotel beds in northern WI and southern WI. According to the Chi-squared chart, at 95% confidence with 3 degrees of freedom the critical value is 7.81. The Pearson's Chi-squared value of 3.871 is less than that, so we fail to reject the null hypothesis. We can also use the asymptotic significance value to determine whether there is a difference in hotel beds between northern and southern WI. Since we are working at 95% confidence, the asymptotic significance value must be below .05 to say that there is a difference between the two regions. In this case the asymptotic significance value of .276 is much larger than .05, meaning we fail to reject the null hypothesis.
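
The asymptotic significance values that SPSS reports are the upper-tail probabilities of the Chi-squared distribution. As a rough sketch, they can be reproduced with the closed-form expressions that exist for 1, 2, and 3 degrees of freedom. The function name is my own, and a statistics library would normally be used instead of these hand-written formulas.

```python
import math


def chi_square_p_value(x, df):
    """Upper-tail probability P(X >= x) for a Chi-squared variable with df in {1, 2, 3}."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    if df == 3:
        return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)
    raise ValueError("closed form only implemented for df = 1, 2, or 3")


p_population = chi_square_p_value(2.541, 2)  # total population, 2 df
p_hotel_beds = chi_square_p_value(3.871, 3)  # hotel beds, 3 df

for label, p in [("total population", p_population), ("hotel beds", p_hotel_beds)]:
    verdict = "reject" if p < 0.05 else "fail to reject"
    print(f"{label}: p = {p:.3f} -> {verdict} the null hypothesis")
```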
    


Figure 3  The number of hotel beds per county based on equal interval classification.





nvs * htl_bds Crosstabulation

                               htl_bds
                               1       2       3       4       Total
nvs    1    Count              23      3       1       0       27
            Expected Count     23.3    2.3     .4      1.1     27.0
       2    Count              39      3       0       3       45
            Expected Count     38.8    3.8     .6      1.9     45.0
Total       Count              62      6       1       3       72
            Expected Count     62.0    6.0     1.0     3.0     72.0

Table 3  Expected and observed counts of counties that fall into each hotel bed classification.






Chi-Square Tests

                                Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square              3.871a     3     .276
Likelihood Ratio                5.173      3     .160
Linear-by-Linear Association    .241       1     .623
N of Valid Cases                72

a. 6 cells (75.0%) have expected count less than 5. The minimum expected count is .38.


Table 4  The Chi-square value of 3.871 is much lower than the critical value of 7.81 required to reject the null hypothesis with 3 degrees of freedom.

    
    

Finally, many Wisconsin residents consider northern WI a snowmobiler's paradise. With 3 degrees of freedom at 95% confidence, the Pearson's Chi-squared value must be at least 7.81 to reject the null hypothesis. As you can see from Table 6, the Chi-square value for snowmobile trails is 24.424, meaning that there is a significant difference between snowmobile trails in northern WI and southern WI.




Figure 4  Length of funded snowmobile trails per county based on equal interval classification.




nvs * snow_trail Crosstabulation

                               snow_trail
                               1       2       3       4       Total
nvs    1    Count              2       13      9       3       27
            Expected Count     8.3     13.9    3.8     1.1     27.0
       2    Count              20      24      1       0       45
            Expected Count     13.8    23.1    6.3     1.9     45.0
Total       Count              22      37      10      3       72
            Expected Count     22.0    37.0    10.0    3.0     72.0

Table 5  Observed and expected counts for snowmobile trail length per county. These values are used to calculate Pearson's Chi-square value.




Chi-Square Tests

                                Value      df    Asymp. Sig. (2-sided)
Pearson Chi-Square              24.424a    3     .000
Likelihood Ratio                27.387     3     .000
Linear-by-Linear Association    22.494     1     .000
N of Valid Cases                72

a. 3 cells (37.5%) have expected count less than 5. The minimum expected count is 1.13.


Table 6  Pearson's Chi-square value for snowmobile trails in northern vs southern WI. As you can see, the Pearson's Chi-square value of 24.424 is much higher than the critical value of 7.81. Therefore, there is a difference between snowmobile trail length in northern and southern WI.
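
The Pearson values in Tables 2, 4, and 6 can be reproduced from the observed counts in Tables 1, 3, and 5. The sketch below is a plain-Python version of the calculation that SPSS Crosstabs performs: expected counts come from the row and column totals, and the statistic sums (observed - expected)^2 / expected over all cells. The function name and dictionary layout are my own, and the critical values are the usual 95% table values.

```python
def pearson_chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # expected count under independence
            chi2 += (obs - expected) ** 2 / expected
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df


# Observed counts: first row is northern WI (nvs = 1), second row is southern WI (nvs = 2)
tables = {
    "total population": [[27, 0, 0], [41, 3, 1]],
    "hotel beds": [[23, 3, 1, 0], [39, 3, 0, 3]],
    "snowmobile trails": [[2, 13, 9, 3], [20, 24, 1, 0]],
}

CRITICAL_95 = {2: 5.991, 3: 7.815}  # Chi-squared critical values at 95% confidence

outcomes = {}
for name, observed in tables.items():
    chi2, df = pearson_chi_square(observed)
    outcomes[name] = (chi2, df, chi2 > CRITICAL_95[df])
    print(f"{name}: chi-square = {chi2:.3f}, df = {df}, significant: {chi2 > CRITICAL_95[df]}")
```

Only the snowmobile trail table exceeds its critical value, which matches the SPSS output.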
 


Conclusion

Based on the Chi-square values for each of the variables selected, the only one that actually differentiates northern and southern WI is the total length of snowmobile trails per county. I had thought that population would also be a determining factor, but based on the way the classifications were set up, the only counties that differed in population class were Milwaukee, Waukesha, Dane, and Brown Counties. I feel that if the classifications had been set up differently, total population and number of hotel beds would have been significant in differentiating between the two regions. Even though Eau Claire contains over 70,000 people, it still showed up in the lowest population class, which gave it the same overall weight as Rusk County, which only has 14,000 residents.




Thursday, February 26, 2015

Z-Scores, Mean Center, and Standard Distance


 

Introduction

     As climate continues to change, the severity and spatial pattern of tornadoes are likely to change as well. It is widely believed that tornadoes will become more frequent and more intense across the Midwest. Therefore, it is important to analyze their location and intensity over recent years to determine spatial patterns. Once spatial patterns have been determined, government officials can use this data to create laws that will increase the safety of individuals at risk. In this exercise we will be calculating the mean center, weighted mean center, standard distance, and weighted standard distance of reported tornadoes with a known width for the periods 1995-2006 and 2007-2012. This information will then be used to determine if storm shelters should be required across the entirety of Oklahoma and Kansas or just in areas that experience the majority of the largest tornadoes.
 

Study Area

      For this exercise, we will be examining the two states that experience the largest number of severe tornadoes, Oklahoma and Kansas (Figure 1). At the heart of "Tornado Alley," these two states were impacted by 2,221 tornadoes from 1995 to 2012, with 66 being over 700 feet in width. The combined population of Oklahoma and Kansas is 6,782,072 according to 2014 estimates and the combined land surface area is 152,175 square miles, leading to a combined population density of roughly 45 people per square mile.
 
 
 

Methodology

     Several different statistical tools can be used in order to determine the spatial distribution of tornadoes across Oklahoma and Kansas. Based on the geographic data of tornadoes we use ArcGIS to calculate the mean center, weighted mean center, standard distance, and the weighted standard distance. The mean center of the data is the geographic center of all of the points. In order to calculate the mean center, the latitude values of all points are averaged and all the longitude values are averaged (Equation 1).
 
 
 
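\[ \bar{X} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{Y} = \frac{\sum_{i=1}^{n} y_i}{n} \]
Equation 1  Mean center, where x_i and y_i are the coordinates of tornado i and n is the number of tornadoes.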
 
 
     Like the mean center, the weighted mean center finds the geographic center of a set of points. However, the weighted mean center applies a weight to the latitude and longitude values of each point (Equation 2). In this scenario, the width of each tornado is used as the weight, which shifts the overall center of the data toward areas that have high frequencies of large tornadoes.
 
 
 
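\[ \bar{X}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}, \qquad \bar{Y}_w = \frac{\sum_{i=1}^{n} w_i y_i}{\sum_{i=1}^{n} w_i} \]
Equation 2  Weighted mean center, where w_i is the weight (here the width) of tornado i.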
 
 
     Next, a standard distance is created around the tornado locations for the years 1995-2006 and 2007-2012, respectively. The standard distance is the radius of the circle around the mean center that contains roughly 68 percent of the total points (Equation 3). This information can be useful in determining how concentrated a data set is around the geographic mean center.
 
 
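\[ SD = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n} + \frac{\sum_{i=1}^{n} (y_i - \bar{Y})^2}{n}} \]
Equation 3  Standard distance around the mean center.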
 
 
      A weighted standard distance was also applied to the tornado data. Like the standard distance, the weighted standard distance determines the length of a radius that contains roughly 68% of the total data points, adjusted by some defined weight (Equation 4). The weighted standard distance is useful for determining the concentration of data points based on whichever weight is applied, in this case the width of tornadoes.
 
 
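\[ SD_w = \sqrt{\frac{\sum_{i=1}^{n} w_i (x_i - \bar{X}_w)^2}{\sum_{i=1}^{n} w_i} + \frac{\sum_{i=1}^{n} w_i (y_i - \bar{Y}_w)^2}{\sum_{i=1}^{n} w_i}} \]
Equation 4  Weighted standard distance around the weighted mean center.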
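
The four measures described above were calculated in ArcGIS, but the underlying arithmetic is simple enough to sketch in Python. The example below assumes planar (projected) x/y coordinates and uses four made-up points with made-up widths. None of these numbers come from the tornado data set, and the function names are illustrative.

```python
import math


def mean_center(points, weights=None):
    """Return the (weighted) mean center of a list of (x, y) points."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted case: every point counts equally
    total = sum(weights)
    cx = sum(w * x for (x, _), w in zip(points, weights)) / total
    cy = sum(w * y for (_, y), w in zip(points, weights)) / total
    return cx, cy


def standard_distance(points, weights=None):
    """Return the (weighted) standard distance around the (weighted) mean center."""
    if weights is None:
        weights = [1.0] * len(points)
    total = sum(weights)
    cx, cy = mean_center(points, weights)
    var_x = sum(w * (x - cx) ** 2 for (x, _), w in zip(points, weights)) / total
    var_y = sum(w * (y - cy) ** 2 for (_, y), w in zip(points, weights)) / total
    return math.sqrt(var_x + var_y)


# Hypothetical tornado locations and widths, for illustration only
points = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)]
widths = [1.0, 1.0, 3.0, 3.0]

center = mean_center(points)
weighted_center = mean_center(points, widths)
sd = standard_distance(points)
weighted_sd = standard_distance(points, widths)

print(center, weighted_center, round(sd, 3), round(weighted_sd, 3))
```

In this toy example the two widest points sit at y = 2, so the weighted mean center moves toward them and the weighted standard distance shrinks, the same kind of shift discussed in the results below.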
 
    
     Next, the Z-scores were calculated for three counties in the area. A Z-score is the relative position of a data value compared to the mean. It tells us how many standard deviations above or below the mean a specific value falls.
 
 
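\[ Z_i = \frac{X_i - \mu}{\sigma} \]
Equation 5  Z-score, where X_i is the number of tornadoes in county i, \mu is the mean number of tornadoes per county, and \sigma is the standard deviation.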
 
     
     Finally, we calculated the number of tornadoes that will occur in any given county 70% of the time and 20% of the time respectively. In order to do this we need to find the corresponding Z-score associated with each percentage, and solve for Xi in Equation 5.

Results

 
     Based on the calculations above, the geographic mean center of tornadoes in the study area between 1995 and 2006 is located just north of the border of Kansas and Oklahoma, almost exactly in the center of the study area. This shows us that tornadoes are spread fairly evenly throughout the area. However, when weighted by tornado width, the weighted mean center shifts slightly south. This indicates that while tornadoes are spread evenly throughout the study area, the wider tornadoes are located in the southern portion of the study area.
 
 
Figure 2  Locations of tornadoes from 1995-2006 and the
location of the geographic mean center and the weighted mean
center based on tornado width.
 
      Between the years 2007-2012 the location of tornadoes shifts slightly compared to 1995-2006. The geographic mean center shifts slightly north, showing that a larger number of tornadoes are occurring in Kansas. The weighted mean center shifted slightly north and slightly east. This means that fewer large tornadoes occurred in the southern portion of the study area. It also means more large tornadoes occurred in the eastern portion of the study area.
 
Figure 3  Locations of tornadoes from 2007-2012 and the location
of the geographic mean center and the weighted mean
center based on width.
 
      Figure 4 shows a combination of Figure 2 and Figure 3. As you can see, fewer large tornadoes occurred in the southern portion between 2007-2012 than 1995-2006. You can also see that fewer large tornadoes occurred in the western portion of the study area from 2007-2012 than 1995-2006.

Figure 4  The mean center and weighted mean center of tornadoes
from 1995-2006 compared to tornadoes from 2007-2012.
 
 
     Figure 5 shows the standard distance of tornado locations weighted by width. The weighted standard distance shows the area that contains roughly 68% of the width-weighted tornado points. The standard distance in this case is fairly large, which indicates that there is not a strong concentration of points located in one area, but rather that tornadoes are spread fairly evenly across the area.


Figure 5  Standard distance of tornadoes weighted on tornado width.
 
 
     From 2007-2012 the weighted standard distance (Figure 6) is much smaller than the weighted standard distance from 1995-2006. This means that more of the larger tornadoes became concentrated toward the center of the study area.
 
Figure 6  Weighted standard distance of tornadoes from 2007-2012
based on width.
 
 
      Figure 7 compares the weighted standard distance from 1995-2006 and from 2007-2012. There is a slight shift of larger tornadoes to the northeast and an overall increase in the concentration of larger tornadoes toward the weighted mean center.


Figure 7  Comparing the weighted standard distance between 1995-2006 and 2007-2012
to show how the concentration of tornadoes has changed over the two time periods.
 
 
     We also mapped the standard deviation of tornadoes per county from 2007-2012 (Figure 8). This map shows the number of tornadoes per county compared to the mean. There are 8 counties throughout the study area with tornado counts more than 1.5 standard deviations above the mean, while around 50 counties had counts more than .5 standard deviations below the mean.


Figure 8  Map showing the number of tornadoes per county based on the
standard deviation.
 
 
 
 
      According to the data, Russell County, KS experienced 25 tornadoes between 2007 and 2012. This number is far higher than the mean of 4 tornadoes per county (the standard deviation is 4.3). Using Equation 5 above, the Z-score of Russell County, KS is 4.88. This means that the number of tornadoes in Russell County, KS is 4.88 standard deviations above the mean of the study area. Caddo County, OK experienced a lower number, 13 tornadoes, from 2007-2012. However, 13 tornadoes is still well above the mean of 4 tornadoes per county. Using Equation 5, the Z-score of Caddo County, OK was calculated to be 2.09, meaning that the number of tornadoes in Caddo County is 2.09 standard deviations above the mean. Finally, the Z-score was calculated for Alfalfa County, OK. This county experienced only 5 tornadoes from 2007-2012. Using Equation 5, the Z-score was calculated to be .23, meaning the number of tornadoes in Alfalfa County was only .23 standard deviations above the mean.
 
     Finally, based on Equation 5, we determined that a county in the study area will experience at least 1.764 tornadoes 70% of the time over the given years. We also conclude that a county in the study area will experience at least 7.612 tornadoes only 20% of the time over the given years.
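
The Z-score figures above can be checked with a short sketch. It assumes a mean of 4 tornadoes per county and a standard deviation of 4.3, the combination that reproduces all five reported numbers. It also assumes tornado counts are roughly normally distributed and rounds each Z-score to two decimals to mimic a printed Z-table lookup. The function names are my own.

```python
from statistics import NormalDist

MEAN = 4.0     # assumed mean number of tornadoes per county, 2007-2012
STD_DEV = 4.3  # assumed standard deviation of tornadoes per county


def z_score(x, mean=MEAN, sd=STD_DEV):
    """Equation 5: number of standard deviations x lies from the mean."""
    return (x - mean) / sd


def count_exceeded_with_probability(p, mean=MEAN, sd=STD_DEV):
    """Solve Equation 5 for X_i so that a county meets or exceeds X_i with probability p."""
    z = round(NormalDist().inv_cdf(1 - p), 2)  # two decimals, like a printed Z-table
    return mean + z * sd


counties = {"Russell County, KS": 25, "Caddo County, OK": 13, "Alfalfa County, OK": 5}
z_scores = {name: z_score(count) for name, count in counties.items()}
for name, z in z_scores.items():
    print(f"{name}: Z = {z:.2f}")

x_70 = count_exceeded_with_probability(0.70)
x_20 = count_exceeded_with_probability(0.20)
print(f"70% of the time: at least {x_70:.3f}; 20% of the time: at least {x_20:.3f}")
```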
 

Conclusion

     In conclusion, the mean center and weighted mean center don't necessarily tell us whether or not an area should be forced to build tornado shelters. They mainly tell us that tornadoes occur fairly evenly throughout the entirety of the study area. The standard distance also does not tell us where tornado shelters should be required because the radius of the standard distance is so large. If more tornadoes were concentrated closer to the mean center, a case could be made to require tornado shelters there. Finally, the map of the standard deviation does provide a good example of where tornado shelters should be required. Counties with tornado counts more than .5 standard deviations below the mean should not be required to build tornado shelters, while counties with counts more than 1.5 standard deviations above the mean should be required to build shelters. However, since so many tornadoes occur in a fairly random pattern, it would not be a bad idea to strongly encourage everybody living in this area to have a tornado shelter. One can never be quite sure whether or not a specific county will experience an EF5 tornado that will have catastrophic effects.