# Describe the type of data that you will collect and how you plan to collect this data

Either individually or in groups of 2 or 3, your task is to perform some real-world inferential statistics. You will take a claim that someone has made, form a hypothesis from that, collect the data necessary to test the hypothesis, perform a hypothesis test, and interpret the results.

*You will test whether fewer than 50% of students participate in the Student Evaluation of Teaching system (SETS) in the School of Business Administration at USCA. Why or why not?*

Determine and describe the type of data that you will collect and how you plan to collect it in order to answer your questions. You will need to collect data on many characteristics of your sample so that these characteristics can later be compared (e.g., before and after data; comparisons by gender, major, type, year, age, etc.). Define the population and the sample that you will be studying. You must sample at least 100 students in the School of Business Administration (SOBA).
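The claim that fewer than 50% of students participate is a left-tailed test about a single proportion. As a rough cross-check on the values you generate in Excel, the calculation could be sketched in Python as follows. The counts (42 participants out of 100 sampled), the 0.05 significance level, and the function name `one_proportion_z_test` are assumptions made for illustration only; they are not real data or part of the assignment.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(successes, n, p0=0.5, alpha=0.05):
    """Left-tailed z-test of H0: p = p0 against H1: p < p0."""
    p_hat = successes / n                      # sample proportion
    std_error = sqrt(p0 * (1 - p0) / n)        # standard error under H0
    z = (p_hat - p0) / std_error               # test statistic
    p_value = NormalDist().cdf(z)              # area to the left of z
    critical_z = NormalDist().inv_cdf(alpha)   # boundary of the critical region
    return {
        "p_hat": p_hat,
        "z": z,
        "p_value": p_value,
        "critical_z": critical_z,
        "reject_h0": p_value < alpha,
    }


# Hypothetical counts, for illustration only: 42 of 100 sampled students participated.
result = one_proportion_z_test(successes=42, n=100)
print(result)
```

With these made-up counts the test statistic falls just short of the critical region, so the null hypothesis would not be rejected at the 0.05 level. Your own sample will give different values.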

## Project Components

The report will include a description of the problem and why you think it is important, or what you hope to gain from testing the hypothesis. It should also include the context of the data, all data collected, and the values generated in **Excel**. A decision and conclusion should be stated, followed by an analysis of what the conclusion means in terms of the original problem. The report should be in narrative format, as if you were writing for a newspaper or magazine, and must be typed, printed, and double spaced.

An *excellent* final report (100 points) will have the following components.

• An introduction to the problem including the claim(s) being tested

• The context (who, what, where, when, why, how) of the data (remember this is in narrative format) and any possible problems with collecting the data

• Descriptive statistics and/or tables depending on your type of data

• Appropriate graphs (every project should have at least one graph or chart of the data in it)

• Inferential statistics including …

• the null and alternative hypotheses written symbolically

• statistical output including a test statistic and p-value

• a graph showing the critical and non-critical regions, test statistic, and p-value

• the decision and a conclusion written in terms of the original claim

• Conclusion

• Suggestions for the next time this project is done

• No statistical usage errors

## What can we test?

Some things are easier to test than other things. The purpose of this project is to expose you to the process of hypothesis testing in a real-world application. You may test means, proportions, or linear correlation. You may have one or more samples. You may categorize your variables in one or two ways.

If you are dealing with one sample, then you will need some numerical value to test against. The claim “more people prefer Pepsi than Coke” becomes a claim that the proportion of Pepsi drinkers is greater than 0.5. There are not two independent samples (Pepsi drinkers / Coke drinkers), just one sample categorized in two ways. A problem with the Pepsi / Coke example is that it leaves out other soft drinks, because including them makes the analysis more difficult. If other soft drinks are included, a chi-square goodness-of-fit test would be more appropriate.
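To illustrate the goodness-of-fit idea, here is a minimal Python sketch with three categories (Pepsi, Coke, Other). The observed counts and the null hypothesis of equal preference are assumptions invented for this example. The p-value shortcut used here, exp(-χ²/2), is exact only when there are 2 degrees of freedom (three categories); for any other number of categories, use Excel or a chi-square table instead.

```python
from math import exp


def chi_square_gof(observed, expected_props):
    """Return the chi-square goodness-of-fit statistic and its degrees of freedom."""
    n = sum(observed)
    expected = [n * p for p in expected_props]   # expected counts under H0
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1
    return statistic, df


# Hypothetical survey counts, for illustration only: Pepsi, Coke, Other.
observed = [45, 35, 40]
stat, df = chi_square_gof(observed, [1 / 3, 1 / 3, 1 / 3])

# Closed-form right-tail area, valid ONLY for df == 2.
p_value = exp(-stat / 2) if df == 2 else None
print(stat, df, p_value)
```

A large p-value, as these made-up counts produce, means the sample gives no evidence against the claim that the three choices are equally popular.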

#### Categorical Data

If your data consists solely of categories and not measured quantities, then you should be looking at proportions or counts.

Words that signal categorical data or proportions include: proportions, percents, counts, frequencies, fractions, or ratios. If your data consists of names or labels, you’re dealing with categorical data.

You really need to think about the response that was recorded for each case. Did you record a yes/no response for each case or did you record a number that means something? If it was a yes/no or other categorical data, then this is the place to be.

#### Example Claims about Categorical Data

• 93.1% of Americans feel there should not be nudity on television during children’s viewing time.

http://www.parentstv.org/PTC/publications/lbbcolumns/2003/0528.asp

This is a claim about a single proportion. We know this because the value includes a percentage and the data is categorical (yes or no), not numerical. The original claim here could be written as p=0.931.
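Written symbolically, a test of this claim might look like the sketch below. It assumes a two-tailed alternative (because the claim is an equality) and the usual normal approximation. The symbols \hat{p} (sample proportion) and n (sample size) are introduced here for illustration.

```latex
H_0: p = 0.931 \qquad H_1: p \neq 0.931
\qquad
z = \frac{\hat{p} - 0.931}{\sqrt{\dfrac{0.931\,(1 - 0.931)}{n}}}
```

The normal approximation is reasonable only when the sample is large enough that both n(0.931) and n(0.069) are comfortably large (a common rule of thumb is at least 5 or 10, depending on the textbook).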

#### Quantitative (Numerical) Data

If your data consists of measured quantities, then you will probably be testing a mean or perhaps correlation between two variables. It is possible to test a claim about a standard deviation, but that is rare, and not covered in this course.

There are four main ways to analyze means.

1. A test about a single mean that requires a number as the claimed value.

2. A test about two independent means doesn’t need a number because you compare them to each other. This compares the same thing in two different groups.

3. A test for two dependent means, often called paired samples, compares two values for each case in the same group.

4. The Analysis of Variance is an extension of the two independent samples case where there are more than two groups.
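The difference between the independent and paired cases in the list above can be seen in how the test statistic is built. The sketch below computes both for the same made-up numbers. The data, the function names, and the use of the unpooled (Welch) standard error for the independent case are assumptions for illustration. Running the independent test on paired data is shown only for contrast; it is not the correct analysis for such data. P-values would still come from Excel or a t table.

```python
from math import sqrt
from statistics import mean, stdev


def independent_t(sample_a, sample_b, hypothesized_diff=0.0):
    """t statistic for two independent means, using the unpooled standard error."""
    std_error = sqrt(
        stdev(sample_a) ** 2 / len(sample_a) + stdev(sample_b) ** 2 / len(sample_b)
    )
    return (mean(sample_a) - mean(sample_b) - hypothesized_diff) / std_error


def paired_t(before, after):
    """t statistic for two dependent means: a one-sample test on the differences."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical before/after scores for the same five cases, for illustration only.
before = [10, 12, 9, 11, 13]
after = [12, 13, 11, 12, 16]

t_paired = paired_t(before, after)
t_independent = independent_t(after, before)  # ignores the pairing, for contrast
print(t_paired, t_independent)
```

Because the paired test works with the within-case differences, it removes case-to-case variation and gives a much larger test statistic here than the independent version does. The `hypothesized_diff` parameter covers claims that the two means differ by a specific amount rather than by zero.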

You can also perform correlation and regression with two quantitative variables. Simple regression, with just one predictor variable, is covered in the book. Multiple regression, with several predictor variables, is not covered in the textbook but is available online.

#### Example Claims about Quantitative Data

• Women live five years longer than men. http://www.medicalnewstoday.com/medicalnews.php?newsid=18866 This is a claim about two averages: the average lifespan of women and that of men. The claim itself does not state the average for either gender (although they’re given in the article); it says only that women are supposed to live five years longer than men. When you’re working with one sample, it’s important to have a value to compare against, but with two samples, you don’t need a value for each, just the difference between the two (in this case, 5 years). The original claim here could be written as μw - μm = 5 (the difference in the mean lifespans of women and men is 5 years).

• Seat belts save lives. http://dot.state.il.us/trafficsafety/seatbelt june 2006.pdf and http://www-fars.nhtsa.dot.gov/FinalReport.cfm?stateid=17&title=states&title2=fatalities_and_fatality_rates&year=2005. Okay, this claim is all over the place, but I wanted to give some links on how it would be tested.

You could take the data regarding the percent of people wearing their seat belts and compare it to the fatality rate. These are two numerical values that are paired together for each case (probably based on an annual report). Remember that you cannot perform correlation and regression with categorical variables. The original claim that seat belts save lives would be interpreted as a negative correlation (as seat belt use goes up, fatalities go down) and would be written as ρ<0.
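A minimal Python sketch of that correlation test follows. The five pairs of seat belt use (percent) and fatality rate are invented for illustration, not taken from the linked reports, and the function names are hypothetical. The sketch computes the sample correlation r and the usual t statistic with n - 2 degrees of freedom. The p-value would come from Excel or a t table.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Sample linear correlation coefficient between paired lists x and y."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)


# Hypothetical annual figures, for illustration only.
belt_use = [70, 75, 80, 85, 90]               # percent wearing seat belts
fatality_rate = [1.5, 1.4, 1.4, 1.2, 1.1]     # fatalities per unit of travel

r = pearson_r(belt_use, fatality_rate)
t = correlation_t(r, len(belt_use))
print(r, t)
```

A negative r with a strongly negative t statistic would support the alternative ρ<0 in a left-tailed test. Keep in mind that correlation alone does not show that seat belt use causes the change in fatalities.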

### Sample Final Report

Sample projects and resources are available online. Your project need not be as long or as detailed.