


Tests for Differences



Even when, in
reality, there are no meaningful differences between groups or no real changes
over time, there will be fluctuations in the data that arise solely from
sampling or measurement error. Consider flipping a coin. Even though
"heads" and "tails" each have a probability of 0.5 (a 50-50 chance), you
will not often get exactly five of each in ten tosses of
the coin. The same is true of taking any measurement from samples drawn
from the population: even when there are no real differences, the
data you collect across a series of samples (two subgroups or two points in time
can represent two samples) will rarely match perfectly across the samples.
One will usually seem "more" and the other "less." Sometimes, by chance
alone, some will be a lot more, and others a lot less.
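The coin example can be checked with a few lines of arithmetic. The sketch below assumes a fair coin and independent tosses, and uses the binomial formula to show how rarely ten tosses split exactly five and five, and how often they come out lopsided by chance alone. The function name is illustrative only.

```python
from math import comb

def prob_exact_heads(n_tosses: int, n_heads: int, p_heads: float = 0.5) -> float:
    """Binomial probability of getting exactly n_heads in n_tosses."""
    return (comb(n_tosses, n_heads)
            * p_heads ** n_heads
            * (1 - p_heads) ** (n_tosses - n_heads))

# Chance of a perfect 5-5 split in ten tosses of a fair coin.
p_five = prob_exact_heads(10, 5)

# Chance of a split at least as lopsided as 8-2 (in either direction).
p_lopsided = sum(prob_exact_heads(10, k) for k in (0, 1, 2, 8, 9, 10))

print(f"P(exactly 5 heads in 10 tosses) = {p_five:.3f}")      # 0.246
print(f"P(an 8-2 split or worse)        = {p_lopsided:.3f}")  # 0.109
```

In other words, a perfectly even split turns up only about one time in four, and a split as uneven as 8-2 turns up about one time in nine even though nothing is wrong with the coin.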
As a result, it is often
necessary to apply some statistical criterion (or criteria) to decide which
differences are large enough that we will not simply attribute them to normal
(random) fluctuations in the data. In essence, we use a statistical test
to help us decide how much of a difference is likely to be a real difference.
Generally, such tests weigh differences between groups (or between two or more
time periods) against the variability that is seen in the data within each of
the groups (or within each time point). When differences between groups or
over time grow larger than would be expected given the variability observed
within groups or within individual time points, then they become less likely to
be due to chance alone. By convention, when the differences observed are
likely to be seen less often than once in twenty times by chance alone (a
probability of less than 0.05, expressed as "p < .05"), then we
will accept them as statistically significant. If we make many
comparisons, we will also consider the pattern of findings, so that occasional
differences that arise within the context of many comparisons are evaluated
within the context of the findings as a whole. Ideally, individual
significant differences will form a pattern that increases our confidence that
they are meaningful.
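To see why the pattern of findings matters when many comparisons are made, the sketch below computes the chance that at least one comparison reaches p < .05 when no real differences exist at all. It assumes the comparisons are independent of one another, which is a simplification; the function name is illustrative only.

```python
def prob_any_false_positive(n_comparisons: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent comparisons looks
    'significant' at the given alpha when no real differences exist."""
    return 1 - (1 - alpha) ** n_comparisons

for n in (1, 5, 20):
    print(f"{n:>2} comparisons: {prob_any_false_positive(n):.0%}")
```

Under that assumption, the chance of at least one spurious "significant" result rises from 5% for a single comparison to roughly 23% for five comparisons and roughly 64% for twenty, which is why an isolated significant difference among many tests deserves less weight than a consistent pattern.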
The specific tests that
are applied for any given comparison will vary depending on the nature of the
data that are being analyzed. Unless you have a solid background in
measurement and statistics, it is likely that you will need the assistance of a
professional analyst to conduct such tests and to interpret their findings.
At GuideStar Research, evaluating the significance of group differences and
changes that occur over time is a routine part of our strategic consulting and
reporting services.
Parametric tests. The t-test and
analysis of variance procedures are common tests within this category.
These tests compare average (mean) scores across groups or within one or more
groups over time. Though more powerful than the alternatives, these
procedures are designed for data that are collected using interval-level
measurement and that are normally (or at least symmetrically) distributed.
If you do not have an interval-level measure or if your data are not normally
distributed, then these procedures may not be appropriate for your analyses.
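As a rough illustration of how a t-test weighs the difference between two group averages against the variability within each group, the sketch below computes the t statistic and approximate degrees of freedom for Welch's unequal-variance version of the test. The choice of Welch's variant, the function name, and the ratings are all assumptions made for illustration. Converting t and the degrees of freedom into a p-value requires the t distribution, which a statistics package or a printed table would supply.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error of each mean: within-group variance / group size.
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    # Difference between group means relative to within-group variability.
    t = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df

# Hypothetical satisfaction ratings from two groups (illustrative data only).
region_a = [7.1, 6.8, 7.4, 7.9, 6.5, 7.2, 7.6, 7.0]
region_b = [6.2, 6.9, 6.4, 6.0, 6.7, 6.3, 6.6, 6.1]

t_stat, df = welch_t(region_a, region_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The larger the absolute value of t, the less plausible it is that the difference between the two averages reflects chance fluctuation alone.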
Nonparametric tests. These include procedures that
use a variety of approaches suitable to a range of different comparisons.
Some, for example, compare the observed frequencies in data tables to what we
might expect from chance (e.g., the chi-square (χ²) test). Others
rank the data within each group and then compare ranks in each of the groups
(e.g., the Mann-Whitney U test) or look at differences as simply "more" or
"less" in pairs of data taken from the same group of people (e.g., the sign
test; the related Wilcoxon signed-rank test also ranks the size of each
difference). When data are not normally distributed (e.g., there are
unusually low or high values or the data are skewed towards higher or lower
values), then these tests or other procedures for normalizing data may also be
indicated. Again, this is where the expert consulting we provide at
GuideStar Research plays an important role in the research process.
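The comparison of observed frequencies with chance expectations can be sketched as follows. This is a minimal version of the Pearson chi-square statistic without a continuity correction; the counts and names are hypothetical. The cutoff of 3.841 is the standard critical value for one degree of freedom at p = .05, which applies to a table with two rows and two columns.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table of observed counts (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Count expected in this cell if group and response were unrelated.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed_count - expected) ** 2 / expected
    return statistic

# Hypothetical counts: rows are two groups, columns are (satisfied, not satisfied).
observed = [[60, 40],
            [45, 55]]

CRITICAL_DF1_P05 = 3.841  # chi-square critical value, 1 degree of freedom, p = .05

chi2 = chi_square_statistic(observed)
print(f"chi-square = {chi2:.2f}")                                  # 4.51
print("exceeds the .05 critical value:", chi2 > CRITICAL_DF1_P05)  # True
```

With these made-up counts the statistic is about 4.51, which exceeds 3.841, so the difference between the two groups would be treated as statistically significant at the .05 level.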
Survey researchers should present you with the margin of
error for estimates of percentages along with the percentages themselves.
Differences that exceed 1.7 times the margin of error (this generally
approximates the margin of error for difference scores) are usually meaningful
and those that do not usually are not. If the President has a 60% approval
rating today and had a 55% approval rating last month, each with a margin of
error of ±4%, then the margin of error for differences is slightly larger
than the change that was observed across the two measurement times. As a
result, we would suggest caution in interpreting this as a meaningful increase
in approval ratings. When using this kind of rule of thumb, however, be
advised that margins of error that apply to the sample as a whole will be
smaller than the margin of error for subgroups within the data. When
comparing two subgroups, you will need the margins of error for those groups to
use this approach.
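The approval-rating example can be worked through in a few lines. The sketch below applies the 1.7 multiplier described above and, for comparison, the textbook way of combining the margins of error of two independent samples (the square root of the sum of their squares), which for equal margins comes to about 1.4 times the single margin. Treating the two polls as independent samples is an assumption, and the function names are illustrative only.

```python
from math import sqrt

def moe_difference_rule_of_thumb(moe: float, factor: float = 1.7) -> float:
    """Margin of error for a difference, using the multiplier from the text."""
    return factor * moe

def moe_difference_independent(moe_a: float, moe_b: float) -> float:
    """Margin of error for a difference between two independent estimates."""
    return sqrt(moe_a ** 2 + moe_b ** 2)

change = 60 - 55                                # observed change, in points
rule = moe_difference_rule_of_thumb(4)          # 6.8 points
combined = moe_difference_independent(4, 4)     # about 5.66 points

print(f"observed change:            {change} points")
print(f"rule-of-thumb threshold:    {rule:.1f} points")
print(f"independent-samples margin: {combined:.2f} points")
print("change exceeds rule of thumb:", change > rule)          # False
print("change exceeds combined margin:", change > combined)    # False
```

Either way, the five-point change falls inside the margin of error for the difference, which supports the caution suggested above; the 1.7 multiplier is simply the stricter of the two thresholds.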
When averages (mean
scores) are computed, we can also compute the confidence intervals for
those averages. When the confidence intervals for two averages (means) do
not overlap, it is likely that the scores are significantly
different. When they do overlap, then unless the groups are very large,
they are unlikely to be significantly different. Some researchers
prefer approaches like this (e.g., looking at the confidence interval for the
difference between two scores) as it is a more conservative test.
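Both checks mentioned here, whether two confidence intervals overlap and whether the confidence interval for the difference includes zero, can be sketched as follows. The example assumes 95% intervals built from the normal approximation (a multiplier of about 1.96), which suits reasonably large samples; small samples would normally use a t-based multiplier instead. The scores and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

Z_95 = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval

def mean_ci(scores):
    """Approximate 95% confidence interval for a mean."""
    centre = mean(scores)
    half_width = Z_95 * stdev(scores) / sqrt(len(scores))
    return centre - half_width, centre + half_width

def difference_ci(scores_a, scores_b):
    """Approximate 95% confidence interval for the difference between two means."""
    diff = mean(scores_a) - mean(scores_b)
    se = sqrt(stdev(scores_a) ** 2 / len(scores_a)
              + stdev(scores_b) ** 2 / len(scores_b))
    return diff - Z_95 * se, diff + Z_95 * se

def intervals_overlap(ci_a, ci_b):
    """True when two (low, high) intervals share at least one value."""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

# Hypothetical scores from two survey waves (illustrative data only).
last_year = [3.8, 4.1, 3.5, 4.0, 3.9, 3.6, 4.2, 3.7, 3.9, 4.0]
this_year = [4.2, 4.4, 3.9, 4.5, 4.1, 4.3, 4.0, 4.6, 4.2, 4.3]

ci_last, ci_this = mean_ci(last_year), mean_ci(this_year)
low, high = difference_ci(this_year, last_year)

print("intervals overlap:", intervals_overlap(ci_last, ci_this))
print("difference interval excludes zero:", low > 0 or high < 0)
```

The two checks do not always agree, which is one reason to look at the interval for the difference directly rather than relying only on whether the two separate intervals overlap.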



