You can't find small effects with coarse measures
Suppose you want to do a regression to see if black players in MLB are underpaid.
How might you do that? Well, you might take everything you can think of, throw it into a dataset along with a "player is black" dummy variable, and do a regression to predict salary. If the dummy comes up significant, you've proven there's a difference between black players and non-black players.
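As a sketch of what that regression setup looks like (all data here is simulated and the numbers are invented, not from any real MLB dataset), using plain numpy least squares:

```python
# A toy version of the regression described above. Salaries are simulated
# from productivity alone, so the dummy coefficient should come out near zero.
import numpy as np

rng = np.random.default_rng(0)
n = 500

productivity = rng.normal(0.33, 0.05, n)        # some productivity measure
is_black = rng.integers(0, 2, n).astype(float)  # "player is black" dummy

# Simulated salaries driven only by productivity: no racial effect built in.
salary = 1.0 + 20.0 * productivity + rng.normal(0.0, 1.0, n)

# Design matrix: intercept, productivity measure, dummy variable.
X = np.column_stack([np.ones(n), productivity, is_black])
coefs, *_ = np.linalg.lstsq(X, salary, rcond=None)

dummy_coef = coefs[2]  # near zero here, since we built in no effect
```

With an unbiased productivity measure, the dummy behaves itself. The trouble starts when the measure is biased, as below.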
But ... there's a problem with that. You've shown a significant difference between black players and white players, but you don't really know that it's because of the player's skin color. If black players and white players differ in certain ways, it could be that your model is simply biased with respect to those differences.
For instance ... you probably had a measure of player productivity in your regression. Suppose you used, say, "productivity = (TB+BB+SB)/PA". That's obviously not accurate: it treats a stolen base as equal to a single, and doesn't include CS, which correlates highly with SB. So, it's going to be biased too high for players who steal a lot of bases.
So if black players steal more bases than white players -- which I bet they do -- your measure will overrate the productivity of black players, and will incorrectly conclude that they're underpaid.
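To see the size of that bias concretely, here's a back-of-the-envelope comparison with two made-up stat lines (the run values in the comments are rough, standard linear-weights figures):

```python
# Player B is identical to Player A except for 60 stolen bases, which
# (TB+BB+SB)/PA credits as if they were 60 extra singles.

def coarse_productivity(tb, bb, sb, pa):
    return (tb + bb + sb) / pa

a = coarse_productivity(tb=250, bb=50, sb=0, pa=600)   # 0.500
b = coarse_productivity(tb=250, bb=50, sb=60, pa=600)  # 0.600

# By typical linear weights, a steal is worth roughly 0.2 runs while a
# single is worth roughly 0.45 to 0.5, so the 0.100 gap the measure shows
# between B and A substantially overstates their real difference.
gap = b - a
```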
I've made my example a particularly egregious one, but this happens all the time to a lesser extent. If your regression is trying to relate salary to productivity, you have to be able to accurately measure both salary and productivity. Salary is easy -- it's just the amount, in dollars. Productivity is hard.
Sabermetricians have been trying to measure batter productivity for ... well, forever. We've got Total Average, and Runs Created, and Base Runs, and Linear Weights, and Extrapolated Runs, and so on. None of them is perfect. All of them have certain biases. (Self-promotion: this .pdf has an article where I investigate some of them, and here's a related post.) Bill James himself noted that Runs Created overestimates for the best hitters.
If we're so limited in our ability to measure productivity in the first place, how can anyone possibly think we're able to measure *very small differences* in productivity, like racial bias, or clutch vs. non-clutch, or walk year vs. non-walk-year?
We can't. Statistical significance gives you the illusion we can, but we can't.
Look, suppose you do a study, and you find that black free agents are underpaid by, say, $100K a year. Even if it's statistically significant, $100,000 is only about one-fifth of a run at free-agent prices of roughly $500,000 per run. How can you say, with any kind of confidence, that you've found an effect of 0.2 runs, when your measure of productivity is almost certainly biased by a lot more than that?
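Here's a simulation of how that plays out (all the numbers are invented): salaries contain zero racial effect, but the productivity measure overcredits stolen bases, and one group steals more. The regression still "finds" an underpayment.

```python
# Simulated data: salaries have NO racial effect built in, but the
# productivity measure is biased upward for base stealers.
import numpy as np

rng = np.random.default_rng(1)
n = 2000

is_black = rng.integers(0, 2, n).astype(float)
true_value = rng.normal(0.0, 1.0, n)
steals = rng.poisson(5 + 15 * is_black)  # assumed: one group steals more

# Salary depends only on true value -- zero discrimination.
salary = 5.0 + 2.0 * true_value + rng.normal(0.0, 0.5, n)

# The measure overcredits steals, so it overrates the high-steal group.
measured = true_value + 0.02 * steals

X = np.column_stack([np.ones(n), measured, is_black])
coefs, *_ = np.linalg.lstsq(X, salary, rcond=None)

dummy_coef = coefs[2]  # clearly negative: a phantom "underpayment"
```

The dummy coefficient here isn't picking up discrimination; it's picking up the gap between measured productivity and true productivity for the group that steals more.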
How can you find a 1-gram difference in weight, when your scale might be biased by a full kilogram?
Now, someone might argue: "I agree that many measures of productivity are biased. That's why I didn't use one. So as not to preselect a biased measure, I put all the components of hitting into the dataset, and let the *regression* pick the best measure!"
That helps, but it isn't enough. Because, what if the relationship isn't linear? Like, playing time. If you're good, you play full-time. If you're bad, you play zero. In the middle ... well, you play part time, and *maybe* that part is linear, and maybe it's not. But, overall, you've got an s-curve, not a straight line. So your estimates will be biased too high for the best players, and too low for the worst. (See the previous post.)
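Here's the s-curve problem in miniature (the logistic curve and the numbers are invented for illustration): fit a straight line to playing time that actually follows an s-curve, and the line misses systematically at both ends.

```python
# An s-curve (logistic) relationship between talent and playing time,
# fit with a straight line, the way a linear regression would.
import numpy as np

talent = np.linspace(-3.0, 3.0, 200)
playing_time = 650.0 / (1.0 + np.exp(-2.0 * talent))  # ~0 PA up to full-time

# Fit a line anyway.
X = np.column_stack([np.ones_like(talent), talent])
coefs, *_ = np.linalg.lstsq(X, playing_time, rcond=None)
fitted = X @ coefs

# The line overshoots at both ends: it predicts too much playing time for
# the best players and too little (even negative) for the worst.
top_error = fitted[-1] - playing_time[-1]   # positive: overestimates the best
bottom_error = fitted[0] - playing_time[0]  # negative: underestimates the worst
```

The errors aren't random noise; they're systematic, and they'll get soaked up by whatever other variables happen to correlate with being at the extremes.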
I'd argue that, if you find a small effect that you think is real, you need to prove that your model is good enough that what you've found is an actual effect, and not just measurement error from an arbitrary linear model.
I don't think that's too hard to do ... I'm going to try to put together an example for a followup post.