This post comes from a real-life question of a scientist friend of mine, but I’ve abstracted out all the science-y stuff. Rest assured that no cure for cancer or discovery of alien life is resting on the answer — rather, it’s just something I’m curious about. Also, in case it doesn’t become clear from reading what follows, I am an expert neither in the experimental side here nor in the statistical side.
Consider the following experimental situation: I have three data samples called A, B and C. Let’s say, for example, that each of A, B and C is a 1000-dimensional real vector whose entries are drawn independently from the same distribution (maybe standard normal, maybe something else; or maybe actually for each i I have a normal distribution whose standard deviation is itself a random variable), but where we can have dependencies between, for example, entries of A and entries of B. (Also, possibly “in real life” I actually have a small number of copies of type A, B and C and in what follows I am actually using the means of each type.) I hypothesize (for reasons related to science) that in some sense B and C should be “closer” to each other than either is to A. I would like to perform a statistical test to confirm this expectation, and also possibly to identify the coordinates that best distinguish A from the other two sets. (In this case, it’s important to me that by “coordinates” I mean “coordinates in the usual basis in which the vectors are given”; I also can separately run a PCA and extract information that way, but this is not what I’m interested in here.) In particular, I’d like to do something like the following:
- First, I want to split the coordinates into two halves, those for which A is bigger than B and those for which B is bigger than A. Actually, because my data is quite noisy and many coordinates are not really meaningful, I probably take a nonnegative number $\varepsilon$ and only look at those indices i such that $|A_i - B_i| > \varepsilon$.
- Then, I look at these coordinates and see whether they are larger in A or larger in C.
- At this point, I run a chi-square test for independence, which spits out a tiny p-value (probably far below any conventional significance threshold), and so I assert with high confidence that differences between A and B are not independent of differences between A and C, and consequently that B and C are more similar to each other than either is to A.
The problem is that the last line of reasoning is complete garbage. The chi-square test has resoundingly rejected the null hypothesis that differences between A and B are independent from differences between A and C, but this is an idiotic null hypothesis: both of these differences are dependent on the actual vector A, so of course they aren’t independent. And, indeed, running this test on randomly generated data will reject the null hypothesis here with impressive-looking p-values every single time.
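To make the failure concrete, here is a minimal sketch of the procedure run on data where A, B and C are generated completely independently of one another, so there is no real structure to find. The function names (`sign_table`, `chi_square_2x2`, `demo`), the choice of standard normal entries, and the default threshold of zero are all assumptions made for illustration. The p-value uses the fact that a chi-square variable with one degree of freedom is the square of a standard normal.

```python
import math
import random


def sign_table(A, B, C, eps=0.0):
    """2x2 table of sign(A_i - B_i) against sign(A_i - C_i),
    restricted to the indices with |A_i - B_i| > eps."""
    table = [[0, 0], [0, 0]]
    for a, b, c in zip(A, B, C):
        if abs(a - b) > eps:
            table[0 if a > b else 1][0 if a > c else 1] += 1
    return table


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = P(Z^2 > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def demo(seed, n=1000, eps=0.0):
    """Run the test on three mutually independent standard normal vectors."""
    rng = random.Random(seed)
    A = [rng.gauss(0, 1) for _ in range(n)]
    B = [rng.gauss(0, 1) for _ in range(n)]
    C = [rng.gauss(0, 1) for _ in range(n)]
    return chi_square_2x2(sign_table(A, B, C, eps))[1]


for seed in range(5):
    print(seed, demo(seed))
```

Every run should report a very small p-value even though B and C are no closer to each other than to A, which is exactly the problem: the shared dependence on A is what the test detects.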
My question is, what do I do instead of a conventional chi-square test for independence so as to reject a more meaningful null hypothesis? One possible technique is to use synthetic data. For example, if the entries are drawn independently and uniformly from some interval and the threshold $\varepsilon$ is $0$, then it’s easy to see that the typical outcome is that two-thirds of the coordinates selected by $A_i > B_i$ should also satisfy $A_i > C_i$: all six orderings of $A_i$, $B_i$ and $C_i$ are equally likely, $A_i > B_i$ holds in three of them, and $A_i$ is the largest in two of those three. In 20,000 or so trials on my computer, the actual fraction of coordinates with this property is strictly between 0.656 and 0.676 every single time (so there’s a strong central limit tendency). This suggests a non-idiotic p-value: if the fraction in my real data falls well outside that range, then I really should be rejecting in this case.
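The synthetic-data idea can be sketched as a small Monte Carlo experiment. This is an illustration under stated assumptions, not the code I actually ran: entries are independent and uniform on [0, 1], coordinates are selected by $|A_i - B_i| > \varepsilon$, and "this property" is taken to mean that $A_i - B_i$ and $A_i - C_i$ have the same sign (which also has probability two-thirds when $\varepsilon = 0$). The function names, the vector length, the trial count and the observed value 0.80 are all hypothetical. The spread of the simulated fractions depends strongly on the vector length, so the range printed here need not match the one quoted above.

```python
import random


def concordant_fraction(n, eps, rng):
    """Among coordinates with |A_i - B_i| > eps, the fraction for which
    A_i - B_i and A_i - C_i have the same sign."""
    selected = agree = 0
    for _ in range(n):
        a, b, c = rng.random(), rng.random(), rng.random()
        if abs(a - b) > eps:
            selected += 1
            agree += (a > b) == (a > c)
    if selected == 0:
        raise ValueError("no coordinates selected; eps is too large")
    return agree / selected


def null_fractions(n, eps, trials, seed=0):
    """Simulate the fraction repeatedly under the 'no structure' null."""
    rng = random.Random(seed)
    return [concordant_fraction(n, eps, rng) for _ in range(trials)]


def empirical_p_value(observed, null, center):
    """Two-sided Monte Carlo p-value with the usual add-one correction."""
    extreme = sum(abs(f - center) >= abs(observed - center) for f in null)
    return (extreme + 1) / (len(null) + 1)


null = null_fractions(n=1000, eps=0.0, trials=1000)
center = sum(null) / len(null)
print("min, mean, max:", min(null), center, max(null))

observed = 0.80  # hypothetical fraction measured on real data
print("empirical p-value:", empirical_p_value(observed, null, center))
```

The add-one correction keeps the estimate from ever being exactly zero: with T simulated trials, the smallest p-value this approach can report is 1/(T + 1).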
I could easily repeat this experiment for any particular vector length, distribution of the entries, and choice of the threshold $\varepsilon$. However, this is clearly limiting in the sense that any time I wanted an answer I would have to do a bunch of statistics to decide on the right distributions to use before going and coding everything up. It would be more interesting to me if there were some known statistical tests or mathematical theorems that would allow me to say something without going through that trouble.
It also seems likely to me that in the uniform distribution case we can actually solve this problem analytically. Anyone want to give it a try?
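As a starting point, here is a sketch of how that calculation might go. It assumes the entries of A, B and C are mutually independent and uniform on $[0, 1]$, and that the threshold, written $\varepsilon$ here, satisfies $0 \le \varepsilon < 1$. Conditioning on $A_i = a$ and integrating over $a$:

$$
\begin{aligned}
P(A_i - B_i > \varepsilon) &= \int_\varepsilon^1 (a - \varepsilon)\,da = \frac{(1-\varepsilon)^2}{2}, \\
P(A_i - B_i > \varepsilon,\ A_i > C_i) &= \int_\varepsilon^1 (a - \varepsilon)\,a\,da = \frac{1}{3} - \frac{\varepsilon}{2} + \frac{\varepsilon^3}{6} = \frac{(1-\varepsilon)^2(2+\varepsilon)}{6}, \\
P(A_i > C_i \mid A_i - B_i > \varepsilon) &= \frac{2+\varepsilon}{3}.
\end{aligned}
$$

The substitution $x \mapsto 1 - x$ gives the same value for $P(A_i < C_i \mid B_i - A_i > \varepsilon)$. Since the coordinates are independent, if $N$ coordinates are selected then the number on which the two differences agree in sign is binomial with parameters $N$ and $p = (2+\varepsilon)/3$, so the fraction has mean $p$ and standard deviation $\sqrt{(2+\varepsilon)(1-\varepsilon)/(9N)}$. Setting $\varepsilon = 0$ recovers the two-thirds figure, and for a general interval of length $\ell$ the same formulas should hold with $\varepsilon$ replaced by $\varepsilon/\ell$.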