Statistical Significance

This topic is closed.

  • #16
    And if you are going to hold my statement to me: people have even complained that JUST reaching 300 skill (no AAs or mods) made them worse...

    I really think it is a perception thing too. Either people are not paying as much attention while they are GAINING skill... or there is the perception that once they have maxed they should then be perfect.
    Ngreth Thergn

    Ngreth nice Ogre. Ngreth not eat you. Well.... Ngreth not eat you if you still wiggle!
    Grandmaster Smith 250
    Master Tailor 200
    Ogres not dumb - we not lose entire city to froggies



    • #17
      I will also try to dig up his comment on streaks... he had one really bad streak in there...
      Ngreth Thergn

      Ngreth nice Ogre. Ngreth not eat you. Well.... Ngreth not eat you if you still wiggle!
      Grandmaster Smith 250
      Master Tailor 200
      Ogres not dumb - we not lose entire city to froggies



      • #18
        Remember that at the 95% significance level, if you run a test 100 times, about 5 of them will fall outside the expected range even when the hypothesis is true. So while we reject the hypothesis that the sample was drawn from a distribution with the hypothesized mean, we do not prove or disprove anything. Things that are 5% likely to happen (like a sample falling outside of a 95% confidence interval while the true mean is in fact as hypothesized) happen, roughly, 5% of the time.
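To put a number on "the expected range", here is a minimal sketch using the normal approximation to the binomial. The run length of 1000 combines is an assumption made for illustration, and `success_rate_interval` is a hypothetical helper name, not something from the game or the testers.

```python
import math
from typing import Tuple

def success_rate_interval(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Range the observed success rate should land in about 19 times out of 20,
    using the normal approximation to the binomial."""
    radius = z * math.sqrt(p * (1.0 - p) / n)
    return p - radius, p + radius

# Assumed run: 1000 combines against a hypothesized 95% success cap.
low, high = success_rate_interval(0.95, 1000)
print(f"expected range: {low:.4f} to {high:.4f}")
```

Under those assumptions the range is roughly 93.6% to 96.4%, so a run can land a bit above 96% without contradicting a 95% cap.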

        Second EDIT: When faced with a result outside the confidence interval, it's important to think about causation. Do we really think the bug is that the hard-cap on 95% success is broken, but that getting a better trophy fixes it? That's a really strange hypothesis. Given how solid the 95% hard-cap is, and how far away from the conditions which trigger the loosening of that cap (I think if you are 40 points above trivial, unmodified, the hard cap moves to 96%), all the more reason to suspect we're seeing one of the 5 times (or is it ten, did you use a 90% confidence interval) in a hundred that **** happens.

        EDIT: Also, importantly -- as was stated but has been glossed over, what Ngreth's tests DO NOT show, is that the samples are drawn from distributions with different means. To test that, we'd need to see non-overlapping confidence intervals, and I am certain that the 966 sample is NOT statistically significantly different from the other samples. It may be outside the range of expected values for the hypothesized 95% success rate, but that is an easier test to fail than testing whether both samples are drawn from a distribution with a common mean.
        Last edited by andyhre; 06-08-2006, 11:56 AM.
        Andyhre playing Guiscard, 78th-level Ranger, E`ci (Tunare)
        Master Artisan (2100 Club), Wielder of the Fully Functional Artisan's Charm, Proud carrier of the 8th shawl


        with occasion to call upon Gnomedeguerre, 16th-level Wizard, Master Tinker, E`ci (Tunare)


        and in shouting range of Vassl Ofguiscard, 73rd-level Enchanter, GM Jewelcrafter, E`ci (Tunare)



        • #19
          I have done lots and lots of weapon and mitigation parsing in EQ, and I quickly realised that for those purposes I had to have at least 10,000 hits before it began to be even remotely useful.

          I have also gotten 300 in 6 tradeskills in the last month, where I needed between 6 and 125 skillups in each, and in all of them there were skillups that took ages and some that came after 2 combines. So for something like this to be accurate, more than 1k combines is needed, but 1k combines is also plenty to prove that those who claim that getting a +15% mod makes you fail more are wrong. Which is all the ogur wanted to show, I think.



          • #20
            Well, here you go

            To really do this right it would have to be like 100000 combines or some other ridiculous number which is just NOT going to happen [...]but realistically... we all know how fickle random numbers are.

            And realistically, we all know how prone the EQ code is to unintended side effects and second order differences between plan and implementation.

            Perhaps the tester got consistent randomness in 3 tests out of 4. Perhaps something else is boggling the mix. The question is, as the engineer in charge, do you care enough to know for sure?

            That's a subjective question in this case. Perhaps it is truly not worth further inquiry, and there are bigger fish to fry.

            Still, it is no big effort to have a tester set up a macro machine to combine to any desired confidence interval. I expect that the Sony QA shop already has tools to do that. It won't take very long. You can easily run 750 combines an hour (12000 over one night; 47000 over a weekend) if you summon the ingredients with GM powers.
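As a rough sketch of what "combine to any desired confidence interval" costs, the snippet below inverts the normal-approximation radius to get a run length. The 95% success rate and the 750 combines per hour come from this thread; the target radius of half a percentage point and the helper name `combines_needed` are assumptions for illustration.

```python
import math

def combines_needed(p: float, radius: float, z: float = 1.96) -> int:
    """Smallest number of combines whose 19-times-out-of-20 confidence radius
    is at most `radius`, using the normal approximation to the binomial."""
    return math.ceil((z / radius) ** 2 * p * (1.0 - p))

# Assumed target: pin the success rate down to within +/- 0.5 percentage points.
n = combines_needed(0.95, 0.005)
hours = n / 750  # combines per hour, as estimated in the post above
print(f"{n} combines, about {hours:.1f} hours")
```

Because the radius shrinks with the square root of the run length, halving the radius takes about four times as many combines.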



            • #21
              Originally posted by thrashette
              To really do this right it would have to be like 100000 combines or some other ridiculous number which is just NOT going to happen [...]but realistically... we all know how fickle random numbers are.

              And realistically, we all know how prone the EQ code is to unintended side effects and second order differences between plan and implementation.

              Perhaps the tester got consistent randomness in 3 tests out of 4. Perhaps something else is boggling the mix. The question is, as the engineer in charge, do you care enough to know for sure?

              That's a subjective question in this case. Perhaps it is truly not worth further inquiry, and there are bigger fish to fry.

              Still, it is no big effort to have a tester set up a macro machine to combine to any desired confidence interval. I expect that the Sony QA shop already has tools to do that. It won't take very long. You can easily run 750 combines an hour (12000 over one night; 47000 over a weekend) if you summon the ingredients with GM powers.
              Actually, they do not use macros.

              Macros would not emulate what a player does (well at least not what they are supposed to do)
              Ngreth Thergn

              Ngreth nice Ogre. Ngreth not eat you. Well.... Ngreth not eat you if you still wiggle!
              Grandmaster Smith 250
              Master Tailor 200
              Ogres not dumb - we not lose entire city to froggies



              • #22
                Originally posted by Ngreth Thergn
                actually. they do not use macros.

                Macros would not emulate what a player does (well at least not what they are supposed to do)
                Ack! So some people sat down and did 4000 combines! o.0!
                Kyroskrane tells you, 'AwwoooOOOOooAaawwaa!'



                • #23
                  ... or 10 people did 400 combines each -- probably just about possible without incurring a risk of RSI.
                  Gaell Stormracer, Storm Warden of Tunare, United Kingdoms, Antonius Bayle



                  • #24
                    That's possible but if so would invalidate any claims about statistical significance as that would be adding new variables to the mix.



                    • #25
                      I was using "19 times out of 20" confidence intervals whenever I talked about confidence intervals or radii. I.e., the chance that a sample is above the top of the confidence interval is 2.5%, the chance that it is below the bottom is 2.5%, and the chance that it is anywhere outside is 5%.
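For reference, a small sketch of where the 1.96 multiplier used later in this post comes from: it is the point on the standard normal curve that leaves 2.5% in each tail. This assumes the normal approximation is adequate for runs of this size.

```python
from statistics import NormalDist

# "19 times out of 20" leaves 5% outside the interval, split evenly between tails.
tail = 0.05 / 2
z = NormalDist().inv_cdf(1.0 - tail)
print(f"z multiplier for a two-sided 95% interval: {z:.3f}")
```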


                      I will admit that Ngreth showed "no human being could detect the difference between the success rates on a max_success combine with or without a trophy equipped", assuming those people haven't run into some really strange situation. It is far, far more likely that those people just ran into an unlucky streak and tied themselves up with superstition.

                      In addition, there are 4 tests there -- that makes 6 pairs of tests. So something unlikely is roughly 6 times as likely to show up somewhere among the pairs as it is in a single test.

                      A 1 in 20 event that has 6 chances to happen is more like a 3 in 10 event. ^_^
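A quick sketch of that arithmetic. It assumes the 6 pairwise comparisons behave as independent 1-in-20 chances, which is only approximately true because the pairs share samples.

```python
import math

alpha = 0.05                 # chance of a "1 in 20" fluke on a single comparison
pairs = math.comb(4, 2)      # 4 tests give 6 distinct pairs

# Chance of at least one fluke, treating the comparisons as independent (an assumption).
at_least_one = 1.0 - (1.0 - alpha) ** pairs
# Simple upper bound that needs no independence assumption: just add the chances up.
upper_bound = alpha * pairs

print(f"{pairs} pairs: about {at_least_one:.3f}, at most {upper_bound:.2f}")
```

Under the independence assumption the figure is a little over 1 in 4, and the simple bound gives the "3 in 10" quoted above.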

                      In addition, for those two data points to be statistically separated, one needs a success chance at the outer bounds of the confidence interval (above 96% and somewhat close to 96.5%). The lower bound of the observed success chance (95.25%) is not high enough for any two observed samples to be separated.

                      So, on sober second thought, it seems as if the samples are not statistically separable.

                      However, we have here a 4000 sample test run that implies that the 95% success cap doesn't quite exist.

                      Originally posted by andyhre
                      Second EDIT: When faced with a result outside the confidence interval, it's important to think about causation. Do we really think the bug is that the hard-cap on 95% success is broken, but that getting a better trophy fixes it? That's a really strange hypothesis. Given how solid the 95% hard-cap is, and how far away from the conditions which trigger the loosening of that cap (I think if you are 40 points above trivial, unmodified, the hard cap moves to 96%), all the more reason to suspect we're seeing one of the 5 times (or is it ten, did you use a 90% confidence interval) in a hundred that **** happens.
                      /shrug. I just noticed that Ngreth's test run was statistically incompatible with a 95% success rate.

                      I expect that some test runs will be statistically incompatible with a 95% success rate. However, if you see such a test run, it should encourage you to check whether your hypothesis is correct.

                      Especially a somewhat unique one. If people searched through their logs and managed to find a bad luck streak -- that is probably just the law of large numbers kicking us in the rear. If SOE goes and pays employees to do a test... Well, that test doesn't have the law of large numbers on its side.

                      Originally posted by andyhre
                      EDIT: Also, importantly -- as was stated but has been glossed over, what Ngreth's tests DO NOT show, is that the samples are drawn from distributions with different means. To test that, we'd need to see non-overlapping confidence intervals, and I am certain that the 966 sample is NOT statistically significantly different from the other samples. It may be outside the range of expected values for the hypothesized 95% success rate, but that is an easier test to fail than testing whether both samples are drawn from a distribution with a common mean.
                      No, all you need is that the confidence interval of the "difference between the two random samples" does not contain 0. This is slightly different.

                      Practically, it means that if each sample has the same confidence radius "R", then if they are more than sqrt(2)*R apart they (19 times out of 20) have a different mean.

                      Technical discussion:
                      Notation:
                      If X is a random binomial variable, then X^N is the number of successes when that random variable is tested N times.

                      X^N/N is the observed success rate.

                      If Y is another random binomial variable, then Y^N/N is the observed success rate.

                      Z = X^N/N - Y^N/N is yet another random variable -- the observed difference.

                      Hypothesis: X and Y are different random variables.

                      The null hypothesis: If X and Y are the same random variable, then the expected value of Z is 0.

                      Let Conf(Z) be the expected confidence radius of Z, assuming the null hypothesis.

                      Then, if AbsoluteValue( Observed(Z) ) > Conf(Z), then it is likely that the null hypothesis is wrong.

                      Assuming the null hypothesis (Y=X) and that the two samples are independent:
                      Var(Z) = Var(X^N/N) + Var(Y^N/N) = 2*Var(X)/N
                      Conf(Z) = 1.96 * sqrt(Var(Z)) = 1.96 * sqrt(2) * sqrt(Var(X)/N)

                      Now, Conf(X^N/N) = 1.96 * sqrt(Var(X)/N), so
                      Conf(Z) = sqrt(2) * Conf(X^N/N)

                      So, you can show that two random variables are different while their confidence radii overlap. They just can't overlap that much.

                      (Intuitively, this is the mathematical description of the fact that it is "unlikely" that X will get a "low roll" while Y gets a "high roll".)
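The test described above can be sketched in a few lines. This is only an illustration: the success counts are hypothetical, both runs are assumed to have the same length N, and Var(X) is estimated from the pooled success rate of the two runs, which is one common choice rather than anything stated in this thread.

```python
import math

def differ_significantly(successes_x: int, successes_y: int, n: int, z: float = 1.96):
    """Compare two runs of n combines each: is |Observed(Z)| > Conf(Z)?"""
    observed = successes_x / n - successes_y / n       # Observed(Z)
    pooled = (successes_x + successes_y) / (2 * n)     # estimate of p under the null
    var_x = pooled * (1.0 - pooled)                    # Var(X) for a single combine
    conf_z = z * math.sqrt(2.0 * var_x / n)            # Conf(Z) = 1.96 * sqrt(2*Var(X)/N)
    return abs(observed) > conf_z, observed, conf_z

# Hypothetical counts out of 1000 combines each, not the real test data.
close_call = differ_significantly(966, 950, 1000)
clear_gap = differ_significantly(966, 940, 1000)
print(close_call)
print(clear_gap)
```

With these made-up counts, 966 against 950 does not clear the bar while 966 against 940 does, which shows how large a gap two runs of this length need before they can be told apart.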
                      --
                      I am not the Yakatizma you are looking for.
                      No, really.



                      • #26
                        ack...math...

                        somebody make it stop!
                        Master Artisan Maevenniia the Springy Sprocket Stockpiler of the really long name
                        Silky Moderator Lady
                        Beneath the silk, lies a will of steel.



                        • #27
                          For Maevenniia
                          pi=3.1415926535 8979323846 2643383279 5028841971 6939937510 5820974944 5923078164 0628620899 8628034825 3421170679...
                          Shawlweaver Sphynx on Cazic Thule
                          Master Artisan Aldier on Cazic Thule



                          • #28
                            Originally posted by Chakua
                            Can we get a recipe for Aspirin, please? This thread gave me a pounding headache.


                            .
                            Redi of Qeynos
                            Warder of Tunare
                            The Keepers: http://www.thekeepers-eq.org



                            • #29
                              Acetylsalicylic acid is aspirin, just FYI.



                              • #30
                                Originally posted by Aldier
                                For Maevenniia
                                pi=3.1415926535 8979323846 2643383279 5028841971 6939937510 5820974944 5923078164 0628620899 8628034825 3421170679...
                                Did you know 9 is the billionth digit of pi? (I'll believe the super-computer that calculated it, you can check by pencil if you like.)


                                Gorse
