Notes from the meeting with Jerome

- The General Linear Model assumes normality: the response should be normally distributed at every treatment. A count variable usually does not satisfy this, so counts are typically modeled with a Poisson model.

- Data dispersion (overdispersion): when the variance is much larger than the mean
  1. Use Negative Binomial rather than Poisson. (Generalized Linear Model in SPSS)
  2. Or take the square root of the count to remove the data dispersion.
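
A quick way to check whether the variance is much larger than the mean is to compute the variance-to-mean ratio per treatment. The sketch below is only illustrative: the counts are made up and `dispersion_ratio` is a hypothetical helper name, not something from the meeting or from SPSS. A ratio near 1 is consistent with Poisson; a ratio well above 1 suggests overdispersion.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Return sample variance / sample mean (close to 1 for Poisson-like data)."""
    return variance(counts) / mean(counts)

# Hypothetical counts for a single treatment cell, for illustration only.
cell_counts = [0, 1, 1, 2, 3, 5, 8, 13, 21, 2]

ratio = dispersion_ratio(cell_counts)
print(f"mean={mean(cell_counts):.2f} variance={variance(cell_counts):.2f} ratio={ratio:.2f}")
```

In practice the same ratio would be computed separately for each treatment, since the Poisson assumption applies within each cell.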


Note that the Poisson mean and variance are equal, so using an ordinary GLM (General Linear Model) is likely a bad idea: if the mean changes with the factor combinations, so will the variance, and homogeneity of variance will be violated. One can do a square root transformation

(See e.g. http://en.wikipedia.org/wiki/Poisson_distribution )



or otherwise use Poisson regression; see e.g.

Poisson regression and overdispersion (SPSS) at

http://www.ats.ucla.edu/stat/spss/dae/poissonreg.htm

http://www.ats.ucla.edu/stat/spss/dae/neg_binom.htm
http://www.ats.ucla.edu/stat/spss/faq/dummy.htm


Over-dispersion (variance larger than mean) seems to happen for the counts above at the 8 treatments (degree * job level = 4 * 2 = 8 levels). This might mean using negative binomial rather than Poisson regression, but a square root transformation followed by an ordinary GLM will likely give the same results: the square root transformation stabilizes the variance, making it approximately independent of the mean, so we do not need to worry about heterogeneity of variances at the 8 treatments.
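
The variance-stabilizing effect of the square root can be seen in a small simulation. This is a sketch under assumptions: the means 10, 40 and 160 are arbitrary, the data are simulated Poisson counts rather than the study data, and `poisson_sample` is a hand-written helper (Knuth's method) because the Python standard library has no Poisson sampler. For Poisson counts the variance of the raw count tracks the mean, while the variance of the square root stays roughly near 1/4 (a delta-method approximation).

```python
import math
import random
from statistics import variance

def poisson_sample(rng, lam):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k

rng = random.Random(0)  # fixed seed so the run is reproducible
results = {}
for lam in (10, 40, 160):
    draws = [poisson_sample(rng, lam) for _ in range(10000)]
    raw_var = variance(draws)
    sqrt_var = variance([math.sqrt(x) for x in draws])
    results[lam] = (raw_var, sqrt_var)
    print(f"mean={lam:>3}  var(count)={raw_var:7.2f}  var(sqrt(count))={sqrt_var:.3f}")
```

Note that this only illustrates the pure Poisson case; with overdispersed counts the square root may stabilize the variance less completely.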

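We did not use an offset in the model, but Jerome explained it. For example, if Shanghai has more high-IQ people than Yanji, that may simply be due to the difference in population, so the offset is an option that takes the population proportion into account.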
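
The idea behind the offset can be sketched with per-capita rates. All numbers below are hypothetical and chosen only for illustration. In a count model with a log link, log(E[count]) = log(population) + log(rate), where log(population) is the offset and enters with a fixed coefficient of 1, so the model effectively compares rates rather than raw counts.

```python
import math

# Hypothetical numbers, for illustration only (not real statistics).
cities = {
    "Shanghai": {"count": 2400, "population": 24_000_000},
    "Yanji": {"count": 60, "population": 500_000},
}

rates = {}
for name, d in cities.items():
    # log(population) is the offset; subtracting it leaves the log rate.
    offset = math.log(d["population"])
    log_rate = math.log(d["count"]) - offset
    rates[name] = math.exp(log_rate) * 100_000  # cases per 100,000 people
    print(f"{name}: count={d['count']}, rate per 100k={rates[name]:.1f}")
```

With these made-up numbers Shanghai has the larger raw count but the smaller rate, which is the situation the offset is meant to handle.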


We fitted a negative binomial regression model with response Count and inputs Job and Educ. We chose negative binomial rather than Poisson regression because of possible over-dispersion, as seems indicated by the summary statistics at the 8 treatments (see above).
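
To see what the negative binomial adds over Poisson, one common parameterization (NB2) writes the variance as Var = mu + alpha * mu^2, so alpha = 0 recovers the Poisson case. The sketch below computes a rough method-of-moments estimate of alpha per cell. It is not the fitted model from the meeting: the counts and cell labels are hypothetical, `nb_alpha_moments` is an illustrative helper, and a real negative binomial regression would estimate alpha by maximum likelihood together with the coefficients.

```python
from statistics import mean, variance

def nb_alpha_moments(counts):
    """Method-of-moments estimate of alpha in Var = mu + alpha * mu**2.

    Returns 0.0 when the variance does not exceed the mean (no overdispersion).
    """
    mu = mean(counts)
    excess = variance(counts) - mu
    return max(excess, 0.0) / mu**2

# Hypothetical counts for two of the 8 Job x Educ cells (illustration only).
cells = {
    ("job1", "educ1"): [2, 0, 7, 1, 12, 3, 0, 9],
    ("job2", "educ4"): [3, 4, 2, 5, 3, 4, 3, 4],
}
alphas = {cell: nb_alpha_moments(c) for cell, c in cells.items()}
for cell, a in alphas.items():
    print(cell, f"alpha={a:.3f}")
```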


- Problems with big data samples

Caveat about large sample sizes:

http://blog.minitab.com/blog/statistics-and-quality-data-analysis/large-samples-too-much-of-a-good-thing
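
One well-known issue with very large samples is that even a negligible effect becomes statistically significant. The sketch below illustrates this with a two-sample z-test under simplifying assumptions that are mine, not from the meeting: known unit variance, equal group sizes, and an observed difference exactly equal to a hypothetical tiny effect of 0.02 standard deviations.

```python
import math
from statistics import NormalDist

def two_sample_p_value(effect_sd, n_per_group):
    """Two-sided z-test p-value for a mean difference of `effect_sd` standard
    deviations, assuming known unit variance and equal group sizes."""
    z = effect_sd / math.sqrt(2.0 / n_per_group)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

tiny_effect = 0.02  # hypothetical: 2% of a standard deviation
p_values = {n: two_sample_p_value(tiny_effect, n) for n in (100, 10_000, 1_000_000)}
for n, p in p_values.items():
    print(f"n per group={n:>9,}  p-value={p:.3g}")
```

The effect size is identical in all three rows; only the sample size changes, which is why the p-value alone says little about practical importance.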


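He teaches statistics in a way that makes it easy to approach, and above all there is no sense of distance at all. Whenever I send an email and drop by his office, it feels like he welcomes me gladly. This is my first class abroad, but I think I have learned a lot, and I should try to apply it well to my thesis.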