Quantcast
Channel: Noodling in the data stream
Viewing all articles
Browse latest Browse all 48

How many zeroes in that Poisson?

$
0
0

I have a data set, and some of the variables there are counts of a given event.

Four count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking if what you see there really is close enough to a Poisson process.

You could check whether the variance is more or less equal to the mean, but with real-life data you can bet that there will be a difference between the two, and you'll be left scratching your head as to whether it's too big for Poisson, or just about right to pass the smell test.

Another thing you can do is check whether the count variable shows the right number of zeroes. In a Poisson distribution, the marginal probability of a zero outcome is exp(-mean). If the proportion of zeroes that you see is a lot higher than this value, and it usually is when you're looking at counts of rare events, then you will have to consider a zero-inflated poisson or a finite-mixture model, as discussed with wonderful clarity in chapter 17 of Microeconometrics using Stata by Cameron and Trivedi.

That brings me to the immediate cause for this post. I thought I'd code up a quick program to check those zeroes for a given data set and count variable, and I did this:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den r(N)
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num r(N)
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Then I ran the thing, and it kept turning up a share of 1.0, that is 100% zeroes observed, no matter the data set or the variable of interest y. You know why? Because local den r(N) will evaluate to r(N), and that will be filled in by the last command that returns such a thing before `den' is invoked. That command is sum `y'. The same thing happens to `num'. So I took the ratio of the same number. The returned values from the calls to count that I had made right before defining both local num and local den were quietly obliterated. Isn't that a sneaky bug? The correct code is below:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den `r(N)'
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num `r(N)'
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Now `den' stores the actual value returned by the count before it was defined as local den `r(N)', as intended. Mind your apostrophes, is all I'm saying.

Erratum: actually, that's not all that I should have said. As Nick Cox observes in the comments below, you should mind your equal signs too.


Viewing all articles
Browse latest Browse all 48

Trending Articles