Data visualization, Budapest Ethnography Museum

April 19, 2011, 7:29 am

Goods on the Pest market and their origins around 1736

Big circles designate villages that sent goods to the market. Small circles designate other villages. Double lines mean postal roads, single lines mean other public roads. The symbols on the list mean (in this order): cattle, pigs, poultry, grains (autumn, spring), bread, fruit, melons, cabbage, root vegetables, butter, wood, hay.

If anybody reading this ever goes to the Budapest Ethnography Museum, runs across this poster, and takes a better picture, I would be grateful for a copy. My cell phone photography skills are nothing to brag about.

↧

Supply chain magic

May 16, 2011, 10:32 am

≫ Next: Building a repeater bridge

≪ Previous: Data visualization, Budapest Ethnography Museum

$\begin{tabular}{rll} \hline \hline \multicolumn{3}{ c }{New computer on its way home: halfway across the world in one weekend} \\ \hline {\bf Date/Time} & {\bf Activity} & {\bf Location} \\ 5/16/2011 7:34 & On FedEx vehicle for delivery & DURHAM, NC \\ 5/16/2011 6:26 & At local FedEx facility & DURHAM, NC \\ 5/15/2011 22:10 & In transit & RALEIGH, NC \\ 5/15/2011 16:08 & Departed FedEx location & MEMPHIS, TN \\ 5/15/2011 1:13 & Arrived at FedEx location & MEMPHIS, TN \\ 5/14/2011 15:56 & Departed FedEx location & ANCHORAGE, AK \\ 5/14/2011 13:38 & Int'l shipment release & ANCHORAGE, AK \\ 5/14/2011 8:24 & Arrived at FedEx location & ANCHORAGE, AK \\ 5/14/2011 16:22 & In transit & SHANGHAI CN \\ 5/13/2011 22:50 & Picked up & SHANGHAI CN \\ 5/13/2011 6:58 & Shipment information sent to FedEx & \\ \hline \end{tabular}$

↧

Building a repeater bridge

June 25, 2011, 2:24 pm

≫ Next: All the SAS you need

≪ Previous: Supply chain magic

I turned a 7-year old Motorola WR850G (v3) 802.11b/g router into a repeater for a dual-channel Netgear WNDR3700 that I bought a few weeks ago. Dual-channel means that the thing works over both 2.4GHz and 5GHz. More on that later. The job took some googling, but flashing an old router and installing third-party firmware turned out to be a lot less scary than I thought it would be. Here are some lessons learned:

1. If your wifi signal is suddenly weaker and spottier after years of perfect service, that doesn't mean that your old router is dying. Maybe you moved it somewhere new, for example from the middle of the basement to the side of the living room under the TV set on the main floor. If your your home office is on the top floor, you might be expecting no loss in signal strength, because the difference in elevation is now smaller. But the middle of the basement may have been closer as the stone falls than the side of the living room is. Maybe new gear in the house interferes with the router. Or maybe existing gear that you have where the TV sits interferes with the router where it is now, and it didn't when the router was in the basement. All of that sounds plausible to me.

2. If the old router still works, the new router will probably not give you any better results, especially if you place it where you moved the old router. If moving the old router back to its old place is an option, just take the new one back to the store and don't bother reading further.

3. Otherwise, your best option may be to flash the old router, put DD-WRT on it, then turn it into a repeater bridge. The bridge picks up signal from the router and the computers pick it up from the bridge. This will work if the old router is better at receiving weak signal than the computers of interest are. If this is true, then a repeater sitting in the home office can pick up signal from the new modem more reliably than the computers there can, and re-broadcast it to them.

4. There are four kinds of 802.11 wifi signal that I know of: a, b, g and n; a works in the 5GHz band; b and g are 2.4GHz; n works over both. As you go up the alphabet from a to n, higher letters are newer. The data transfer speed also grows in that direction from a maximum of 11Mbps to 300Mbps. Fancy dual-channel routers transmit both 2.4GHz and 5GHz signal. 2.4GHz carries farther. Your repeater will pick up more n-signal bars in the 2.4GHz band than it will in the 5GHz band.

5. Your router drops the data transfer speed to that of the slowest device in the network: if your laptops, smartphones, etc. are 802.11n but the wireless printer is 802.11g, then while you print all of your devices exchange data at 802.11g speed. That is a theoretical maximum of 54Mbps, which is still a lot faster than the typical internet connection, so you will notice no loss in how well YouTube works while you print. Also, since n-grade speeds resume after you're done printing, and you probably don't print that often, you probably won't want to upgrade a perfectly good wireless printer on account of its wifi capabilities alone, but read the bit below, about encryption.

6. There is another reason why you might be getting g-grade speeds from an n-capable router: n-grade speeds are not possible if you use WPA-TKIP encryption. This means that if you choose this type of encryption, the g-grade speed you get is permanent. As of this writing, you should use WPA2-AES encryption. Some gear calls it WPA-AES. It's the same thing. AES encryption is the only consumer-grade encryption that's any use, though probably not for long. Your other choices, WEP and WPA-TKIP, are demonstrably useless. Both have been cracked a long time ago. If your cable guy set up your router with a WEP numeric key, change it. In particular, if your router proposes mixed WPA-TKIP/AES encryption (it might call it WPA-TKIP/WPA2-AES) don't pick that. Choose pure AES. The encryption on the repeater should also be AES. If your g-grade wireless printer offers WPA-TKIP but not WPA2-AES encryption, update its firmware. If that does not help, you do have two options that I can think of short of buying a new printer. You could connect it to the repeater bridge by USB or Ethernet cable, and have the bridge act as a wireless print server. Or you could make the printer join the wireless network through a wireless-n USB dongle. I didn't try either of them.

7. If in the process of installing new firmware your old router seems to be hanging in the "reboot" mode, with the power LED glowing red, and that's going on for a very long time, maybe you bricked your router. Then again, maybe you didn't. It happened to me, and when after about 10 minutes I still didn't see all the lit LED's settle into nice solid green, I gave up and unplugged the thing. When I plugged it back in, I was greeted by a working DD-WRT machine. Good surprises happen.

↧

All the SAS you need

July 1, 2011, 12:12 pm

≫ Next: Factors in Stata and R

≪ Previous: Building a repeater bridge

You may find yourself on a job where people use SAS, but you would rather use Stata. If you have both SAS and Stata installed on your computer, you can simply put Dan Blanchette's usesas to work. That's all you need.

If you don't have SAS installed, make your colleagues' lives easier and give them a script that lets them send data from SAS format to csv in bulk. Below is an example. You could call it gimmedata.sas.

The script starts by defining some lists of stubs for file names that follow the pattern "stub_2011mmdd.sas7bdat". You can alter these lists to suit your situation. It then loops through them and turns SAS data sets into .csv files:


%let theuser  =Gabi;
%let thepath  =C:/Users/&theuser/myproject/;
%let datain   =&thepath.sas_data/;
%let csvout   =&thepath.sas_to_csv/;
%let stubs    =dw dw_mean hm matrix pop fore;
%let days05   =05 13 20 27; /* dates in May */
%let days06   =03 09 16 22; /* dates in June */

libname  mylib "&datain";
title;

*options mcompilenote=all; /* default is none */
*options mprint mlogic; 

/* check if a data set exists, send it out as csv */
%macro getfile(stub,month,day);
	%local thefile;
	%let thefile=&stub._2011&month.&day;
	%if %sysfunc(exist(mylib.&thefile.)) %then %do;
		PROC EXPORT DATA=mylib.&thefile.
        	OUTFILE="&csvout.&thefile..csv"
        	DBMS=CSV REPLACE;
    		PUTNAMES=YES;
		RUN;
	%end;
	%else %put Data set &thefile does not exist.;
%mend getfile;

/* files come in every week, at dates listed
in &&days&month. so, run the %getfile macro
through all of those dates */
%macro getfiles(stub,month);
	%local thelist;
	%local count;
	%let thelist=&&days&month; /* going through days this month */
	%let count=0;
	%do %while(%qscan(&thelist.,&count.+1,%str( )) ne %str());
		%let count=%eval(&count.+1);
		%let day=%scan(&thelist.,&count.);
		%getfile(&stub,&month,&day);
	%end;
%mend getfiles;

/* weekly files come with names that start
with the stubs listed in &stubs. so, run
getmonth through all those stubs */
%macro getmonth(month);
	%local thelist;
	%local count;
	%let thelist=&stubs; /* going through file name stubs */
	%let count=0;
	%do %while(%qscan(&thelist.,&count.+1,%str( )) ne %str());
		%let count=%eval(&count.+1);
		%let stub=%scan(&thelist.,&count.);
		%getfiles(&stub,&month);
	%end;
%mend getmonth;

/* now call your macros */
%getmonth(05);
%getmonth(06);

↧

Factors in Stata and R

August 22, 2011, 12:16 pm

≫ Next: How I rooted my Nook Color

≪ Previous: All the SAS you need

The quick version of this post goes like this:
-- # in Stata is : in R
-- ## in Stata is * in R.

The long version is that both Stata and R handle very nicely factor variables in regression models. If you want a full-factorial interaction between a factor variable x1 and a continuous variable x2, the Stata way is to say


regress y i.x1##c.x2

whereas the R way is to say


lm(y~factor(x1)*x2)

Now, if you just want to interact x1 with the slope of x2, the Stata way becomes


regress y i.x1#c.x2

whereas the R way becomes


lm(y~factor(x1):x2)

That's all.

↧

How I rooted my Nook Color

August 25, 2011, 12:25 am

≫ Next: Stata 12 with MacVim

≪ Previous: Factors in Stata and R

The job takes one microSD card and two files that you download from the Internet: a CWR .img file, that you use to make the microSD card bootable; and a ManualNooter .zip file, which you do not unzip; you simply copy it onto your bootable microSD.

You will google this, and instructions of different vintage will reference different version numbers for these two files. As of this writing, for a Nook Color that came in a box with a blue dot on it, you need the 3.2.0.1 (eyeballer) version of the CWR, and the 4.6.16 version of the ManualNooter. Further instructions are here and here.

I used the 1G microSD card from my decommissioned BlackBerry Pearl, an embarrassment of a smartphone if there ever was one. The card is over three years old and it's class 4, but it worked. You will google this, and you will read that it's best if the card is class 6 or better. I'm sure that's right.

I used a Mac to make this card bootable. There are instructions for this all over Google. I liked best the ones here. They came with a screenshot of the terminal.

The Mac terminal takes a bit of typing, especially if you saved the CWM .img file somewhere awkward, like /Users/you/Documents/root_the_nook. But if you locate the .img file with the finder, you can drag it and drop it right into the command line. Who knew?

My first attempt looked alright, but any Market download I tried failed. That's when I switched from CWR 3.0.2.8+ManualNooter 4.5.2 to CWR 3.2.0.1+ManualNooter 4.6.16 -- because the more you fail, the more you google. The second try fared worse than the first: Market shut down as soon as I tried to open it, and so did Gmail.

Before my third attempt I de-registered the NC and put it back in pristine factory condition using instructions I googled further and found here. If you want to do as I did, scroll down to the post by Colchiro that mentions "wipe Davlik cache". Do as it says.

Then I re-registered the Nook, and proceeded with the rooting once again. Three was the charm. Now I have an Android Froyo tablet. I have no idea what I'll do with it. I installed a doodling app; Kate might like it.

General lesson learned: forget about banks. I've seen too big to fail, and it's Google. If it shut down tomorrow, the world would end.

↧

Stata 12 with MacVim

September 14, 2011, 7:30 am

≫ Next: From Stata to Google Maps

≪ Previous: How I rooted my Nook Color

I used to run Stata 10 with Vim on Windows. Now I run Stata 12 with MacVim.

In Windows, there is a nice way to integrate Stata and Vim based on the work of Friedrich Huebler and Dimitriy Masterov.

A fairly straightforward combination of bash scripts, Vim functions and Applescript calls can achieve the same behavior in MacVim. I got it to work last night. Thank you, Phil.

↧

From Stata to Google Maps

October 7, 2011, 8:02 am

≫ Next: On work

≪ Previous: Stata 12 with MacVim

At the Stata command line, type "findit geocode". You will turn up a command that matches physical addresses with latitude and longitude coordinates using the Google Maps API.

Then if you type "findit writekml" you will turn up my first contribution to the SSC: a command that writes a KML file using latitude and longitude coordinates in a Stata data set. Enjoy.

↧

On work

October 14, 2011, 8:41 pm

≫ Next: Mapping Durham

≪ Previous: From Stata to Google Maps

Last week Ezra Klein wrote this. His main point, the way I read it, is that the Obama stimulus was insufficient. But he makes a few other claims, one of which is that the government should have stopped firing people and it should have even hired more, even in totally bogus make-work jobs -- one assistant to every park ranger, three trainees to every firefighter, etc. He reasons that in drawn-out recessions the labor market is not quick enough to reassign people to productive jobs. Instead, people stay unemployed until their skills erode, which makes them even less employable, which depresses demand further.

Skill erosion is a legitimate worry, but I don't think Ezra Klein gets it right. Drawing a paycheck is not the only kind of productive work. Whether employed or not, as long as we are in the workforce we always do a combination of learning and paid work. We change the mix as we go along according to the going price of our time at the moment. When we are out of a job, our time is cheap. This ensures the best possible return on learning. Basically, when we're not busy drawing a paycheck, we're busy growing our stock of human capital.

A make-work job increases the going rate of the beneficiary's time, which makes learning more expensive. The rational response is to learn less. If most beneficiaries of make-work jobs respond rationally, these jobs will cause skill erosion, not cure it.

That may not be all. Your stock of human capital isn't only good for making you ready to make a living when employers start hiring again. That is only its more obvious use. The less obvious one is this: the more you know, the more you know what makes you happy. That improves your odds of living a good life. It may be one of comfort, adventure, contemplation or service. Whatever it is, I would not be surprised if getting there took at least some learning: think of it as serendipity with a little help from you.

So it seems to me that make-work jobs only reduce the growth rate of human knowledge both about things that we can do for profit and about things that make us happy. This makes them a good way to keep us in the doldrums -- both in our heads and in our wallets.

↧

Mapping Durham

November 19, 2011, 8:23 am

≫ Next: How many zeroes in that Poisson?

≪ Previous: On work

Today, Kirstin wanted to make a grocery trip to the Whole Foods at Bull City Market, then take Kate to the nearest playground. That seems to be Oval Drive Park, but it won't be obvious from querying the Durham Park Locator .

No worries. The Durham Park Locator gives you a pretty nice table with all 55 playgrounds as of today. You load it into Stata, hit it with geocode and writekml, and you get this Google map. Easy.

On this occasion I also discovered that my writekml, as submitted, had a tiny bug. I submitted the fix a few minutes ago. It should be up by your next update ado.

↧

How many zeroes in that Poisson?

January 26, 2012, 8:11 pm

≫ Next: A quick tip for using Stata in interactive mode

≪ Previous: Mapping Durham

I have a data set, and some of the variables there are counts of a given event.

Four count outcomes, the easiest thing to do is a Poisson regression, but before you do that, it's worth asking if what you see there really is close enough to a Poisson process.

You could check whether the variance is more or less equal to the mean, but with real-life data you can bet that there will be a difference between the two, and you'll be left scratching your head as to whether it's too big for Poisson, or just about right to pass the smell test.

Another thing you can do is check whether the count variable shows the right number of zeroes. In a Poisson distribution, the marginal probability of a zero outcome is exp(-mean). If the proportion of zeroes that you see is a lot higher than this value, and it usually is when you're looking at counts of rare events, then you will have to consider a zero-inflated poisson or a finite-mixture model, as discussed with wonderful clarity in chapter 17 of Microeconometrics using Stata by Cameron and Trivedi.

That brings me to the immediate cause for this post. I thought I'd code up a quick program to check those zeroes for a given data set and count variable, and I did this:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den r(N)
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num r(N)
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Then I ran the thing, and it kept turning up a share of 1.0, that is 100% zeroes observed, no matter the data set or the variable of interest y. You know why? Because local den r(N) will evaluate to r(N), and that will be filled in by the last command that returns such a thing before `den' is invoked. That command is sum `y'. The same thing happens to `num'. So I took the ratio of the same number. The returned values from the calls to count that I had made right before defining both local num and local den were quietly obliterated. Isn't that a sneaky bug? The correct code is below:


capture prog drop checkZifPoisson
program checkZifPoisson

version 12

args y dataset

local f _col(60) %5.2fc

di ""
useData `dataset' // nevermind how this is coded up
di ""
di "Checking `dataset':"
qui {
   count
   local den `r(N)'
   replace `y'=0 if missing(`y')
   count if `y'==0
   local num `r(N)'
   sum `y'
   local zpois exp(-`r(mean)')
   local zobs `num'/`den'
}
di "Share of `y'=0 in a Poisson process: " `f' `zpois'
di "Share of `y'=0 observed: " `f' `zobs' 

end

Now `den' stores the actual value returned by the count before it was defined as local den `r(N)', as intended. Mind your apostrophes, is all I'm saying.

Erratum: actually, that's not all that I should have said. As Nick Cox observes in the comments below, you should mind your equal signs too.

↧

A quick tip for using Stata in interactive mode

February 8, 2012, 10:09 am

≫ Next: Fighting the R graphics

≪ Previous: How many zeroes in that Poisson?

You don't always want to start a do-file in the editor for every small thing, though I usually do, and then trash it if I don't need it. So, my default stance is that I want to preserve work for later.

Yours may be the opposite. If so, one option is to type in the Command window. If you decide that you do want that work preserved for later after all, you can always save the content of the Review window as a .do file.

Another option is to have this in your profile.do file:


// log today's interactive commands
cmdlog using "~/data/cmdlogs/cmdlog `c(current_date)'.smcl", append

This saves a running log with everything you typed at the command line on a given day, in the folder data/cmdlogs. This will save the commands, but not the output (that's the difference between calling cmdlog as opposed to log).

More on this topic here. That may well be where I got the idea to put this in my own profile.do, but if anybody thinks otherwise, I'll be glad to append this post with the correct credit.

↧

Fighting the R graphics

February 13, 2012, 5:29 pm

≫ Next: Stata 12 with MacVim, updated

≪ Previous: A quick tip for using Stata in interactive mode

If you've ever seen the Error in plot.new(): figure margins too large message before, this is the best overview of the problem that I could find anywhere.

There can be a lot of knobs to turn when it comes to graphics, no matter what statistical programming environment you use. In R, typing par() at the prompt will list them all.

↧

Stata 12 with MacVim, updated

February 27, 2012, 8:35 am

≫ Next: Human rights stats, part 1

≪ Previous: Fighting the R graphics

A while back I showed how to get Stata 12 to work with MacVim. This is to let you know about a bug fix. I posted the details on the Statalist just now. If you're reading this blog and you're not also a Statalist subscriber, you may want to change that.

↧

Human rights stats, part 1

March 3, 2012, 7:39 am

≫ Next: Human rights stats, part 2

≪ Previous: Stata 12 with MacVim, updated

I follow @simplystats on Twitter, and on March 1 they had a post that linked to an article in Foreign Policy about a guy who has the coolest job in applied stats. He works here.

The original piece described a quick algorithm that you can use to estimate the number of human rights violations using a technique first devised for counting fish in a pond. The gist of it is this: catch and release fish over two days. Tag the fish caught on the first day. Count each day's catch and the number of fish caught twice. That is the overlap. To estimate the number of fish in the pond, multiply the two days' catches and divide the total by the overlap.

I had a data set of insurance claims in Stata's memory at the time of my reading, with observations uniquely identified by a variable named claim_id.

I decided to use it as the model of a pond with as many fish in it as observations in my data set, so I wrote a little fishing program. It takes one argument: some round upper bound of the number of fish I might catch in a day. I'll call it n. It can be 100, or it can be 1,000. Here:


// try MSE
capture prog drop guessObservations
program guessObservations

args n // upper bound of a day's catch.

qui {
   local day1fishcount=int(runiform()*`n')
   local day2fishcount=int(runiform()*`n')

   forvalues i=1/2 {
      preserve
      tempfile day`i'fishlist
      sample `day`i'fishcount', count
      keep claim_id
      save "`day`i'fishlist'", replace
      restore
   }

   preserve
   drop _all
   use "`day1fishlist'"
   merge 1:1 claim_id using "`day2fishlist'"
   count if _merge==3
   local overlap=r(N)
   restore

   local totalfish=`day1fishcount'*`day2fishcount'
   if `overlap'>0 {
      local totalfish=`totalfish'/`overlap'
   }
   count
   local truect=r(N)
}

local fmt _col(30) %10.0fc
di ""
di "Fish caught on day 1:" `fmt' `day1fishcount'
di "Fish caught on day 2:" `fmt' `day2fishcount'
di "Overlap:"              `fmt' `overlap'
di "Estimate:"             `fmt' `totalfish'
di "True count:"           `fmt' `truect'

end

My data set has some 150,000 observations. Choosing a small n, say guessObservations 100, sets me up for an overlap of zero, but even so the two catches multiplied together won't even come close to the true size of the population. This is a technique for counting hungry fish in a small pond, not in an ocean. The size of the daily catch should be representative of the total, so you can have some decent overlap.

Setting n=1,000 keeps it small enough relative to the total population that it's still possible to have zero overlap, but n is now large enough to overshoot wildly in that case. If I catch 900 fish each day with zero overlap, I will guess that there are 810,000 fish there. However, an overlap as small as 5 will get me pretty close to the true population.

Setting n=10,000 performs much better. I may still have a day when the fish won't bite, and get this:


. guessObservations 10000

Fish caught on day 1:                49
Fish caught on day 2:             4,182
Overlap:                              3
Estimate:                        68,306
True count:                     157,638

But with any luck, I will probably get this:


. guessObservations 10000

Fish caught on day 1:             9,662
Fish caught on day 2:             3,220
Overlap:                            220
Estimate:                       141,417
True count:                     157,638

The larger n, the larger the overlap, and the better the precision. That makes sense: in the limit, the true number times itself divided by itself will yield the true number.

But does n have to be very large relative to the size of the population? And does my guess -- or the uncertainty surrounding it -- depend on what probability distribution function I assume for the daily catch? Next time I'll be doing some simulations.

↧

Human rights stats, part 2

March 4, 2012, 7:44 am

≫ Next: Human right stats, one last thing

≪ Previous: Human rights stats, part 1

My previous post promised some simulations. To refresh your memory, I am trying to see how reliably Multiple Systems Estimation, as described here, can guess the true number of fish in a pond.

The density plot below tells the story. The true number of fish is 150,000. Each catch limit lets you make a best guess, which is the x-coordinate of the peak of its associated bell curve. The shape of each bell curve measures the uncertainty surrounding the guess: the flatter the bell, the more uncertain the guess. Perfect foresight would be a spike at the 150,000 x-mark. Curves that peak away from that mark make biased guesses.

It is obvious that larger daily catch limits allow you to guess better. Very low catch limits set you up for severe downward bias. The red curve, corresponding to a daily catch of up to 500 fish, cannot help underestimating the true population size, for reasons discussed in the previous post. Then there seems to be a range of catch limits that improve on the bias, but increase the uncertainty horribly -- that's the green curve. So, you need higher limits, but you don't have to go crazy. The gain in precision after some point is not worth it: though a limit of 20,000 is very accurate, a limit of half that is not much worse.

I was inspired to run this exercise by a class I'm taking. The work would have been slower without help from here (I recommend the book, I bought it) and here. The picture would not have looked this good without ggplot2 (you should get that book too). All errors are my own. Here's the code:


# http://www.foreignpolicy.com/articles/2012/02/27/the_body_counter?page=full
# simulate MSE = Multiple Systems Estimation

# SOME HOUSEKEPING FIRST

# pretty picture comes from here
library("ggplot2")

# true population size
population 0) {
      guess

↧

Human right stats, one last thing

March 5, 2012, 7:55 am

≫ Next: Turn a date into Stata format quickly

≪ Previous: Human rights stats, part 2

The R code in my previous post could also produce the picture below. The implication is this:

A small sample is still bad news. It is biased toward underestimating the population. There's nothing you can do about that. The larger the sample, the better. How large a sample do you need? You might get lucky with as little as 1,000, for the reason that I mentioned in my first installment on the topic: small samples only need a small overlap to guess pretty well. That's why the green curve now peaks at the true population mark. But you'd have to be lucky, as my previous picture demonstrates by counter-example. And even a correct guess will be surrounded by a lot of uncertainty if you have a small sample: the green curve is still the flattest of the three that guess correctly. Finally, the gain from increasing the catch limit from 10,000 to 20,000 is not trivial after all: the purple curve is quite a bit peakier than the blue one.

What this simulation shows is that MSE relies on having representative samples of the true population. There's no way out of that requirement. You also want to run your code more than once. Though you will be able to dismiss easily sample sizes that are clearly too small, there may be a range of sample sizes that can provide false comfort. I could have easily seen this picture first and concluded that 1,000 isn't great, but it still hits the mark, so maybe it's good enough. That would have been wrong. On the other hand, now I'm pretty sure that 10,000 is still alright, though not as good as it looked before.

↧

Turn a date into Stata format quickly

March 18, 2012, 8:02 am

≫ Next: Stata for stocks

≪ Previous: Human right stats, one last thing

There's a little program that's shown up more than once now in my housekeeping do-files, so it may be useful enough for a blog post, but it doesn't quite warrant a spot in c(sysdir_personal) as a stand-alone ado-file. Here:


// turn this date to Stata format
// if it's not that way already
capture prog drop setStataDate
program setStataDate

args v fmt // fmt can be MDY or YMD
capture confirm string variable `v'
if _rc==0 {
   local l`v' : variable label `v'
   gen x=date(`v',"`fmt'")
   format x %td
   drop `v'
   rename x `v'
   label variable `v' "`l`v''"
   order `v'
}

end

I use it with data sets derived from merging other data sets. It's useful if in the original data sets there are string dates in mixed formats -- maybe YYYY-MM-DD in the "master", and MM/DD/YYYY in the "using" -- or if these string dates have labels I want to keep. So, you see why it's not clear that this is worth an ado-file. I don't want to type all the code between the curlies more than once, but usually I don't have to.

I do want to be able to call this program by name, as in setStataDate somedate MDY from within another program, then forget about it, safe in the knowledge that it won't make any difference if somedate is already in Stata format. That's the job of the if-condition you see there, and this is all this little program does.

↧

Stata for stocks

March 20, 2012, 10:00 am

≫ Next: Do-file rules, revisited

≪ Previous: Turn a date into Stata format quickly

The people at StataCorp are on Facebook, and the other day they linked to this blog post by Paul Clist about checking on a stock you might own through clever use of the stockquote Stata command.

Last year I bought some Netflix stock when it fell to $77 after the Qwikster fail. I agreed with the general public that it was a stupid idea, but I still thought that the hit their stock took was a bit of an overreaction. The streaming business was still good. Maybe not $250 per share good, once the content suppliers would catch wise and raise their prices, but my family was still happy with it as a TV substitute. That's about the full extent of thought I'm going to ever put into picking any stock, so don't be too surprised that I don't make big bets. This one was just shy of $500 -- whatever round number of shares plus the broker's commission fit there.

Still, it was just never clear to me how good this choice of spending $500 was relative to the Nasdaq Composite. Of course, I could have looked it up on bigcharts.com, but why not have a picture with the real dollars I have at stake on the y axis, and my true time line on the x axis? It wasn't too hard to expand Paul's code to a set of programs that can take any of the four stocks in my toy portfolio and put it against some appropriate stock market index, to show how it's been doing in one quick tsline graph. Here's Netflix as of last Friday:

My code takes the starting dollar amount from my brokerage statement, and it augments both the holdings and the index baseline with any subsequent purchases or sales (there aren't any in this case). This way I can simply collapse (sum) both the "index" (transformed into a price-per-unit times the units held of each stock following Paul's formula) and the valuation of each stock into daily totals, and plot the performance of the whole portfolio relative to my stock index of choice, with actual dollars on the y axis.

I like it. It works fine for me. I already use Stata to do the household's budget, I used it to compare true costs to own of water heaters when I was in the market for one, and I used it to track wet and dirty diapers when my kid was a few days old. So, thank you, Paul, for helping me find yet another civilian use for this fine piece of software.

↧

Do-file rules, revisited

May 22, 2012, 12:06 pm

≫ Next: What I’m up to

≪ Previous: Stata for stocks

Back in 2009 I wrote this post, detailing what at the time I thought would be a good way to write do-files. Some of the ideas there have stood the test of time. Others haven't. The changes are driven by Stata's evolution, by new things I've learned and by ways that my work changed. This is a quick review.

First, as of Stata 12, you don't need to set memory anymore. Second, clear should now be replaced by clear all. In addition, J. Scott Long recommends that you type macro drop _all right after it. I know this because I'm reading his Workflow book right now. I know I'm three years late (two if you count from where one of this blog's readers recommended it to me for the first time). I'm still finding useful stuff there. Next, as Jess suggested in the comments thread to the original post, I now use set varabbrev off.

Finally, my do-files no longer have a Globals section. Instead, there's now a program that defines all macros in one place as local macros and returns them. I first started doing this back in 2010, as detailed here. At the time it seemed like a slick thing to do. I just assumed, wrongly, that this would be faster at execution time than the original solution that gave me the idea (using a separate do-file, called with include).

The staying power of having a program for defining locals came not from its execution speed, but from its versatility. You can define all the locals you want inside a program you call, say, setLocals. If you need more locals as the requirements for your code grow, you just pile them on inside this program, and remember to also return them.

Then, whenever any specific local macro is needed, you call setLocals and only recover the `r()' value that you need then. Locals can be substituted for the obvious things -- like operating system-specific file paths or hard-coded numbers -- and also for names of programs you define somewhere else. This will also spare you the inconvenience of reading a do-file where `this' local shows up all of a sudden and if it's not obvious what it holds, you must work your way up to see where it was defined: if all locals are defined in setLocals, you will always know where to look.

This is probably a terrible way to use memory, but the convenience of having all my locals defined in one place and any arbitrary subset of them available with the same simple call to setLocals is well worth it. You can extend this model in all sorts of ways, with some care. You can, for example, give setLocals an optional argument (using something as flexible as syntax [anything]) so its behavior is changed according to whether the argument is present. For example, a call to setLocals on its own will return a default set of local macros that apply everywhere; a call to setLocals andAlsoTheseOtherLocals will return the default macros plus a set defined inside a second program, andAlsoTheseOtherLocals, to be used in some specific context.

↧

Latest Images