[buug] grep weirdness

Tue May 21 12:11:28 PDT 2002

On 21 May 2002, Ian Zimmerman wrote:

itz> Any reason why tmp cannot be simply the following?
itz> 
itz> ^CASEID	
itz> ^[0-9]+	
itz> 
itz> And, given that _all_ records of your sample file seem to match the
itz> one of the original regexes, what problem are you really trying to
itz> solve here?

D'oh!  My description of "tmp" is confusing.  You're correct in that when
there's a one-to-one match between the regexes in the pattern file "tmp"
and the values of the subset in "mrw", a paste or join command will merge
the files together correctly (and, in fact, that's what I generally do in
such a circumstance).

Where a paste or join command will not work is when there are fewer
regexes in "tmp" than there are records in "mrw".  It's in this case that
I turn to the grep statement.  (And this should explain why simply
matching on numeric digits isn't feasible.)

(I've moved the description of the files to the bottom of this email for
reference.)

First, let me say that (as I noted in the original email) I've already
solved my particular problem.  My email, therefore, is more of a general
inquiry: "Why does grep run fine when there are 1,000 regexes in the
pattern file but raise an error when there are 8,000?"

Okay, let me try to explain what's going on and why I'm doing things the
way that I am...

This is all part of the dataset cleaning and manipulation for my thesis.
The original dataset is 8,916 respondents; the subset that I'm using is
somewhere around 1,100 respondents.  (I'm still in the process of
subsetting my data, so I don't yet know what the final size of the subset
will be.  Right now, it's at 1,110 respondents.)

The variable "relative_person_weight" was not part of the original
dataset; I needed to create it myself.  I don't think that I need to go
into the details of how the variable was calculated, as that would just
further complicate this discussion; however, it is important to know that
it needed to be calculated from the full dataset of 8,916 respondents.

So, basically, what I've got are two datasets: a full dataset _with_ the
relative_person_weight variable and a subset _without_ the
relative_person_weight variable.  And what I need to do is merge this
variable into the subset (for each record that is present in the subset).
Essentially, this is the same thing as an inner-join in SQL.  Make sense?
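
To make the inner-join idea concrete, here is a minimal sketch on toy data.
It uses awk rather than the grep statement described in this email, so it
illustrates the desired result, not the method actually used.  The file
name "subset" and its "age" column are made up for the example; only "mrw"
and its two columns come from the real setup.

```shell
#!/bin/sh
# Toy data in a scratch directory; the real files are far larger.
dir=$(mktemp -d)

# mrw: full dataset, CASEID <tab> relative_person_weight
printf 'CASEID\trelative_person_weight\n1\t2.2096\n2\t1.25369\n3\t0.734946\n' > "$dir/mrw"

# subset: hypothetical file with CASEID plus one invented variable ("age")
printf 'CASEID\tage\n1\t34\n3\t51\n' > "$dir/subset"

# First file (mrw): remember each weight, keyed by CASEID.
# Second file (subset): print the record plus its weight when the CASEID is known.
joined=$(awk -F'\t' -v OFS='\t' '
    NR == FNR    { weight[$1] = $2; next }
    $1 in weight { print $0, weight[$1] }
' "$dir/mrw" "$dir/subset")

printf '%s\n' "$joined"
```

Only CASEIDs present in both files survive, which is what makes this an
inner join; CASEID 2 is dropped because it is not in the subset.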

Hopefully, this explains the problem.  I solve it by using a grep
statement (paste and join won't work since there are a different number of
records in each dataset). Specifically, I use grep to identify records
that match on a particular unique identifier (i.e., CASEID).  I use the
CASEIDs from the subset as patterns to match against the full dataset
(with the variable to merge in).  Generally, this works fine.  However,
yesterday I discovered that if the subset is over a certain size, I'll get
a "grep: Regular expression too big" error.  I'm not sure exactly what
this size limit is -- 1,110 records works fine, 8,916 doesn't.
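
As a small illustration of the grep approach just described, here is a
sketch on miniature versions of the two files.  The record for CASEID 13
and its weight are invented for the example, to show why each pattern ends
in a tab.

```shell
#!/bin/sh
# Miniature stand-ins for the real files, kept in a scratch directory.
dir=$(mktemp -d)

# mrw: CASEID <tab> relative_person_weight (the CASEID 13 row is made up)
printf 'CASEID\trelative_person_weight\n1\t2.2096\n2\t1.25369\n3\t0.734946\n13\t0.5\n' > "$dir/mrw"

# tmp: one anchored pattern per line, each ending in a tab,
# so that "^1<tab>" matches CASEID 1 but not CASEID 13
printf '^CASEID\t\n^1\t\n^3\t\n' > "$dir/tmp"

# Keep only the mrw records whose CASEID matches a pattern in tmp
merged=$(grep -f "$dir/tmp" "$dir/mrw")

printf '%s\n' "$merged"
```

With these toy files, the header line and the records for CASEIDs 1 and 3
are kept, while 2 and 13 are dropped.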

The reason that I discovered this is that yesterday I happened to use my
grep statement, instead of a paste, against a full dataset.  That
shouldn't matter, since either should give the same results.  grep,
however, raised the error instead, and I'm just wondering why.

Claude

p.s.  For those at last week's BUUG who asked why I don't just use a
database to manage my dataset, this is one reason why.  I'll often realize
that I've dropped a variable from a previous revision of the subset that I
actually need.  Then I have to merge in the variable from the previous
revision to whatever subset I'm currently working on.  It's true that if
the dataset were in a database, I could use SQL join commands to
accomplish this; however, if I don't have the previous revision, that's of
no help.  By keeping my dataset as a text file, I can keep it under
revision control and thereby jump back to an earlier version of the
dataset at any moment.  I've currently got a total of 62 revisions across
multiple branches -- in my experience, revision control is definitely the
way to go.

-------------------------

grep -ftmp mrw
where

tmp is a single-column file (there's a tab character at the end of each
line): 
^CASEID 
^1 
^2 
^3 
... 
^8916 <-- this value could be 8915, 8914, 8913, or something similarly
large.  Also note that not every CASEID is necessarily present; for
example, values 4, 345, and 6437 could be missing.

and mrw is a two-column file delimited by a single tab:
CASEID	relative_person_weight
1	2.2096
2	1.25369
3	0.734946
...	...
8916	0.701728
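
For reference, a pattern file in the "tmp" format described above could be
generated along these lines.  This is only a sketch: it assumes the subset
lives in a tab-delimited file called "subset" (a hypothetical name) with
CASEID in the first column, and the "age" column is invented.

```shell
#!/bin/sh
dir=$(mktemp -d)

# Hypothetical subset file: CASEID plus one made-up variable, tab-delimited
printf 'CASEID\tage\n1\t34\n3\t51\n' > "$dir/subset"

# Turn each CASEID into an anchored pattern: "^" + CASEID + trailing tab
awk -F'\t' '{ printf "^%s\t\n", $1 }' "$dir/subset" > "$dir/tmp"

patterns=$(cat "$dir/tmp")
printf '%s\n' "$patterns"
```

The resulting file can then be passed to grep with -f, as in the command
shown above.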