[buug] grep weirdness

Ian Zimmerman itz at speakeasy.org
Tue May 21 12:54:38 PDT 2002


itz> Any reason why tmp cannot be simply the following?
itz> 
itz> ^CASEID ^[0-9]+
itz> 
itz> And, given that _all_ records of your sample file seem to match
itz> the one of the original regexes, what problem are you really
itz> trying to solve here?

Claude> D'oh!  My description of "tmp" is confusing.  You're correct
Claude> in that when there's a one-to-one match between the regexes in
Claude> the pattern file "tmp" and the values of the subset in "mrw",
Claude> a paste or join command will merge the files together
Claude> correctly (and, in fact, that's what I generally do in such a
Claude> circumstance).

Claude> Where a paste or join command will not work is when there are
Claude> fewer regexes in "tmp" than there are records in "mrw".  It's
Claude> in this case that I turn to the grep statement.  (And this
Claude> should explain why simply matching on numeric digits isn't
Claude> feasible.)

Claude> (I've moved the description of the files to the bottom of this
Claude> email for reference.)

Claude> First, let me say that (as I noted in the original email) I've
Claude> already solved my particular problem.  My email, therefore, is
Claude> more of a general inquiry: "Why does grep run fine when there
Claude> are 1,000 regexes in the pattern file but raise an error when
Claude> there are 8,000?"

Let me comment on your actual problem first.  I wouldn't use an SQL
database to do this, but I would use a Berkeley DB "database" (really
just a disk based hash lookup table), rather then grep.  Perl (and
probably Python, for the reptile lovers among us) would be a perfect
interface language.

Claude> Hopefully, this explains the problem.  I solve it by using a
Claude> grep statement (paste and join won't work since there are a
Claude> different number of records in each dataset). Specifically, I
Claude> use grep to identify records that match on a particular unique
Claude> identifier (i.e., CASEID).  I use the CASEIDs from the subset
Claude> as patterns to match against the full dataset (with the
Claude> variable to merge in).  Generally, this works fine.  However,
Claude> yesterday I discovered that if the subset is over a certain
Claude> size, I'll get a "grep: Regular expression too big" error.
Claude> I'm not sure exactly what this size limit is -- 1,110 records
Claude> works fine, 8,916 doesn't.

Maybe this lovely bit of regex.c (which is linked into GNU grep) will
shed some light.

 > /* This is not an arbitrary limit: the arguments which represent offsets
 >    into the pattern are two bytes long.  So if 2^16 bytes turns out to
 >    be too small, many things would have to change.  */
 > /* Any other compiler which, like MSC, has allocation limit below 2^16
 >    bytes will have to use approach similar to what was done below for
 >    MSC and drop MAX_BUF_SIZE a bit.  Otherwise you may end up
 >    reallocating to 0 bytes.  Such thing is not going to work too well.
 >    You have been warned!!  */
 > #if defined _MSC_VER  && !defined WIN32
 > /* Microsoft C 16-bit versions limit malloc to approx 65512 bytes.
 >    The REALLOC define eliminates a flurry of conversion warnings,
 >    but is not required. */
 > # define MAX_BUF_SIZE  65500L
 > # define REALLOC(p,s) realloc ((p), (size_t) (s))
 > #else
 > # define MAX_BUF_SIZE (1L << 16)
 > # define REALLOC(p,s) realloc ((p), (s))
 > #endif

 > /* Extend the buffer by twice its current size via realloc and
 >    reset the pointers that pointed into the old block to point to the
 >    correct places in the new one.  If extending the buffer results in it
 >    being larger than MAX_BUF_SIZE, then flag memory exhausted.  */

In other words, it's essentially a legacy 16-bit code issue.

However, note that even if this is fixed, it will only work for
backtracking greps (like GNU grep and other GNU tools based on the
regex library).  Other regular expression tools may be based on
NFA -> DFA conversion, and for such the number of DFA states, unless
the expression is hand-optimized, is exponential in the size of the
problem.  So I'd say that a regular expression approach is still a
poor match for your situation, in general.  See my suggestion above.

Cheers,

-- 
Ian Zimmerman, Oakland, California, U.S.A.
GPG: 433BA087  9C0F 194F 203A 63F7 B1B8  6E5A 8CA3 27DB 433B A087
EngSoc adopts market economy: cheap is wasteful, efficient is expensive.




More information about the buug mailing list