diff --git a/src/backend/regex/README b/src/backend/regex/README index 6c9f483..b4a7ad7 100644 *** a/src/backend/regex/README --- b/src/backend/regex/README *************** and similarly additional source files re *** 27,39 **** regexec. This was done to avoid exposing internal symbols globally; all functions not meant to be part of the library API are static. ! (Actually the above is a lie in one respect: there is one more global ! symbol, pg_set_regex_collation in regcomp. It is not meant to be part of ! the API, but it has to be global because both regcomp and regexec call it. ! It'd be better to get rid of that, as well as the static variables it ! sets, in favor of keeping the needed locale state in the regex structs. ! We have not done this yet for lack of a design for how to add ! application-specific state to the structs.) What's where in src/backend/regex/: --- 27,40 ---- regexec. This was done to avoid exposing internal symbols globally; all functions not meant to be part of the library API are static. ! (Actually the above is a lie in one respect: there are two more global ! symbols, pg_set_regex_collation and pg_reg_getcolor in regcomp. These are ! not meant to be part of the API, but they have to be global because both ! regcomp and regexec call them. It'd be better to get rid of ! pg_set_regex_collation, as well as the static variables it sets, in favor of ! keeping the needed locale state in the regex structs. We have not done this ! yet for lack of a design for how to add application-specific state to the ! structs.) What's where in src/backend/regex/: *************** colors: *** 274,301 **** an existing color has to be subdivided. The last two of these are handled with the "struct colordesc" array and ! the "colorchain" links in NFA arc structs. The color map proper (that ! is, the per-character lookup array) is handled as a multi-level tree, ! with each tree level indexed by one byte of a character's value. The ! code arranges to not have more than one copy of bottom-level tree pages ! that are all-the-same-color. ! Unfortunately, this design does not seem terribly efficient for common ! cases such as a tree in which all Unicode letters are colored the same, ! because there aren't that many places where we get a whole page all the ! same color, except at the end of the map. (It also strikes me that given ! PG's current restrictions on the range of Unicode values, we could use a ! 3-level rather than 4-level tree; but there's not provision for that in ! regguts.h at the moment.) ! A bigger problem is that it just doesn't seem very reasonable to have to ! consider each Unicode letter separately at regex parse time for a regex ! such as "\w"; more than likely, a huge percentage of those codes will ! never be seen at runtime. We need to fix things so that locale-based ! character classes are somehow processed "symbolically" without making a ! full expansion of their contents at parse time. This would mean that we'd ! have to be ready to call iswalpha() at runtime, but if that only happens ! for high-code-value characters, it shouldn't be a big performance hit. Detailed semantics of an NFA --- 275,330 ---- an existing color has to be subdivided. The last two of these are handled with the "struct colordesc" array and ! the "colorchain" links in NFA arc structs. ! Ideally, we'd do the first two operations using a simple linear array ! storing the current color assignment for each character code. ! Unfortunately, that's not terribly workable for large charsets such as ! Unicode. Our solution is to divide the color map into two parts. A simple ! linear array is used for character codes up to MAX_SIMPLE_CHR, which can be ! chosen large enough to include all popular characters (so that the ! significantly-slower code paths about to be described are seldom invoked). ! Characters above that need be considered at compile time only if they ! appear explicitly in the regex pattern. We store each such mentioned ! character or character range as an entry in the "colormaprange" array in ! the colormap. (Overlapping ranges are split into unique subranges, so that ! each range in the finished list needs only a single color that describes ! all its characters.) When mapping a character above MAX_SIMPLE_CHR to a ! color at runtime, we search this list of ranges explicitly. ! That's still not quite enough, though, because of locale-dependent ! character classes such as [[:alpha:]]. In Unicode locales these classes ! may have thousands of entries that are above MAX_SIMPLE_CHR, and we ! certainly don't want to be searching large colormaprange arrays at runtime. ! Nor do we even want to spend the time to initialize cvec structures that ! exhaustively describe all of those characters. Our solution is to compute ! exact per-character colors at compile time only up to MAX_SIMPLE_CHR. ! For characters above that, we apply the or lookup ! functions at runtime for each locale-dependent character class used in the ! regex pattern, constructing a bitmap that describes which classes the ! runtime character belongs to. The per-character-range data structure ! mentioned above actually holds, for each range, a separate color entry ! for each possible combination of character class properties. That is, ! the color map for characters above MAX_SIMPLE_CHR is really a 2-D array, ! whose rows correspond to character ranges that are explicitly mentioned ! in the input, and whose columns correspond to sets of relevant locale ! character classes. ! ! We build this color map in parallel with scanning the regex. Each time ! we detect a new explicit high character (or range) or a locale-dependent ! character class, we split existing entry(s) in the high color map so that ! characters we need to be able to distinguish will have distinct entries ! that can be given separate colors. Often, though, single entries in the ! high color map will represent very large sets of characters. ! ! If there are both explicit high characters/ranges and locale-dependent ! character classes, we may have entries in the high color map array that ! have non-WHITE colors but don't actually represent any real characters. ! (For example, in a row representing a singleton range, only one of the ! columns could possibly be a live entry; it's the one matching the actual ! locale properties for that single character.) We don't currently make ! any effort to reclaim such colors. In principle it could be done, but ! it's not clear that it's worth the trouble. Detailed semantics of an NFA diff --git a/src/backend/regex/regc_color.c b/src/backend/regex/regc_color.c index 8ffc8fb..7cb2d67 100644 *** a/src/backend/regex/regc_color.c --- b/src/backend/regex/regc_color.c *************** static void *** 49,58 **** initcm(struct vars * v, struct colormap * cm) { - int i; - int j; - union tree *t; - union tree *nextt; struct colordesc *cd; cm->magic = CMMAGIC; --- 49,54 ---- *************** initcm(struct vars * v, *** 64,87 **** cm->free = 0; cd = cm->cd; /* cm->cd[WHITE] */ cd->sub = NOSUB; cd->arcs = NULL; cd->firstchr = CHR_MIN; - cd->nchrs = CHR_MAX - CHR_MIN + 1; cd->flags = 0; ! /* upper levels of tree */ ! for (t = &cm->tree[0], j = NBYTS - 1; j > 0; t = nextt, j--) { ! nextt = t + 1; ! for (i = BYTTAB - 1; i >= 0; i--) ! t->tptr[i] = nextt; } ! /* bottom level is solid white */ ! t = &cm->tree[NBYTS - 1]; ! for (i = BYTTAB - 1; i >= 0; i--) ! t->tcolor[i] = WHITE; ! cd->block = t; } /* --- 60,99 ---- cm->free = 0; cd = cm->cd; /* cm->cd[WHITE] */ + cd->nschrs = MAX_SIMPLE_CHR - CHR_MIN + 1; + cd->nuchrs = 1; cd->sub = NOSUB; cd->arcs = NULL; cd->firstchr = CHR_MIN; cd->flags = 0; ! cm->locolormap = (color *) ! MALLOC((MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color)); ! if (cm->locolormap == NULL) { ! CERR(REG_ESPACE); ! cm->cmranges = NULL; /* prevent failure during freecm */ ! cm->hicolormap = NULL; ! return; } ! /* this relies on WHITE being zero: */ ! memset(cm->locolormap, WHITE, ! (MAX_SIMPLE_CHR - CHR_MIN + 1) * sizeof(color)); ! ! memset(cm->classbits, 0, sizeof(cm->classbits)); ! cm->numcmranges = 0; ! cm->cmranges = NULL; ! cm->maxarrayrows = 4; /* arbitrary initial allocation */ ! cm->hiarrayrows = 1; ! cm->hiarraycols = 1; ! cm->hicolormap = (color *) MALLOC(cm->maxarrayrows * sizeof(color)); ! if (cm->hicolormap == NULL) ! { ! CERR(REG_ESPACE); ! return; ! } ! /* initialize the "all other characters" row to WHITE */ ! cm->hicolormap[0] = WHITE; } /* *************** initcm(struct vars * v, *** 90,206 **** static void freecm(struct colormap * cm) { - size_t i; - union tree *cb; - cm->magic = 0; - if (NBYTS > 1) - cmtreefree(cm, cm->tree, 0); - for (i = 1; i <= cm->max; i++) /* skip WHITE */ - if (!UNUSEDCOLOR(&cm->cd[i])) - { - cb = cm->cd[i].block; - if (cb != NULL) - FREE(cb); - } if (cm->cd != cm->cdspace) FREE(cm->cd); } /* ! * cmtreefree - free a non-terminal part of a colormap tree */ ! static void ! cmtreefree(struct colormap * cm, ! union tree * tree, ! int level) /* level number (top == 0) of this block */ { ! int i; ! union tree *t; ! union tree *fillt = &cm->tree[level + 1]; ! union tree *cb; ! assert(level < NBYTS - 1); /* this level has pointers */ ! for (i = BYTTAB - 1; i >= 0; i--) { ! t = tree->tptr[i]; ! assert(t != NULL); ! if (t != fillt) { ! if (level < NBYTS - 2) ! { /* more pointer blocks below */ ! cmtreefree(cm, t, level + 1); ! FREE(t); ! } ! else ! { /* color block below */ ! cb = cm->cd[t->tcolor[0]].block; ! if (t != cb) /* not a solid block */ ! FREE(t); ! } } } - } - - /* - * setcolor - set the color of a character in a colormap - */ - static color /* previous color */ - setcolor(struct colormap * cm, - chr c, - color co) - { - uchr uc = c; - int shift; - int level; - int b; - int bottom; - union tree *t; - union tree *newt; - union tree *fillt; - union tree *lastt; - union tree *cb; - color prev; - - assert(cm->magic == CMMAGIC); - if (CISERR() || co == COLORLESS) - return COLORLESS; ! t = cm->tree; ! for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0; ! level++, shift -= BYTBITS) ! { ! b = (uc >> shift) & BYTMASK; ! lastt = t; ! t = lastt->tptr[b]; ! assert(t != NULL); ! fillt = &cm->tree[level + 1]; ! bottom = (shift <= BYTBITS) ? 1 : 0; ! cb = (bottom) ? cm->cd[t->tcolor[0]].block : fillt; ! if (t == fillt || t == cb) ! { /* must allocate a new block */ ! newt = (union tree *) MALLOC((bottom) ? ! sizeof(struct colors) : sizeof(struct ptrs)); ! if (newt == NULL) ! { ! CERR(REG_ESPACE); ! return COLORLESS; ! } ! if (bottom) ! memcpy(VS(newt->tcolor), VS(t->tcolor), ! BYTTAB * sizeof(color)); ! else ! memcpy(VS(newt->tptr), VS(t->tptr), ! BYTTAB * sizeof(union tree *)); ! t = newt; ! lastt->tptr[b] = t; ! } ! } ! b = uc & BYTMASK; ! prev = t->tcolor[b]; ! t->tcolor[b] = co; ! return prev; } /* --- 102,164 ---- static void freecm(struct colormap * cm) { cm->magic = 0; if (cm->cd != cm->cdspace) FREE(cm->cd); + if (cm->locolormap != NULL) + FREE(cm->locolormap); + if (cm->cmranges != NULL) + FREE(cm->cmranges); + if (cm->hicolormap != NULL) + FREE(cm->hicolormap); } /* ! * pg_reg_getcolor - slow case of GETCOLOR() */ ! color ! pg_reg_getcolor(struct colormap * cm, chr c) { ! int rownum, ! colnum, ! low, ! high; ! /* Should not be used for chrs in the locolormap */ ! assert(c > MAX_SIMPLE_CHR); ! ! /* ! * Find which row it's in. The colormapranges are in order, so we can use ! * binary search. ! */ ! rownum = 0; /* if no match, use array row zero */ ! low = 0; ! high = cm->numcmranges; ! while (low < high) { ! int middle = low + (high - low) / 2; ! const colormaprange *cmr = &cm->cmranges[middle]; ! ! if (c < cmr->cmin) ! high = middle; ! else if (c > cmr->cmax) ! low = middle + 1; ! else { ! rownum = cmr->rownum; /* found a match */ ! break; } } ! /* ! * Find which column it's in --- this is all locale-dependent. ! */ ! if (cm->hiarraycols > 1) ! colnum = cclass_column_index(cm, c); ! else ! colnum = 0; /* fast path if no relevant cclasses */ ! return cm->hicolormap[rownum * cm->hiarraycols + colnum]; } /* *************** maxcolor(struct colormap * cm) *** 216,222 **** } /* ! * newcolor - find a new color (must be subject of setcolor at once) * Beware: may relocate the colordescs. */ static color /* COLORLESS for error */ --- 174,180 ---- } /* ! * newcolor - find a new color (must be assigned at once) * Beware: may relocate the colordescs. */ static color /* COLORLESS for error */ *************** newcolor(struct colormap * cm) *** 278,289 **** cd = &cm->cd[cm->max]; } ! cd->nchrs = 0; cd->sub = NOSUB; cd->arcs = NULL; cd->firstchr = CHR_MIN; /* in case never set otherwise */ cd->flags = 0; - cd->block = NULL; return (color) (cd - cm->cd); } --- 236,247 ---- cd = &cm->cd[cm->max]; } ! cd->nschrs = 0; ! cd->nuchrs = 0; cd->sub = NOSUB; cd->arcs = NULL; cd->firstchr = CHR_MIN; /* in case never set otherwise */ cd->flags = 0; return (color) (cd - cm->cd); } *************** freecolor(struct colormap * cm, *** 305,317 **** assert(cd->arcs == NULL); assert(cd->sub == NOSUB); ! assert(cd->nchrs == 0); cd->flags = FREECOL; - if (cd->block != NULL) - { - FREE(cd->block); - cd->block = NULL; /* just paranoia */ - } if ((size_t) co == cm->max) { --- 263,271 ---- assert(cd->arcs == NULL); assert(cd->sub == NOSUB); ! assert(cd->nschrs == 0); ! assert(cd->nuchrs == 0); cd->flags = FREECOL; if ((size_t) co == cm->max) { *************** static color *** 354,370 **** pseudocolor(struct colormap * cm) { color co; co = newcolor(cm); if (CISERR()) return COLORLESS; ! cm->cd[co].nchrs = 1; ! cm->cd[co].flags = PSEUDO; return co; } /* * subcolor - allocate a new subcolor (if necessary) to this chr */ static color subcolor(struct colormap * cm, chr c) --- 308,332 ---- pseudocolor(struct colormap * cm) { color co; + struct colordesc *cd; co = newcolor(cm); if (CISERR()) return COLORLESS; ! cd = &cm->cd[co]; ! cd->nschrs = 0; ! cd->nuchrs = 1; /* pretend it is in the upper map */ ! cd->sub = NOSUB; ! cd->arcs = NULL; ! cd->firstchr = CHR_MIN; ! cd->flags = PSEUDO; return co; } /* * subcolor - allocate a new subcolor (if necessary) to this chr + * + * This works only for chrs that map into the low color map. */ static color subcolor(struct colormap * cm, chr c) *************** subcolor(struct colormap * cm, chr c) *** 372,378 **** color co; /* current color of c */ color sco; /* new subcolor */ ! co = GETCOLOR(cm, c); sco = newsub(cm, co); if (CISERR()) return COLORLESS; --- 334,342 ---- color co; /* current color of c */ color sco; /* new subcolor */ ! assert(c <= MAX_SIMPLE_CHR); ! ! co = cm->locolormap[c - CHR_MIN]; sco = newsub(cm, co); if (CISERR()) return COLORLESS; *************** subcolor(struct colormap * cm, chr c) *** 380,390 **** if (co == sco) /* already in an open subcolor */ return co; /* rest is redundant */ ! cm->cd[co].nchrs--; ! if (cm->cd[sco].nchrs == 0) cm->cd[sco].firstchr = c; ! cm->cd[sco].nchrs++; ! setcolor(cm, c, sco); return sco; } --- 344,380 ---- if (co == sco) /* already in an open subcolor */ return co; /* rest is redundant */ ! cm->cd[co].nschrs--; ! if (cm->cd[sco].nschrs == 0) cm->cd[sco].firstchr = c; ! cm->cd[sco].nschrs++; ! cm->locolormap[c - CHR_MIN] = sco; ! return sco; ! } ! ! /* ! * subcolorhi - allocate a new subcolor (if necessary) to this colormap entry ! * ! * This is the same processing as subcolor(), but for entries in the high ! * colormap, which do not necessarily correspond to exactly one chr code. ! */ ! static color ! subcolorhi(struct colormap * cm, color *pco) ! { ! color co; /* current color of entry */ ! color sco; /* new subcolor */ ! ! co = *pco; ! sco = newsub(cm, co); ! if (CISERR()) ! return COLORLESS; ! assert(sco != COLORLESS); ! ! if (co == sco) /* already in an open subcolor */ ! return co; /* rest is redundant */ ! cm->cd[co].nuchrs--; ! cm->cd[sco].nuchrs++; ! *pco = sco; return sco; } *************** newsub(struct colormap * cm, *** 400,406 **** sco = cm->cd[co].sub; if (sco == NOSUB) { /* color has no open subcolor */ ! if (cm->cd[co].nchrs == 1) /* optimization */ return co; sco = newcolor(cm); /* must create subcolor */ if (sco == COLORLESS) --- 390,397 ---- sco = cm->cd[co].sub; if (sco == NOSUB) { /* color has no open subcolor */ ! /* optimization: singly-referenced color need not be subcolored */ ! if ((cm->cd[co].nschrs + cm->cd[co].nuchrs) == 1) return co; sco = newcolor(cm); /* must create subcolor */ if (sco == COLORLESS) *************** newsub(struct colormap * cm, *** 417,552 **** } /* ! * subrange - allocate new subcolors to this range of chrs, fill in arcs */ ! static void ! subrange(struct vars * v, ! chr from, ! chr to, ! struct state * lp, ! struct state * rp) { ! uchr uf; int i; ! assert(from <= to); ! /* first, align "from" on a tree-block boundary */ ! uf = (uchr) from; ! i = (int) (((uf + BYTTAB - 1) & (uchr) ~BYTMASK) - uf); ! for (; from <= to && i > 0; i--, from++) ! newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp); ! if (from > to) /* didn't reach a boundary */ return; ! /* deal with whole blocks */ ! for (; to - from >= BYTTAB; from += BYTTAB) ! subblock(v, from, lp, rp); ! /* clean up any remaining partial table */ ! for (; from <= to; from++) ! newarc(v->nfa, PLAIN, subcolor(v->cm, from), lp, rp); } /* ! * subblock - allocate new subcolors for one tree block of chrs, fill in arcs * ! * Note: subcolors that are created during execution of this function ! * will not be given a useful value of firstchr; it'll be left as CHR_MIN. ! * For the current usage of firstchr in pg_regprefix, this does not matter ! * because such subcolors won't occur in the common prefix of a regex. */ static void ! subblock(struct vars * v, ! chr start, /* first of BYTTAB chrs */ ! struct state * lp, ! struct state * rp) { - uchr uc = start; struct colormap *cm = v->cm; ! int shift; ! int level; int i; - int b; - union tree *t; - union tree *cb; - union tree *fillt; - union tree *lastt; - int previ; - int ndone; - color co; - color sco; ! assert((uc % BYTTAB) == 0); ! /* find its color block, making new pointer blocks as needed */ ! t = cm->tree; ! fillt = NULL; ! for (level = 0, shift = BYTBITS * (NBYTS - 1); shift > 0; ! level++, shift -= BYTBITS) { ! b = (uc >> shift) & BYTMASK; ! lastt = t; ! t = lastt->tptr[b]; ! assert(t != NULL); ! fillt = &cm->tree[level + 1]; ! if (t == fillt && shift > BYTBITS) ! { /* need new ptr block */ ! t = (union tree *) MALLOC(sizeof(struct ptrs)); ! if (t == NULL) { ! CERR(REG_ESPACE); ! return; } - memcpy(VS(t->tptr), VS(fillt->tptr), - BYTTAB * sizeof(union tree *)); - lastt->tptr[b] = t; } } ! /* special cases: fill block or solid block */ ! co = t->tcolor[0]; ! cb = cm->cd[co].block; ! if (t == fillt || t == cb) { ! /* either way, we want a subcolor solid block */ ! sco = newsub(cm, co); ! t = cm->cd[sco].block; ! if (t == NULL) ! { /* must set it up */ ! t = (union tree *) MALLOC(sizeof(struct colors)); ! if (t == NULL) { ! CERR(REG_ESPACE); ! return; } - for (i = 0; i < BYTTAB; i++) - t->tcolor[i] = sco; - cm->cd[sco].block = t; } ! /* find loop must have run at least once */ ! lastt->tptr[b] = t; ! newarc(v->nfa, PLAIN, sco, lp, rp); ! cm->cd[co].nchrs -= BYTTAB; ! cm->cd[sco].nchrs += BYTTAB; return; } ! /* general case, a mixed block to be altered */ ! i = 0; ! while (i < BYTTAB) { ! co = t->tcolor[i]; ! sco = newsub(cm, co); ! newarc(v->nfa, PLAIN, sco, lp, rp); ! previ = i; ! do { ! t->tcolor[i++] = sco; ! } while (i < BYTTAB && t->tcolor[i] == co); ! ndone = i - previ; ! cm->cd[co].nchrs -= ndone; ! cm->cd[sco].nchrs += ndone; } } --- 408,907 ---- } /* ! * newhicolorrow - get a new row in the hicolormap, cloning it from oldrow ! * ! * Returns array index of new row. Note the array might move. */ ! static int ! newhicolorrow(struct colormap * cm, ! int oldrow) { ! int newrow = cm->hiarrayrows; ! color *newrowptr; int i; ! /* Assign a fresh array row index, enlarging storage if needed */ ! if (newrow >= cm->maxarrayrows) ! { ! color *newarray; ! if (cm->maxarrayrows >= INT_MAX / (cm->hiarraycols * 2)) ! { ! CERR(REG_ESPACE); ! return 0; ! } ! newarray = (color *) REALLOC(cm->hicolormap, ! cm->maxarrayrows * 2 * ! cm->hiarraycols * sizeof(color)); ! if (newarray == NULL) ! { ! CERR(REG_ESPACE); ! return 0; ! } ! cm->hicolormap = newarray; ! cm->maxarrayrows *= 2; ! } ! cm->hiarrayrows++; ! ! /* Copy old row data */ ! newrowptr = &cm->hicolormap[newrow * cm->hiarraycols]; ! memcpy(newrowptr, ! &cm->hicolormap[oldrow * cm->hiarraycols], ! cm->hiarraycols * sizeof(color)); ! ! /* Increase color reference counts to reflect new colormap entries */ ! for (i = 0; i < cm->hiarraycols; i++) ! cm->cd[newrowptr[i]].nuchrs++; ! ! return newrow; ! } ! ! /* ! * newhicolorcols - create a new set of columns in the high colormap ! * ! * Essentially, extends the 2-D array to the right with a copy of itself. ! */ ! static void ! newhicolorcols(struct colormap * cm) ! { ! color *newarray; ! int r, ! c; ! ! if (cm->hiarraycols >= INT_MAX / (cm->maxarrayrows * 2)) ! { ! CERR(REG_ESPACE); return; + } + newarray = (color *) REALLOC(cm->hicolormap, + cm->maxarrayrows * + cm->hiarraycols * 2 * sizeof(color)); + if (newarray == NULL) + { + CERR(REG_ESPACE); + return; + } + cm->hicolormap = newarray; ! /* Duplicate existing columns to the right, and increase ref counts */ ! /* Must work downwards in the array because we realloc'd in place */ ! for (r = cm->hiarrayrows - 1; r >= 0; r--) ! { ! color *oldrowptr = &newarray[r * cm->hiarraycols]; ! color *newrowptr = &newarray[r * cm->hiarraycols * 2]; ! color *newrowptr2 = newrowptr + cm->hiarraycols; ! for (c = 0; c < cm->hiarraycols; c++) ! { ! color co = oldrowptr[c]; ! ! newrowptr[c] = newrowptr2[c] = co; ! cm->cd[co].nuchrs++; ! } ! } ! ! cm->hiarraycols *= 2; } /* ! * subcolorcvec - allocate new subcolors to cvec members, fill in arcs * ! * For each chr "c" represented by the cvec, do the equivalent of ! * newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp); ! * ! * Note that in typical cases, many of the subcolors are the same. ! * While newarc() would discard duplicate arc requests, we can save ! * some cycles by not calling it repetitively to begin with. This is ! * mechanized with the "lastsubcolor" state variable. */ static void ! subcolorcvec(struct vars * v, ! struct cvec * cv, ! struct state * lp, ! struct state * rp) { struct colormap *cm = v->cm; ! color lastsubcolor = COLORLESS; ! chr ch, ! from, ! to; ! const chr *p; int i; ! /* ordinary characters */ ! for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--) ! { ! ch = *p; ! subcoloronechr(v, ch, lp, rp, &lastsubcolor); ! NOERR(); ! } ! /* and the ranges */ ! for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--) { ! from = *p; ! to = *(p + 1); ! if (from <= MAX_SIMPLE_CHR) ! { ! /* deal with simple chars one at a time */ ! chr lim = (to <= MAX_SIMPLE_CHR) ? to : MAX_SIMPLE_CHR; ! ! while (from <= lim) { ! color sco = subcolor(cm, from); ! ! NOERR(); ! if (sco != lastsubcolor) ! { ! newarc(v->nfa, PLAIN, sco, lp, rp); ! NOERR(); ! lastsubcolor = sco; ! } ! from++; } } + /* deal with any part of the range that's above MAX_SIMPLE_CHR */ + if (from < to) + subcoloronerange(v, from, to, lp, rp, &lastsubcolor); + else if (from == to) + subcoloronechr(v, from, lp, rp, &lastsubcolor); + NOERR(); } ! /* and deal with cclass if any */ ! if (cv->cclasscode >= 0) { ! int classbit; ! color *pco; ! int r, ! c; ! ! /* Enlarge array if we don't have a column bit assignment for cclass */ ! if (cm->classbits[cv->cclasscode] == 0) ! { ! cm->classbits[cv->cclasscode] = cm->hiarraycols; ! newhicolorcols(cm); ! NOERR(); ! } ! /* Apply subcolorhi() and make arc for each entry in relevant cols */ ! classbit = cm->classbits[cv->cclasscode]; ! pco = cm->hicolormap; ! for (r = 0; r < cm->hiarrayrows; r++) ! { ! for (c = 0; c < cm->hiarraycols; c++) { ! if (c & classbit) ! { ! color sco = subcolorhi(cm, pco); ! ! NOERR(); ! /* add the arc if needed */ ! if (sco != lastsubcolor) ! { ! newarc(v->nfa, PLAIN, sco, lp, rp); ! NOERR(); ! lastsubcolor = sco; ! } ! } ! pco++; } } ! } ! } ! ! /* ! * subcoloronechr - do subcolorcvec's work for a singleton chr ! * ! * We could just let subcoloronerange do this, but it's a bit more efficient ! * if we exploit the single-chr case. Also, callers find it useful for this ! * to be able to handle both low and high chr codes. ! */ ! static void ! subcoloronechr(struct vars * v, ! chr ch, ! struct state * lp, ! struct state * rp, ! color *lastsubcolor) ! { ! struct colormap *cm = v->cm; ! colormaprange *newranges; ! int numnewranges; ! colormaprange *oldrange; ! int oldrangen; ! int newrow; ! ! /* Easy case for low chr codes */ ! if (ch <= MAX_SIMPLE_CHR) ! { ! color sco = subcolor(cm, ch); ! ! NOERR(); ! if (sco != *lastsubcolor) ! { ! newarc(v->nfa, PLAIN, sco, lp, rp); ! *lastsubcolor = sco; ! } return; } ! /* ! * Potentially, we could need two more colormapranges than we have now, if ! * the given chr is in the middle of some existing range. ! */ ! newranges = (colormaprange *) ! MALLOC((cm->numcmranges + 2) * sizeof(colormaprange)); ! if (newranges == NULL) { ! CERR(REG_ESPACE); ! return; ! } ! numnewranges = 0; ! ! /* Ranges before target are unchanged */ ! for (oldrange = cm->cmranges, oldrangen = 0; ! oldrangen < cm->numcmranges; ! oldrange++, oldrangen++) ! { ! if (oldrange->cmax >= ch) ! break; ! newranges[numnewranges++] = *oldrange; ! } ! ! /* Match target chr against current range */ ! if (oldrangen >= cm->numcmranges || oldrange->cmin > ch) ! { ! /* chr does not belong to any existing range, make a new one */ ! newranges[numnewranges].cmin = ch; ! newranges[numnewranges].cmax = ch; ! /* row state should be cloned from the "all others" row */ ! newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0); ! numnewranges++; ! } ! else if (oldrange->cmin == oldrange->cmax) ! { ! /* we have an existing singleton range matching the chr */ ! newranges[numnewranges++] = *oldrange; ! newrow = oldrange->rownum; ! /* we've now fully processed this old range */ ! oldrange++, oldrangen++; ! } ! else ! { ! /* chr is a subset of this existing range, must split it */ ! if (ch > oldrange->cmin) { ! /* emit portion of old range before chr */ ! newranges[numnewranges].cmin = oldrange->cmin; ! newranges[numnewranges].cmax = ch - 1; ! newranges[numnewranges].rownum = oldrange->rownum; ! numnewranges++; ! } ! /* emit chr as singleton range, initially cloning from range */ ! newranges[numnewranges].cmin = ch; ! newranges[numnewranges].cmax = ch; ! newranges[numnewranges].rownum = newrow = ! newhicolorrow(cm, oldrange->rownum); ! numnewranges++; ! if (ch < oldrange->cmax) ! { ! /* emit portion of old range after chr */ ! newranges[numnewranges].cmin = ch + 1; ! newranges[numnewranges].cmax = oldrange->cmax; ! /* must clone the row if we are making two new ranges from old */ ! newranges[numnewranges].rownum = ! (ch > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) : ! oldrange->rownum; ! numnewranges++; ! } ! /* we've now fully processed this old range */ ! oldrange++, oldrangen++; ! } ! ! /* Update colors in newrow and create arcs as needed */ ! subcoloronerow(v, newrow, lp, rp, lastsubcolor); ! ! /* Ranges after target are unchanged */ ! for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++) ! { ! newranges[numnewranges++] = *oldrange; ! } ! ! /* Assert our original space estimate was adequate */ ! assert(numnewranges <= (cm->numcmranges + 2)); ! ! /* And finally, store back the updated list of ranges */ ! if (cm->cmranges != NULL) ! FREE(cm->cmranges); ! cm->cmranges = newranges; ! cm->numcmranges = numnewranges; ! } ! ! /* ! * subcoloronerange - do subcolorcvec's work for a high range ! */ ! static void ! subcoloronerange(struct vars * v, ! chr from, ! chr to, ! struct state * lp, ! struct state * rp, ! color *lastsubcolor) ! { ! struct colormap *cm = v->cm; ! colormaprange *newranges; ! int numnewranges; ! colormaprange *oldrange; ! int oldrangen; ! int newrow; ! ! /* Caller should take care of non-high-range cases */ ! assert(from > MAX_SIMPLE_CHR); ! assert(from < to); ! ! /* ! * Potentially, if we have N non-adjacent ranges, we could need as many as ! * 2N+1 result ranges (consider case where new range spans 'em all). ! */ ! newranges = (colormaprange *) ! MALLOC((cm->numcmranges * 2 + 1) * sizeof(colormaprange)); ! if (newranges == NULL) ! { ! CERR(REG_ESPACE); ! return; ! } ! numnewranges = 0; ! ! /* Ranges before target are unchanged */ ! for (oldrange = cm->cmranges, oldrangen = 0; ! oldrangen < cm->numcmranges; ! oldrange++, oldrangen++) ! { ! if (oldrange->cmax >= from) ! break; ! newranges[numnewranges++] = *oldrange; ! } ! ! /* ! * Deal with ranges that (partially) overlap the target. As we process ! * each such range, increase "from" to remove the dealt-with characters ! * from the target range. ! */ ! while (oldrangen < cm->numcmranges && oldrange->cmin <= to) ! { ! if (from < oldrange->cmin) ! { ! /* Handle portion of new range that corresponds to no old range */ ! newranges[numnewranges].cmin = from; ! newranges[numnewranges].cmax = oldrange->cmin - 1; ! /* row state should be cloned from the "all others" row */ ! newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0); ! numnewranges++; ! /* Update colors in newrow and create arcs as needed */ ! subcoloronerow(v, newrow, lp, rp, lastsubcolor); ! /* We've now fully processed the part of new range before old */ ! from = oldrange->cmin; ! } ! ! if (from <= oldrange->cmin && to >= oldrange->cmax) ! { ! /* old range is fully contained in new, process it in-place */ ! newranges[numnewranges++] = *oldrange; ! newrow = oldrange->rownum; ! from = oldrange->cmax + 1; ! } ! else ! { ! /* some part of old range does not overlap new range */ ! if (from > oldrange->cmin) ! { ! /* emit portion of old range before new range */ ! newranges[numnewranges].cmin = oldrange->cmin; ! newranges[numnewranges].cmax = from - 1; ! newranges[numnewranges].rownum = oldrange->rownum; ! numnewranges++; ! } ! /* emit common subrange, initially cloning from old range */ ! newranges[numnewranges].cmin = from; ! newranges[numnewranges].cmax = ! (to < oldrange->cmax) ? to : oldrange->cmax; ! newranges[numnewranges].rownum = newrow = ! newhicolorrow(cm, oldrange->rownum); ! numnewranges++; ! if (to < oldrange->cmax) ! { ! /* emit portion of old range after new range */ ! newranges[numnewranges].cmin = to + 1; ! newranges[numnewranges].cmax = oldrange->cmax; ! /* must clone the row if we are making two new ranges from old */ ! newranges[numnewranges].rownum = ! (from > oldrange->cmin) ? newhicolorrow(cm, oldrange->rownum) : ! oldrange->rownum; ! numnewranges++; ! } ! from = oldrange->cmax + 1; ! } ! /* Update colors in newrow and create arcs as needed */ ! subcoloronerow(v, newrow, lp, rp, lastsubcolor); ! /* we've now fully processed this old range */ ! oldrange++, oldrangen++; ! } ! ! if (from <= to) ! { ! /* Handle portion of new range that corresponds to no old range */ ! newranges[numnewranges].cmin = from; ! newranges[numnewranges].cmax = to; ! /* row state should be cloned from the "all others" row */ ! newranges[numnewranges].rownum = newrow = newhicolorrow(cm, 0); ! numnewranges++; ! /* Update colors in newrow and create arcs as needed */ ! subcoloronerow(v, newrow, lp, rp, lastsubcolor); ! } ! ! /* Ranges after target are unchanged */ ! for (; oldrangen < cm->numcmranges; oldrange++, oldrangen++) ! { ! newranges[numnewranges++] = *oldrange; ! } ! ! /* Assert our original space estimate was adequate */ ! assert(numnewranges <= (cm->numcmranges * 2 + 1)); ! ! /* And finally, store back the updated list of ranges */ ! if (cm->cmranges != NULL) ! FREE(cm->cmranges); ! cm->cmranges = newranges; ! cm->numcmranges = numnewranges; ! } ! ! /* ! * subcoloronerow - do subcolorcvec's work for one new row in the high colormap ! */ ! static void ! subcoloronerow(struct vars * v, ! int rownum, ! struct state * lp, ! struct state * rp, ! color *lastsubcolor) ! { ! struct colormap *cm = v->cm; ! color *pco; ! int i; ! ! /* Apply subcolorhi() and make arc for each entry in row */ ! pco = &cm->hicolormap[rownum * cm->hiarraycols]; ! for (i = 0; i < cm->hiarraycols; pco++, i++) ! { ! color sco = subcolorhi(cm, pco); ! ! NOERR(); ! /* make the arc if needed */ ! if (sco != *lastsubcolor) ! { ! newarc(v->nfa, PLAIN, sco, lp, rp); ! NOERR(); ! *lastsubcolor = sco; ! } } } *************** okcolors(struct nfa * nfa, *** 575,586 **** { /* is subcolor, let parent deal with it */ } ! else if (cd->nchrs == 0) { /* parent empty, its arcs change color to subcolor */ cd->sub = NOSUB; scd = &cm->cd[sco]; ! assert(scd->nchrs > 0); assert(scd->sub == sco); scd->sub = NOSUB; while ((a = cd->arcs) != NULL) --- 930,941 ---- { /* is subcolor, let parent deal with it */ } ! else if (cd->nschrs == 0 && cd->nuchrs == 0) { /* parent empty, its arcs change color to subcolor */ cd->sub = NOSUB; scd = &cm->cd[sco]; ! assert(scd->nschrs > 0 || scd->nuchrs > 0); assert(scd->sub == sco); scd->sub = NOSUB; while ((a = cd->arcs) != NULL) *************** okcolors(struct nfa * nfa, *** 597,603 **** /* parent's arcs must gain parallel subcolor arcs */ cd->sub = NOSUB; scd = &cm->cd[sco]; ! assert(scd->nchrs > 0); assert(scd->sub == sco); scd->sub = NOSUB; for (a = cd->arcs; a != NULL; a = a->colorchain) --- 952,958 ---- /* parent's arcs must gain parallel subcolor arcs */ cd->sub = NOSUB; scd = &cm->cd[sco]; ! assert(scd->nschrs > 0 || scd->nuchrs > 0); assert(scd->sub == sco); scd->sub = NOSUB; for (a = cd->arcs; a != NULL; a = a->colorchain) *************** dumpcolors(struct colormap * cm, *** 711,732 **** struct colordesc *end; color co; chr c; - char *has; fprintf(f, "max %ld\n", (long) cm->max); - if (NBYTS > 1) - fillcheck(cm, cm->tree, 0, f); end = CDEND(cm); for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */ if (!UNUSEDCOLOR(cd)) { ! assert(cd->nchrs > 0); ! has = (cd->block != NULL) ? "#" : ""; if (cd->flags & PSEUDO) ! fprintf(f, "#%2ld%s(ps): ", (long) co, has); else ! fprintf(f, "#%2ld%s(%2d): ", (long) co, ! has, cd->nchrs); /* * Unfortunately, it's hard to do this next bit more efficiently. --- 1066,1083 ---- struct colordesc *end; color co; chr c; fprintf(f, "max %ld\n", (long) cm->max); end = CDEND(cm); for (cd = cm->cd + 1, co = 1; cd < end; cd++, co++) /* skip 0 */ + { if (!UNUSEDCOLOR(cd)) { ! assert(cd->nschrs > 0 || cd->nuchrs > 0); if (cd->flags & PSEUDO) ! fprintf(f, "#%2ld(ps): ", (long) co); else ! fprintf(f, "#%2ld(%2d): ", (long) co, cd->nschrs + cd->nuchrs); /* * Unfortunately, it's hard to do this next bit more efficiently. *************** dumpcolors(struct colormap * cm, *** 741,772 **** dumpchr(c, f); fprintf(f, "\n"); } ! } ! ! /* ! * fillcheck - check proper filling of a tree ! */ ! static void ! fillcheck(struct colormap * cm, ! union tree * tree, ! int level, /* level number (top == 0) of this block */ ! FILE *f) ! { ! int i; ! union tree *t; ! union tree *fillt = &cm->tree[level + 1]; ! ! assert(level < NBYTS - 1); /* this level has pointers */ ! for (i = BYTTAB - 1; i >= 0; i--) { ! t = tree->tptr[i]; ! if (t == NULL) ! fprintf(f, "NULL found in filled tree!\n"); ! else if (t == fillt) { } - else if (level < NBYTS - 2) /* more pointer blocks below */ - fillcheck(cm, t, level + 1, f); } } --- 1092,1124 ---- dumpchr(c, f); fprintf(f, "\n"); } ! } ! /* dump the high colormap if it contains anything interesting */ ! if (cm->hiarrayrows > 1 || cm->hiarraycols > 1) { ! int r, ! c; ! const color *rowptr; ! ! fprintf(f, "other:\t"); ! for (c = 0; c < cm->hiarraycols; c++) { + fprintf(f, "\t%ld", (long) cm->hicolormap[c]); + } + fprintf(f, "\n"); + for (r = 0; r < cm->numcmranges; r++) + { + dumpchr(cm->cmranges[r].cmin, f); + fprintf(f, ".."); + dumpchr(cm->cmranges[r].cmax, f); + fprintf(f, ":"); + rowptr = &cm->hicolormap[cm->cmranges[r].rownum * cm->hiarraycols]; + for (c = 0; c < cm->hiarraycols; c++) + { + fprintf(f, "\t%ld", (long) rowptr[c]); + } + fprintf(f, "\n"); } } } diff --git a/src/backend/regex/regc_cvec.c b/src/backend/regex/regc_cvec.c index 3a9e8cf..50b7a45 100644 *** a/src/backend/regex/regc_cvec.c --- b/src/backend/regex/regc_cvec.c *************** *** 34,40 **** /* * Notes: ! * Only (selected) functions in _this_ file should treat chr* as non-constant. */ /* --- 34,41 ---- /* * Notes: ! * Only (selected) functions in _this_ file should treat the chr arrays ! * of a cvec as non-constant. */ /* *************** clearcvec(struct cvec * cv) *** 67,72 **** --- 68,74 ---- assert(cv != NULL); cv->nchrs = 0; cv->nranges = 0; + cv->cclasscode = -1; return cv; } diff --git a/src/backend/regex/regc_locale.c b/src/backend/regex/regc_locale.c index 399de02..7cb3a40 100644 *** a/src/backend/regex/regc_locale.c --- b/src/backend/regex/regc_locale.c *************** static const struct cname *** 349,354 **** --- 349,367 ---- } }; + /* + * The following arrays define the valid character class names. + */ + static const char *const classNames[NUM_CCLASSES + 1] = { + "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph", + "lower", "print", "punct", "space", "upper", "xdigit", NULL + }; + + enum classes + { + CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH, + CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT + }; /* * We do not use the hard-wired Unicode classification tables that Tcl does. *************** cclass(struct vars * v, /* context */ *** 544,564 **** index; /* - * The following arrays define the valid character class names. - */ - - static const char *const classNames[] = { - "alnum", "alpha", "ascii", "blank", "cntrl", "digit", "graph", - "lower", "print", "punct", "space", "upper", "xdigit", NULL - }; - - enum classes - { - CC_ALNUM, CC_ALPHA, CC_ASCII, CC_BLANK, CC_CNTRL, CC_DIGIT, CC_GRAPH, - CC_LOWER, CC_PRINT, CC_PUNCT, CC_SPACE, CC_UPPER, CC_XDIGIT - }; - - /* * Map the name to the corresponding enumerated value. */ len = endp - startp; --- 557,562 ---- *************** cclass(struct vars * v, /* context */ *** 593,610 **** * pg_ctype_get_cache so that we can cache the results. Other classes * have definitions that are hard-wired here, and for those we just * construct a transient cvec on the fly. */ switch ((enum classes) index) { case CC_PRINT: ! cv = pg_ctype_get_cache(pg_wc_isprint); break; case CC_ALNUM: ! cv = pg_ctype_get_cache(pg_wc_isalnum); break; case CC_ALPHA: ! cv = pg_ctype_get_cache(pg_wc_isalpha); break; case CC_ASCII: /* hard-wired meaning */ --- 591,610 ---- * pg_ctype_get_cache so that we can cache the results. Other classes * have definitions that are hard-wired here, and for those we just * construct a transient cvec on the fly. + * + * NB: keep this code in sync with cclass_column_index(), below. */ switch ((enum classes) index) { case CC_PRINT: ! cv = pg_ctype_get_cache(pg_wc_isprint, index); break; case CC_ALNUM: ! cv = pg_ctype_get_cache(pg_wc_isalnum, index); break; case CC_ALPHA: ! cv = pg_ctype_get_cache(pg_wc_isalpha, index); break; case CC_ASCII: /* hard-wired meaning */ *************** cclass(struct vars * v, /* context */ *** 625,634 **** addrange(cv, 0x7f, 0x9f); break; case CC_DIGIT: ! cv = pg_ctype_get_cache(pg_wc_isdigit); break; case CC_PUNCT: ! cv = pg_ctype_get_cache(pg_wc_ispunct); break; case CC_XDIGIT: --- 625,634 ---- addrange(cv, 0x7f, 0x9f); break; case CC_DIGIT: ! cv = pg_ctype_get_cache(pg_wc_isdigit, index); break; case CC_PUNCT: ! cv = pg_ctype_get_cache(pg_wc_ispunct, index); break; case CC_XDIGIT: *************** cclass(struct vars * v, /* context */ *** 646,661 **** } break; case CC_SPACE: ! cv = pg_ctype_get_cache(pg_wc_isspace); break; case CC_LOWER: ! cv = pg_ctype_get_cache(pg_wc_islower); break; case CC_UPPER: ! cv = pg_ctype_get_cache(pg_wc_isupper); break; case CC_GRAPH: ! cv = pg_ctype_get_cache(pg_wc_isgraph); break; } --- 646,661 ---- } break; case CC_SPACE: ! cv = pg_ctype_get_cache(pg_wc_isspace, index); break; case CC_LOWER: ! cv = pg_ctype_get_cache(pg_wc_islower, index); break; case CC_UPPER: ! cv = pg_ctype_get_cache(pg_wc_isupper, index); break; case CC_GRAPH: ! cv = pg_ctype_get_cache(pg_wc_isgraph, index); break; } *************** cclass(struct vars * v, /* context */ *** 666,671 **** --- 666,712 ---- } /* + * cclass_column_index - get appropriate high colormap column index for chr + */ + static int + cclass_column_index(struct colormap * cm, chr c) + { + int colnum = 0; + + /* Shouldn't go through all these pushups for simple chrs */ + assert(c > MAX_SIMPLE_CHR); + + /* + * Note: we should not see requests to consider cclasses that are not + * treated as locale-specific by cclass(), above. + */ + if (cm->classbits[CC_PRINT] && pg_wc_isprint(c)) + colnum |= cm->classbits[CC_PRINT]; + if (cm->classbits[CC_ALNUM] && pg_wc_isalnum(c)) + colnum |= cm->classbits[CC_ALNUM]; + if (cm->classbits[CC_ALPHA] && pg_wc_isalpha(c)) + colnum |= cm->classbits[CC_ALPHA]; + assert(cm->classbits[CC_ASCII] == 0); + assert(cm->classbits[CC_BLANK] == 0); + assert(cm->classbits[CC_CNTRL] == 0); + if (cm->classbits[CC_DIGIT] && pg_wc_isdigit(c)) + colnum |= cm->classbits[CC_DIGIT]; + if (cm->classbits[CC_PUNCT] && pg_wc_ispunct(c)) + colnum |= cm->classbits[CC_PUNCT]; + assert(cm->classbits[CC_XDIGIT] == 0); + if (cm->classbits[CC_SPACE] && pg_wc_isspace(c)) + colnum |= cm->classbits[CC_SPACE]; + if (cm->classbits[CC_LOWER] && pg_wc_islower(c)) + colnum |= cm->classbits[CC_LOWER]; + if (cm->classbits[CC_UPPER] && pg_wc_isupper(c)) + colnum |= cm->classbits[CC_UPPER]; + if (cm->classbits[CC_GRAPH] && pg_wc_isgraph(c)) + colnum |= cm->classbits[CC_GRAPH]; + + return colnum; + } + + /* * allcases - supply cvec for all case counterparts of a chr (including itself) * * This is a shortcut, preferably an efficient one, for simple characters; diff --git a/src/backend/regex/regc_pg_locale.c b/src/backend/regex/regc_pg_locale.c index 551ae7d..ad9d4b1 100644 *** a/src/backend/regex/regc_pg_locale.c --- b/src/backend/regex/regc_pg_locale.c *************** store_match(pg_ctype_cache *pcc, pg_wcha *** 736,742 **** * Note that the result must not be freed or modified by caller. */ static struct cvec * ! pg_ctype_get_cache(pg_wc_probefunc probefunc) { pg_ctype_cache *pcc; pg_wchar max_chr; --- 736,742 ---- * Note that the result must not be freed or modified by caller. */ static struct cvec * ! pg_ctype_get_cache(pg_wc_probefunc probefunc, int cclasscode) { pg_ctype_cache *pcc; pg_wchar max_chr; *************** pg_ctype_get_cache(pg_wc_probefunc probe *** 770,800 **** pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2); if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL) goto out_of_memory; /* ! * Decide how many character codes we ought to look through. For C locale ! * there's no need to go further than 127. Otherwise, if the encoding is ! * UTF8 go up to 0x7FF, which is a pretty arbitrary cutoff but we cannot ! * extend it as far as we'd like (say, 0xFFFF, the end of the Basic ! * Multilingual Plane) without creating significant performance issues due ! * to too many characters being fed through the colormap code. This will ! * need redesign to fix reasonably, but at least for the moment we have ! * all common European languages covered. Otherwise (not C, not UTF8) go ! * up to 255. These limits are interrelated with restrictions discussed ! * at the head of this file. */ switch (pg_regex_strategy) { case PG_REGEX_LOCALE_C: max_chr = (pg_wchar) 127; break; case PG_REGEX_LOCALE_WIDE: case PG_REGEX_LOCALE_WIDE_L: ! max_chr = (pg_wchar) 0x7FF; break; case PG_REGEX_LOCALE_1BYTE: case PG_REGEX_LOCALE_1BYTE_L: max_chr = (pg_wchar) UCHAR_MAX; break; default: max_chr = 0; /* can't get here, but keep compiler quiet */ --- 770,812 ---- pcc->cv.ranges = (chr *) malloc(pcc->cv.rangespace * sizeof(chr) * 2); if (pcc->cv.chrs == NULL || pcc->cv.ranges == NULL) goto out_of_memory; + pcc->cv.cclasscode = cclasscode; /* ! * Decide how many character codes we ought to look through. In general ! * we don't go past MAX_SIMPLE_CHR; chr codes above that are handled at ! * runtime using the "high colormap" mechanism. However, in C locale ! * there's no need to go further than 127, and if we only have a 1-byte ! * API there's no need to go further than that can handle. ! * ! * If it's not MAX_SIMPLE_CHR that's constraining the search, mark the ! * output cvec as not having any locale-dependent behavior, since there ! * will be no need to do any run-time locale checks. (The #if's here ! * would always be true for production values of MAX_SIMPLE_CHR, but it's ! * useful to allow it to be small for testing purposes.) */ switch (pg_regex_strategy) { case PG_REGEX_LOCALE_C: + #if MAX_SIMPLE_CHR >= 127 max_chr = (pg_wchar) 127; + pcc->cv.cclasscode = -1; + #else + max_chr = (pg_wchar) MAX_SIMPLE_CHR; + #endif break; case PG_REGEX_LOCALE_WIDE: case PG_REGEX_LOCALE_WIDE_L: ! max_chr = (pg_wchar) MAX_SIMPLE_CHR; break; case PG_REGEX_LOCALE_1BYTE: case PG_REGEX_LOCALE_1BYTE_L: + #if MAX_SIMPLE_CHR >= UCHAR_MAX max_chr = (pg_wchar) UCHAR_MAX; + pcc->cv.cclasscode = -1; + #else + max_chr = (pg_wchar) MAX_SIMPLE_CHR; + #endif break; default: max_chr = 0; /* can't get here, but keep compiler quiet */ diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c index b211cc0..ed95474 100644 *** a/src/backend/regex/regcomp.c --- b/src/backend/regex/regcomp.c *************** static void cbracket(struct vars *, stru *** 55,61 **** static void brackpart(struct vars *, struct state *, struct state *); static const chr *scanplain(struct vars *); static void onechr(struct vars *, chr, struct state *, struct state *); - static void dovec(struct vars *, struct cvec *, struct state *, struct state *); static void wordchrs(struct vars *); static void processlacon(struct vars *, struct state *, struct state *, int, struct state *, struct state *); --- 55,60 ---- *************** static chr chrnamed(struct vars *, const *** 96,111 **** /* === regc_color.c === */ static void initcm(struct vars *, struct colormap *); static void freecm(struct colormap *); - static void cmtreefree(struct colormap *, union tree *, int); - static color setcolor(struct colormap *, chr, color); static color maxcolor(struct colormap *); static color newcolor(struct colormap *); static void freecolor(struct colormap *, color); static color pseudocolor(struct colormap *); ! static color subcolor(struct colormap *, chr c); static color newsub(struct colormap *, color); ! static void subrange(struct vars *, chr, chr, struct state *, struct state *); ! static void subblock(struct vars *, chr, struct state *, struct state *); static void okcolors(struct nfa *, struct colormap *); static void colorchain(struct colormap *, struct arc *); static void uncolorchain(struct colormap *, struct arc *); --- 95,113 ---- /* === regc_color.c === */ static void initcm(struct vars *, struct colormap *); static void freecm(struct colormap *); static color maxcolor(struct colormap *); static color newcolor(struct colormap *); static void freecolor(struct colormap *, color); static color pseudocolor(struct colormap *); ! static color subcolor(struct colormap *, chr); ! static color subcolorhi(struct colormap *, color *); static color newsub(struct colormap *, color); ! static int newhicolorrow(struct colormap *, int); ! static void newhicolorcols(struct colormap *); ! static void subcolorcvec(struct vars *, struct cvec *, struct state *, struct state *); ! static void subcoloronechr(struct vars *, chr, struct state *, struct state *, color *); ! static void subcoloronerange(struct vars *, chr, chr, struct state *, struct state *, color *); ! static void subcoloronerow(struct vars *, int, struct state *, struct state *, color *); static void okcolors(struct nfa *, struct colormap *); static void colorchain(struct colormap *, struct arc *); static void uncolorchain(struct colormap *, struct arc *); *************** static void colorcomplement(struct nfa * *** 114,120 **** #ifdef REG_DEBUG static void dumpcolors(struct colormap *, FILE *); - static void fillcheck(struct colormap *, union tree *, int, FILE *); static void dumpchr(chr, FILE *); #endif /* === regc_nfa.c === */ --- 116,121 ---- *************** static struct cvec *range(struct vars *, *** 215,220 **** --- 216,222 ---- static int before(chr, chr); static struct cvec *eclass(struct vars *, chr, int); static struct cvec *cclass(struct vars *, const chr *, const chr *, int); + static int cclass_column_index(struct colormap *, chr); static struct cvec *allcases(struct vars *, chr); static int cmp(const chr *, const chr *, size_t); static int casecmp(const chr *, const chr *, size_t); *************** brackpart(struct vars * v, *** 1467,1473 **** NOERR(); cv = eclass(v, startc, (v->cflags & REG_ICASE)); NOERR(); ! dovec(v, cv, lp, rp); return; break; case CCLASS: --- 1469,1475 ---- NOERR(); cv = eclass(v, startc, (v->cflags & REG_ICASE)); NOERR(); ! subcolorcvec(v, cv, lp, rp); return; break; case CCLASS: *************** brackpart(struct vars * v, *** 1477,1483 **** NOERR(); cv = cclass(v, startp, endp, (v->cflags & REG_ICASE)); NOERR(); ! dovec(v, cv, lp, rp); return; break; default: --- 1479,1485 ---- NOERR(); cv = cclass(v, startp, endp, (v->cflags & REG_ICASE)); NOERR(); ! subcolorcvec(v, cv, lp, rp); return; break; default: *************** brackpart(struct vars * v, *** 1523,1529 **** NOTE(REG_UUNPORT); cv = range(v, startc, endc, (v->cflags & REG_ICASE)); NOERR(); ! dovec(v, cv, lp, rp); } /* --- 1525,1531 ---- NOTE(REG_UUNPORT); cv = range(v, startc, endc, (v->cflags & REG_ICASE)); NOERR(); ! subcolorcvec(v, cv, lp, rp); } /* *************** onechr(struct vars * v, *** 1565,1610 **** { if (!(v->cflags & REG_ICASE)) { ! newarc(v->nfa, PLAIN, subcolor(v->cm, c), lp, rp); return; } /* rats, need general case anyway... */ ! dovec(v, allcases(v, c), lp, rp); ! } ! ! /* ! * dovec - fill in arcs for each element of a cvec ! */ ! static void ! dovec(struct vars * v, ! struct cvec * cv, ! struct state * lp, ! struct state * rp) ! { ! chr ch, ! from, ! to; ! const chr *p; ! int i; ! ! /* ordinary characters */ ! for (p = cv->chrs, i = cv->nchrs; i > 0; p++, i--) ! { ! ch = *p; ! newarc(v->nfa, PLAIN, subcolor(v->cm, ch), lp, rp); ! NOERR(); ! } ! ! /* and the ranges */ ! for (p = cv->ranges, i = cv->nranges; i > 0; p += 2, i--) ! { ! from = *p; ! to = *(p + 1); ! if (from <= to) ! subrange(v, from, to, lp, rp); ! NOERR(); ! } } /* --- 1567,1580 ---- { if (!(v->cflags & REG_ICASE)) { ! color lastsubcolor = COLORLESS; ! ! subcoloronechr(v, c, lp, rp, &lastsubcolor); return; } /* rats, need general case anyway... */ ! subcolorcvec(v, allcases(v, c), lp, rp); } /* diff --git a/src/backend/regex/regexport.c b/src/backend/regex/regexport.c index 93da822..a515949 100644 *** a/src/backend/regex/regexport.c --- b/src/backend/regex/regexport.c *************** *** 28,37 **** #include "regex/regexport.h" - static void scancolormap(struct colormap * cm, int co, - union tree * t, int level, chr partial, - pg_wchar **chars, int *chars_len); - /* * Get total number of NFA states. --- 28,33 ---- *************** pg_reg_colorisend(const regex_t *regex, *** 187,196 **** * * Note: we return -1 if the color number is invalid, or if it is a special * color (WHITE or a pseudocolor), or if the number of members is uncertain. ! * The latter case cannot arise right now but is specified to allow for future ! * improvements (see musings about run-time handling of higher character codes ! * in regex/README). Callers should not try to extract the members if -1 is ! * returned. */ int pg_reg_getnumcharacters(const regex_t *regex, int co) --- 183,189 ---- * * Note: we return -1 if the color number is invalid, or if it is a special * color (WHITE or a pseudocolor), or if the number of members is uncertain. ! * Callers should not try to extract the members if -1 is returned. */ int pg_reg_getnumcharacters(const regex_t *regex, int co) *************** pg_reg_getnumcharacters(const regex_t *r *** 205,211 **** if (cm->cd[co].flags & PSEUDO) /* also pseudocolors (BOS etc) */ return -1; ! return cm->cd[co].nchrs; } /* --- 198,215 ---- if (cm->cd[co].flags & PSEUDO) /* also pseudocolors (BOS etc) */ return -1; ! /* ! * If the color appears anywhere in the high colormap, treat its number of ! * members as uncertain. In principle we could determine all the specific ! * chrs corresponding to each such entry, but it would be expensive ! * (particularly if character class tests are required) and it doesn't ! * seem worth it. ! */ ! if (cm->cd[co].nuchrs != 0) ! return -1; ! ! /* OK, return the known number of member chrs */ ! return cm->cd[co].nschrs; } /* *************** pg_reg_getcharacters(const regex_t *rege *** 222,227 **** --- 226,232 ---- pg_wchar *chars, int chars_len) { struct colormap *cm; + chr c; assert(regex != NULL && regex->re_magic == REMAGIC); cm = &((struct guts *) regex->re_guts)->cmap; *************** pg_reg_getcharacters(const regex_t *rege *** 231,292 **** if (cm->cd[co].flags & PSEUDO) return; ! /* Recursively search the colormap tree */ ! scancolormap(cm, co, cm->tree, 0, 0, &chars, &chars_len); ! } ! ! /* ! * Recursively scan the colormap tree to find chrs belonging to color "co". ! * See regex/README for info about the tree structure. ! * ! * t: tree block to scan ! * level: level (from 0) of t ! * partial: partial chr code for chrs within t ! * chars, chars_len: output area ! */ ! static void ! scancolormap(struct colormap * cm, int co, ! union tree * t, int level, chr partial, ! pg_wchar **chars, int *chars_len) ! { ! int i; ! ! if (level < NBYTS - 1) ! { ! /* non-leaf node */ ! for (i = 0; i < BYTTAB; i++) ! { ! /* ! * We do not support search for chrs of color 0 (WHITE), so ! * all-white subtrees need not be searched. These can be ! * recognized because they are represented by the fill blocks in ! * the colormap struct. This typically allows us to avoid ! * scanning large regions of higher-numbered chrs. ! */ ! if (t->tptr[i] == &cm->tree[level + 1]) ! continue; ! ! /* Recursively scan next level down */ ! scancolormap(cm, co, ! t->tptr[i], level + 1, ! (partial | (chr) i) << BYTBITS, ! chars, chars_len); ! } ! } ! else { ! /* leaf node */ ! for (i = 0; i < BYTTAB; i++) { ! if (t->tcolor[i] == co) ! { ! if (*chars_len > 0) ! { ! **chars = partial | (chr) i; ! (*chars)++; ! (*chars_len)--; ! } ! } } } } --- 236,252 ---- if (cm->cd[co].flags & PSEUDO) return; ! /* ! * We need only examine the low character map; there should not be any ! * matching entries in the high map. ! */ ! for (c = CHR_MIN; c <= MAX_SIMPLE_CHR; c++) { ! if (cm->locolormap[c - CHR_MIN] == co) { ! *chars++ = c; ! if (--chars_len == 0) ! break; } } } diff --git a/src/backend/regex/regprefix.c b/src/backend/regex/regprefix.c index 04b6449..f338715 100644 *** a/src/backend/regex/regprefix.c --- b/src/backend/regex/regprefix.c *************** findprefix(struct cnfa * cnfa, *** 194,200 **** if (thiscolor == COLORLESS) break; /* The color must be a singleton */ ! if (cm->cd[thiscolor].nchrs != 1) break; /* --- 194,203 ---- if (thiscolor == COLORLESS) break; /* The color must be a singleton */ ! if (cm->cd[thiscolor].nschrs != 1) ! break; ! /* Must not have any high-color-map entries */ ! if (cm->cd[thiscolor].nuchrs != 0) break; /* diff --git a/src/include/regex/regcustom.h b/src/include/regex/regcustom.h index 459851a..04a1893 100644 *** a/src/include/regex/regcustom.h --- b/src/include/regex/regcustom.h *************** typedef unsigned uchr; /* unsigned typ *** 77,82 **** --- 77,92 ---- */ #define CHR_IS_IN_RANGE(c) ((c) <= CHR_MAX) + /* + * MAX_SIMPLE_CHR is the cutoff between "simple" and "complicated" processing + * in the color map logic. It should usually be chosen high enough to ensure + * that all common characters are <= MAX_SIMPLE_CHR. However, very large + * values will be counterproductive since they cause more regex setup time. + * Also, small values can be helpful for testing the high-color-map logic + * with plain old ASCII input. + */ + #define MAX_SIMPLE_CHR 0x7FF /* suitable value for Unicode */ + /* functions operating on chr */ #define iscalnum(x) pg_wc_isalnum(x) #define iscalpha(x) pg_wc_isalpha(x) diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h index 2f89dc9..cc73db2 100644 *** a/src/include/regex/regex.h --- b/src/include/regex/regex.h *************** extern int pg_regexec(regex_t *, const p *** 172,177 **** extern int pg_regprefix(regex_t *, pg_wchar **, size_t *); extern void pg_regfree(regex_t *); extern size_t pg_regerror(int, const regex_t *, char *, size_t); - extern void pg_set_regex_collation(Oid collation); #endif /* _REGEX_H_ */ --- 172,176 ---- diff --git a/src/include/regex/regguts.h b/src/include/regex/regguts.h index b0aa641..69816f1 100644 *** a/src/include/regex/regguts.h --- b/src/include/regex/regguts.h *************** *** 127,149 **** #define ISBSET(uv, sn) ((uv)[(sn)/UBITS] & ((unsigned)1 << ((sn)%UBITS))) - - /* - * We dissect a chr into byts for colormap table indexing. Here we define - * a byt, which will be the same as a byte on most machines... The exact - * size of a byt is not critical, but about 8 bits is good, and extraction - * of 8-bit chunks is sometimes especially fast. - */ - #ifndef BYTBITS - #define BYTBITS 8 /* bits in a byt */ - #endif - #define BYTTAB (1<flags & FREECOL) - union tree *block; /* block of solid color, if any */ }; /* * The color map itself * ! * Much of the data in the colormap struct is only used at compile time. ! * However, the bulk of the space usage is in the "tree" structure, so it's ! * not clear that there's much point in converting the rest to a more compact ! * form when compilation is finished. */ struct colormap { int magic; --- 153,208 ---- */ struct colordesc { ! int nschrs; /* number of simple chars of this color */ ! int nuchrs; /* number of upper map entries of this color */ color sub; /* open subcolor, if any; or free-chain ptr */ #define NOSUB COLORLESS /* value of "sub" when no open subcolor */ struct arc *arcs; /* chain of all arcs of this color */ ! chr firstchr; /* simple char first assigned to this color */ int flags; /* bit values defined next */ #define FREECOL 01 /* currently free */ #define PSEUDO 02 /* pseudocolor, no real chars */ #define UNUSEDCOLOR(cd) ((cd)->flags & FREECOL) }; /* * The color map itself * ! * This struct holds both data used only at compile time, and the chr to ! * color mapping information, used at both compile and run time. The latter ! * is the bulk of the space, so it's not really worth separating out the ! * compile-only portion. ! * ! * Ideally, the mapping data would just be an array of colors indexed by ! * chr codes; but for large character sets that's impractical. Fortunately, ! * common characters have smaller codes, so we can use a simple array for chr ! * codes up to MAX_SIMPLE_CHR, and do something more complex for codes above ! * that, without much loss of performance. The "something more complex" is a ! * 2-D array of color entries, where row indexes correspond to individual chrs ! * or chr ranges that have been mentioned in the regex (with row zero ! * representing all other chrs), and column indexes correspond to different ! * sets of locale-dependent character classes such as "isalpha". The ! * classbits[k] entry is zero if we do not care about the k'th character class ! * in this regex, and otherwise it is the bit to be OR'd into the column index ! * if the character in question is a member of that class. We find the color ! * of a high-valued chr by identifying which colormaprange it is in to get ! * the row index (use row zero if it's in none of them), identifying which of ! * the interesting cclasses it's in to get the column index, and then indexing ! * into the 2-D hicolormap array. ! * ! * The colormapranges are required to be nonempty, nonoverlapping, and to ! * appear in increasing chr-value order. */ + + #define NUM_CCLASSES 13 /* must match data in regc_locale.c */ + + typedef struct colormaprange + { + chr cmin; /* range represents cmin..cmax inclusive */ + chr cmax; + int rownum; /* row index in hicolormap array (>= 1) */ + } colormaprange; + struct colormap { int magic; *************** struct colormap *** 233,259 **** color free; /* beginning of free chain (if non-0) */ struct colordesc *cd; /* pointer to array of colordescs */ #define CDEND(cm) (&(cm)->cd[(cm)->max + 1]) /* If we need up to NINLINECDS, we store them here to save a malloc */ ! #define NINLINECDS ((size_t)10) struct colordesc cdspace[NINLINECDS]; - union tree tree[NBYTS]; /* tree top, plus lower-level fill blocks */ }; ! /* optimization magic to do fast chr->color mapping */ ! #define B0(c) ((c) & BYTMASK) ! #define B1(c) (((c)>>BYTBITS) & BYTMASK) ! #define B2(c) (((c)>>(2*BYTBITS)) & BYTMASK) ! #define B3(c) (((c)>>(3*BYTBITS)) & BYTMASK) ! #if NBYTS == 1 ! #define GETCOLOR(cm, c) ((cm)->tree->tcolor[B0(c)]) ! #endif ! /* beware, for NBYTS>1, GETCOLOR() is unsafe -- 2nd arg used repeatedly */ ! #if NBYTS == 2 ! #define GETCOLOR(cm, c) ((cm)->tree->tptr[B1(c)]->tcolor[B0(c)]) ! #endif ! #if NBYTS == 4 ! #define GETCOLOR(cm, c) ((cm)->tree->tptr[B3(c)]->tptr[B2(c)]->tptr[B1(c)]->tcolor[B0(c)]) ! #endif /* --- 213,239 ---- color free; /* beginning of free chain (if non-0) */ struct colordesc *cd; /* pointer to array of colordescs */ #define CDEND(cm) (&(cm)->cd[(cm)->max + 1]) + + /* mapping data for chrs <= MAX_SIMPLE_CHR: */ + color *locolormap; /* simple array indexed by chr code */ + + /* mapping data for chrs > MAX_SIMPLE_CHR: */ + int classbits[NUM_CCLASSES]; /* see comment above */ + int numcmranges; /* number of colormapranges */ + colormaprange *cmranges; /* ranges of high chrs */ + color *hicolormap; /* 2-D array of color entries */ + int maxarrayrows; /* number of array rows allocated */ + int hiarrayrows; /* number of array rows in use */ + int hiarraycols; /* number of array columns (2^N) */ + /* If we need up to NINLINECDS, we store them here to save a malloc */ ! #define NINLINECDS ((size_t) 10) struct colordesc cdspace[NINLINECDS]; }; ! /* fetch color for chr; beware of multiple evaluation of c argument */ ! #define GETCOLOR(cm, c) \ ! ((c) <= MAX_SIMPLE_CHR ? (cm)->locolormap[(c) - CHR_MIN] : pg_reg_getcolor(cm, c)) /* *************** struct colormap *** 264,269 **** --- 244,254 ---- * Representation of a set of characters. chrs[] represents individual * code points, ranges[] represents ranges in the form min..max inclusive. * + * If the cvec represents a locale-specific character class, eg [[:alpha:]], + * then the chrs[] and ranges[] arrays contain only members of that class + * up to MAX_SIMPLE_CHR (inclusive). cclasscode is set to regc_locale.c's + * code for the class, rather than being -1 as it is in an ordinary cvec. + * * Note that in cvecs gotten from newcvec() and intended to be freed by * freecvec(), both arrays of chrs are after the end of the struct, not * separately malloc'd; so chrspace and rangespace are effectively immutable. *************** struct cvec *** 276,281 **** --- 261,267 ---- int nranges; /* number of ranges (chr pairs) */ int rangespace; /* number of ranges allocated in ranges[] */ chr *ranges; /* pointer to vector of chr pairs */ + int cclasscode; /* value of "enum classes", or -1 */ }; *************** struct guts *** 489,491 **** --- 475,482 ---- int nlacons; /* size of lacons[]; note that only slots * numbered 1 .. nlacons-1 are used */ }; + + + /* prototypes for functions that are exported from regcomp.c to regexec.c */ + extern void pg_set_regex_collation(Oid collation); + extern color pg_reg_getcolor(struct colormap * cm, chr c);