Francois Vogel
03.01.2011 - 08:30

Consistency of the definition of a word

Hi experts,

In my Tcl application, on the one hand I have defined tcl_wordchars to be:

set tcl_wordchars {[\w%#!?$]}

because my "words" can be constituted by the above character class and
I need the double click to select entire "words" in a text widget.

On the other hand, I'm using regular expressions to match certain
patterns in the text, for instance floating point numbers are matched
by the following pattern:

set floatingpointnumberREpat_rep {((\.\d+)|(\m\d+(\.\d*)?))([deDE][+\-]?\d{1,3})?\M}

In this regexp I'm using \m and \M as constraints to match at the
beginning and end of a "word". The problem is that "word" in this
regexp context has a different meaning than above. From the Tcl
re_syntax man page: "A word is defined as a sequence of word
characters that is neither preceded nor followed by word characters. A
word character is an alnum character or an underscore (“_”)."

Therefore, a word is (or at least can be) a subtly different thing in
different areas of Tcl. Any thoughts about consistency? Don't you
think \m and \M should match at the beginning/end of words as defined
by $tcl_wordchars?
Can we call this a bug? Or a feature request?

Last question, I will have to modify my regexp above so that it still
matches floating point numbers while not matching in my "words", such
as abc#75 (currently, "75" is matched by the above floating point
pattern, while it should not). Advice appreciated here too, thanks.

Francois

Alexandre Ferrieux
03.01.2011 - 11:56
On 3 jan, 08:30, Francois Vogel <fsvogelnew5NOS...@free.fr> wrote:
> Hi experts,
>
> In my Tcl application, on the one hand I have defined tcl_wordchars to be:
>
> set tcl_wordchars {[\w%#!?$]}
>
> because my "words" can be constituted by the above character class and
> I need the double click to select entire "words" in a text widget.
>
> On the other hand, I'm using regular expressions to match certain
> patterns in the text, for instance floating point numbers are matched
> by the following pattern:
>
> set floatingpointnumberREpat_rep {((\.\d+)|(\m\d+(\.\d*)?))([deDE][+\-]?\d{1,3})?\M}
>
> In this regexp I'm using \m and \M as constraints to match at the
> beginning and end of a "word". The problem is that "word" in this
> regexp context has a different meaning than above. From the Tcl
> re_syntax man page: "A word is defined as a sequence of word
> characters that is neither preceded nor followed by word characters. A
> word character is an alnum character or an underscore (“_”)."
>
> Therefore, a word is (or at least can be) a subtly different thing in
> different areas of Tcl. Any thoughts about consistency? Don't you
> think \m and \M should match at the beginning/end of words as defined
> by $tcl_wordchars?
> Can we call this a bug? Or a feature request?

From the beginning, tcl_wordchars has been a Tk (and library/word.tcl)
thing. It does not touch the screaming metal of the RE engine:
re_syntax.n does not mention it.
So it's not a bug. And I'd argue that what you're asking for is a
questionable feature, essentially making all regexps in the system
suddenly dependent on the dynamic value of a variable. Doing so would
make any [regexp/regsub] hidden in a library call completely
unpredictable.
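
A quick way to check this in a plain tclsh (only a sketch; the sample string is made up for illustration): the \m/\M matches come out the same before and after tcl_wordchars and tcl_nonwordchars are set.

```tcl
# Sample input (hypothetical): "75" sits inside the application-level word abc#75.
set s "abc#75 = 65;"

# \m and \M use the RE engine's fixed notion of a word (alnum or underscore).
set before [regexp -all -inline {\m\d+\M} $s]

# Setting the word.tcl variables does not reach the RE engine.
set ::tcl_wordchars    {[\w%#!?$]}
set ::tcl_nonwordchars {[^\w%#!?$]}
set after [regexp -all -inline {\m\d+\M} $s]

puts $before   ;# 75 65
puts $after    ;# 75 65
```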

> Last question, I will have to modify my regexp above so that it still
> matches floating point numbers while not matching in my "words", such
> as abc#75 (currently, "75" is matched by the above floating point
> pattern, while it should not). Advice appreciated here too, thanks.

Why not directly use your handcrafted definition of a word *both* in
$tcl_wordchars (for Tk), and in your RE for floats?

Are you aware of RE lookahead, (?=RE)?

-Alex

Francois Vogel
03.01.2011 - 22:04
Alexandre Ferrieux said on 03/01/2011 11:56:

> it's not a bug. And I'd argue that what you're asking for is a
> questionable feature, essentially making all regexps in the system
> suddenly dependent on the dynamic value of a variable. Doing so would
> make any [regexp/regsub] hidden in a library call completely
> unpredictable.

I understand your point and it makes sense indeed.
I can't however refrain from thinking that consistency of what is
called a "word" throughout Tcl/Tk is questionable as well. But OK,
let's live with this subtlety.

>> modify my regexp above so that it still
>> matches floating point numbers while not matching in my "words"
>
> Why not directly use your handcrafted definition of a word *both* in
> $tcl_wordchars (for Tk), and in your RE for floats?

Yes that's the lead I had in mind too.

> Are you aware of RE lookahead, (?=RE)?

Yes, I am. I will try this and come back if I can't sort this out.
First thing is I can imagine how to replace \M with a negative
lookahead "(?!$tcl_wordchars)", but I have not yet figured out how to
replace \m without a lookbehind (which Tcl does not implement).
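
As a small sketch of that first step (the sample line is taken from the kind of input discussed in this thread, and the character class is written out literally rather than interpolated from $tcl_wordchars), compare \M with the negative lookahead:

```tcl
# "145" is followed by "#", which belongs to the application's word class.
set s "za?145# = 77+7;"

# With \M, the RE engine's own word definition lets 145 through.
set withM [regexp -all -inline {\m\d+\M} $s]

# With a negative lookahead on the custom class, 145 is rejected
# because the next character (#) is one of "my" word characters.
set withLook [regexp -all -inline {\m\d+(?![\w%#!?$])} $s]

puts $withM      ;# 145 77 7
puts $withLook   ;# 77 7
```

The leading \m is left untouched here, so a number glued to the right of a custom word character (as in abc#75) would still be matched; that is the part that needs a lookbehind substitute.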

Thanks for having answered.
Francois


Alexandre Ferrieux
03.01.2011 - 22:42
On Jan 3, 10:04 pm, Francois Vogel <fsvogelnew5NOS...@free.fr> wrote:
> Alexandre Ferrieux said on 03/01/2011 11:56:
>
>> it's not a bug. And I'd argue that what you're asking for is a
>> questionable feature, essentially making all regexps in the system
>> suddenly dependent on the dynamic value of a variable. Doing so would
>> make any [regexp/regsub] hidden in a library call completely
>> unpredictable.
>
> I understand your point and it makes sense indeed.
> I can't however refrain from thinking that consistency of what is
> called a "word" throughout Tcl/Tk is questionable as well. But OK,
> let's live with this subtlety.
>
>>> modify my regexp above so that it still
>>> matches floating point numbers while not matching in my "words"
>
>> Why not directly use your handcrafted definition of a word *both* in
>> $tcl_wordchars (for Tk), and in your RE for floats?
>
> Yes that's the lead I had in mind too.
>
>> Are you aware of RE lookahead, (?=RE)?
>
> Yes, I am. I will try this and come back if I can't sort this out.
> First thing is I can imagine how to replace \M with a negative
> lookahead "(?!$tcl_wordchars)",

Or equivalently, a positive lookahead containing the negated character
range. Dunno which is fastest; you may want to try ;-)

> but I have not yet figured out how to
> replace \m without a lookbehind (which Tcl does not implement).

A lookbehind can be emulated by a simple match (in regexp), or a
captured match repeated in the output with \1 (in regsub):

regsub -all {([^0-9])([0-9]+)(?=[^0-9])} $s {\1haha\2hehe} s
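
To make the effect concrete, here is that regsub run on a made-up input (the string is only an illustration): the character in front of each number is consumed by the match but written back through \1, while the character after it is only looked at.

```tcl
# Hypothetical input with two numbers, each surrounded by non-digits.
set s "x12y z345w"

# ([^0-9]) plays the role of a lookbehind: it is matched, then restored via \1.
# (?=[^0-9]) is a real lookahead and consumes nothing.
regsub -all {([^0-9])([0-9]+)(?=[^0-9])} $s {\1haha\2hehe} s

puts $s   ;# xhaha12hehey zhaha345hehew
```

A number at the very start or very end of the string would not be matched by this particular pattern, since it requires a non-digit on both sides.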

-Alex


Francois Vogel
04.01.2011 - 21:39
Alexandre Ferrieux said on 03/01/2011 22:42:

>> First thing is I can imagine how to replace \M with a negative
>> lookahead "(?!$tcl_wordchars)",
>
> Or equivalently, a positive lookahead containing the negated character
> range. Dunno which is fastest; you may want to try ;-)

This triggers another (side) question. Why does Tk define *both*
tcl_wordchars and tcl_nonwordchars, while the second one should be the
negation of the first one (any other case, perhaps?)

For instance on Windows, $tcl_wordchars defaults to \S, i.e.
[^[:space:]], and $tcl_nonwordchars defaults to \s, i.e. [[:space:]]. On
Linux the same applies: $tcl_nonwordchars is the negation of
$tcl_wordchars.

So what is the point of having both?


>> but I have not yet figured out how to
>> replace \m without a lookbehind (which Tcl does not implement).
>
> A lookbehind can be emulated by a simple match (in regexp), or a
> captured match repeated in the output with \1 (in regsub):

My use case is a regexp. The thing is I don't want the emulated
lookbehind to be part of the match, and the matched thing shall be in
matchVar, not in subMatchVar (because the code using the floating
point numbers regexp is generic and is used as well for matching other
things, all returning the desired match in the main match variable).

For instance here is a test snippet:

set str {4.6
abc#er = 89;
abc#ed75 = 65;
za?145# = 77+7;
za?145 8;
abc#75 = 65;
#4 = 55;
!utr_d!f?rr34 = 44;
q= 1.1E1;
145}

What I want to match is all numbers which are not buried in one of my
"words". This means in practice the correct matches are the numbers
located at the right of all equal signs (but of course I can't use
this trick).


My initial problem was that:

set pat {\m(?:(?:\.\d+)|(?:\d+(?:\.\d*)?))(?:[deDE][+\-]?\d{1,3})?\M}
set allmatch [regexp -all -inline -- $pat $str]

will also provide the numbers on the left hand side. This is due to \m
and \M word definition differing from mine, this is clear. However,
note that the above correctly provides the first match (4.6) and the
last one (145).


But now, how do I match everything I want and only what I want? The
lookahead replacement for \M is clear, since a lookahead
translates straightforwardly to a constraint:

set pat {\m(?:(?:\.\d+)|(?:\d+(?:\.\d*)?))(?:[deDE][+\-]?\d{1,3})?(?=[^\w%#!?$])}
set allmatch [regexp -all -inline -- $pat $str]

Note however that this is not fully correct: the last match (145) is
now missing! This problem can be solved by replacing the lookahead by
a more complicated one: (?=[^\w%#!?$]|\Z)
OK!


Now, emulating the lookbehind by a regular match does not qualify
directly as a solution for my problem, since it suffers from one main
drawback: It will match one character before the real match I'm
looking for, and this would require additional processing (remove
first char of the match) just for the case of matching floating point
numbers, which I would like to avoid.

Also, to avoid wrong matching if the correct match starts at first
character of $str (see example of 4.6 in $str) I need to replace \m by
(?:[^\w%#!?$]|\A).

So the best proposal I have now is:

set pat {(?:[^\w%#!?$]|\A)(?:(?:\.\d+)|(?:\d+(?:\.\d*)?))(?:[deDE][+\-]?\d{1,3})?(?=[^\w%#!?$])}
set allmatch [regexp -all -inline -- $pat $str]


But this is not what I want: it matches one character before the real
start of the correct match.
(Well, most often. Exception is when the match starts at the beginning
of $str. Headache in perspective to deal with this special case.)
I don't want to post-process the match.

Ideas, anyone?

Thanks,
Francois


Alexandre Ferrieux
05.01.2011 - 14:20
On 4 jan, 21:39, Francois Vogel <fsvogelnew5NOS...@free.fr> wrote:

> This triggers another (side) question. Why does Tk define *both*
> tcl_wordchars and tcl_nonwordchars, while the second one should be the
> negation of the first one (any other case, perhaps?)

Dunno, maybe mere convenience, and the possibility of more readable
shorthands like \s vs. \S and \w vs. \W.

>> A lookbehind can be emulated by a simple match (in regexp), or a
>> captured match repeated in the output with \1 (in regsub):
>
> My use case is a regexp. The thing is I don't want the emulated
> lookbehind to be part of the match, and the matched thing shall be in
> matchVar, not in subMatchVar (because the code using the floating
> point numbers regexp is generic and is used as well for matching other
> things, all returning the desired match in the main match variable).

Then you're putting too many constraints. The way you organize your
vars is your problem; very little post-processing is needed to adapt
your internal APIs from using matchVar to using subMatchVar:

# NOTWORD is a placeholder for your negated word class
set l {}
foreach {full num tail} [regexp -all -inline {(?:^|NOTWORD)(\d+)(NOTWORD)} $x] {lappend l $num}

Afraid your only other option is to propose a new lookbehind feature
in the RE engine. Be prepared to argue in the TIP process, because
adding something nontrivial to something complex just to avoid a few
lines of script requires ... persuasion.

> (Well, most often. Exception is when the match starts at the beginning
> of $str. Headache in perspective to deal with this special case.)
> I don't want to post-process the match.

See above how a simple ^|... helps ;-)

-Alex

Francois Vogel
06.01.2011 - 13:21
Alexandre Ferrieux said on 05/01/2011 14:20:
> very little post-processing is needed to adapt
> your internal APIs from using matchVar to using subMatchVar

That's right, of course.

Nevertheless this definitely makes the matching procedure for floats
specific to matching just floating point numbers, whereas it's
currently a generic matching procedure used to match several other
patterns passed as an argument.

Disappointing.

Unless I modify all other regexp patterns as well such that they
return the interesting thing in subMatchVar. Well, not something I
want to do either.

> Afraid your only other option is to propose a new lookbehind feature
> in the RE engine. Be prepared to argue in the TIP process, because
> adding something nontrivial to something complex just to avoid a few
> lines of script requires ... persuasion.

Did I ever suggest this is my intention? I don't think so. I've been
watching the tcl-core list for years, and have seen the discussions
that happen in such cases.

Francois


Alexandre Ferrieux
06.01.2011 - 18:34
On 6 jan, 13:21, Francois Vogel <fsvogelnew5NOS...@free.fr> wrote:
> Alexandre Ferrieux said on 05/01/2011 14:20:
>
>> very little post-processing in needed to adapt
>> your internal APIs from using matchVar to using subMatchVar
>
> That's right, of course.
>
> Nevertheless this definitely makes the matching procedure for floats
> specific to matching just floating point numbers, whereas it's
> currently a generic matching procedure used to match several other
> patterns passed as an argument.

No, you're missing the point: decide that in all cases your API uses
submatch 1 instead of fullmatch.
This means an extra pair of parentheses for other regexps, but that
buys you genericity.

{(normal RE without lookbehind)}

{lookbehind(RE needing lookbehind)}

And, if the '()' characters are absent from your patterns, you might
even detect their absence in

{normal RE without lookbehind}

and hence preserve your current REs.
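
A minimal sketch of the "always use submatch 1" convention, applied to the test snippet posted earlier in this thread. The proc name matchSub1 and the variables floatPat and identPat are made up for illustration, and the sketch assumes every pattern passed in contains exactly one capturing group (all other groups are written as (?:...)):

```tcl
# Hypothetical generic helper: returns submatch 1 of every match.
# Assumes each pattern has exactly one capturing group.
proc matchSub1 {pat str} {
    set out {}
    foreach {full sub} [regexp -all -inline -- $pat $str] {
        lappend out $sub
    }
    return $out
}

# Float pattern: (?:^|NOTWORD) emulates the lookbehind, the single capturing
# group holds the number, and a negative lookahead stands in for \M.
set floatPat {(?:^|[^\w%#!?$])((?:\.\d+|\d+(?:\.\d*)?)(?:[deDE][+\-]?\d{1,3})?)(?![\w%#!?$])}

# A pattern needing no lookbehind just gets one extra pair of parentheses.
set identPat {(\m[a-z]+\M)}

set str {4.6
abc#er = 89;
abc#ed75 = 65;
za?145# = 77+7;
za?145 8;
abc#75 = 65;
#4 = 55;
!utr_d!f?rr34 = 44;
q= 1.1E1;
145}

set floats [matchSub1 $floatPat $str]
set idents [matchSub1 $identPat "foo 12 bar"]
puts $floats
puts $idents
```

The sketch uses ^ rather than \A for the start-of-string case because the regexp man page notes that, when matching resumes from a start index, \A still matches at that index while ^ does not.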

-Alex



