I made a typo in the code I sent.

This
exPattern = @"(\D+[\.]* +\D* *\d+ *,0x *\d+)";
should be
exPattern = @"(\D+[\.]* +\D* *\d+ *, *\d+)";

Even with the correction, I am experiencing the same behavior.

I made the change that you suggested:
Match mt = rx.Match(inText);
and saw it take a long time. This is a little different from the scenario
that I was seeing. In the program I sent, it would make it past the
match in the blink of an eye. The place where it would hang is
accessing the Count property:

mc = rx.Matches(inText);
if (mc != null)
{
    count += mc.Count;   // <=== hangs HERE
}

If I am using the debugger, and step so I am sitting on the
if (mc != null)
line, drag the mc variable to the watch window, and expand it, the
operation will take about 15 seconds. Looking at the fields, several
of the fields show errors:
Count = "error: cannot obtain value"


"Can you tell me what kind of pattern you are going to match so that we can
figure out another pattern for it?"

I am processing documents in multiple languages with dates / times / numerics
in different formats. The date formats I am handling are:
MM/dd/yy
MM/dd/yyyy
ddd, MMMM dd, yyyy
dd-MMM-yyyy
dd MMMM yyyy
MMM. dd, yy
MMMM dd, yyyy
yyyy/MM/dd

Some languages use 2 words to represent a single month. Taking
this into consideration, as well as different separators and numerics that are
sometimes preceded by a zero, I have boiled the above formats down into 4 patterns:

1) exPattern = @"(\d+(?<mark>[-| |/|\.])[^\d|^ ]+ *[^\d|^ ]*\k<mark>\d+)";
2) exPattern = @"(\d+ *(?<mark>[ |/|\.|-]) *\d+ *\k<mark> *\d+)";
3) exPattern = @"(\D+[\.]* +\d+ *, *\d+)";
4) exPattern = @"(\D+[\.]* +\D* *\d+ *, *\d+)";

The first 3 patterns work without any problem. The 4th one is the one that
causes the hang.
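
For reference, here is a minimal sketch of how the three working patterns behave on one sample date each. The sample strings and the FirstMatch helper are assumptions made for illustration; they are not taken from the program discussed in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

static class DatePatternDemo
{
    // Patterns 1-3, copied verbatim from the list above.
    public static readonly string[] Patterns =
    {
        @"(\d+(?<mark>[-| |/|\.])[^\d|^ ]+ *[^\d|^ ]*\k<mark>\d+)",
        @"(\d+ *(?<mark>[ |/|\.|-]) *\d+ *\k<mark> *\d+)",
        @"(\D+[\.]* +\d+ *, *\d+)"
    };

    // Hypothetical helper: text of the first match, or null if there is none.
    public static string FirstMatch(string pattern, string input)
    {
        Match m = Regex.Match(input, pattern);
        return m.Success ? m.Value : null;
    }

    public static void Main()
    {
        // Assumed sample dates, one per pattern (dd-MMM-yyyy, MM/dd/yyyy, MMMM dd, yyyy).
        string[] samples = { "01-Jul-2004", "07/01/2004", "July 01, 2004" };

        for (int i = 0; i < Patterns.Length; i++)
        {
            string found = FirstMatch(Patterns[i], samples[i]) ?? "(no match)";
            Console.WriteLine("{0}) {1}", i + 1, found);
        }
    }
}
```

The \k<mark> backreference is what forces the second separator to be the same character as the first one.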

Thanks,
Dave

RE: MatchCollection.Matches() Hangs when attempting to access data member by v-phuang

v-phuang
Thu Jul 01 20:43:15 CDT 2004

Hi David,

1. Based on my test, the code below runs the match search
immediately, so the call itself takes a long time because of the complicated
regular expression.
Match mt = rx.Match(inText);
But with Matches method, the regex will do the search job until we access
to the matchcollection's properties returned by the Matches method.
That's why It will take long to return the value, because at that time the
search job did not finished, we can not figure out the correct count value.

2. From your description, the \D in your pattern is only meant to match
lowercase and uppercase letters, am I right?
If so, I think you could change your pattern as below.
exPattern = @"([a-zA-Z]+[\.]* +[a-zA-Z]* *\d+ *, *\d+)";
As MSDN says, \D matches much more than the alphabetic set, which
slows the match down.
So I think we do not need \D where [a-zA-Z] will do.

Character Classes
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpgenref/html/cpconcharacterclasses.asp
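
A likely reason for the size of the difference, offered as a sketch and not a confirmed diagnosis: \D also matches spaces, periods and commas, so in pattern 4 the pieces \D+, [\.]*, " +", \D* and " *" can all claim the same characters. When no date follows, the engine retries many ways of dividing one run of text between them. [a-zA-Z] cannot match a space or a period, so the boundaries between the pieces are fixed. The code below times both patterns on a digit-free input. The TimeIsMatch helper, the sample text and the 2-second limit are assumptions for illustration, and the Regex constructor that takes a timeout needs .NET Framework 4.5 or later, which is newer than this thread.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

static class BacktrackingDemo
{
    public const string NonDigitPattern = @"(\D+[\.]* +\D* *\d+ *, *\d+)";
    public const string LetterPattern = @"([a-zA-Z]+[\.]* +[a-zA-Z]* *\d+ *, *\d+)";

    // Hypothetical helper: runs IsMatch under a timeout.
    // Returns true or false for the match result, or null if the timeout fired.
    public static bool? TimeIsMatch(string pattern, string input, TimeSpan timeout, out TimeSpan elapsed)
    {
        Regex rx = new Regex(pattern, RegexOptions.None, timeout);
        Stopwatch sw = Stopwatch.StartNew();
        bool? result;
        try
        {
            result = rx.IsMatch(input);
        }
        catch (RegexMatchTimeoutException)
        {
            result = null;   // gave up instead of hanging
        }
        sw.Stop();
        elapsed = sw.Elapsed;
        return result;
    }

    public static void Main()
    {
        // Assumed input: plain words and spaces, no digits, so neither pattern can match.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) sb.Append("word ");
        string text = sb.ToString();

        foreach (string pattern in new[] { NonDigitPattern, LetterPattern })
        {
            TimeSpan elapsed;
            bool? result = TimeIsMatch(pattern, text, TimeSpan.FromSeconds(2), out elapsed);
            string outcome = result.HasValue ? result.Value.ToString() : "timed out";
            Console.WriteLine("{0}  ->  {1} after {2}", pattern, outcome, elapsed);
        }
    }
}
```

Actual timings depend on the input and the runtime version, so none are quoted here. A timeout at least turns an apparent hang into an exception that the caller can handle.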

Best regards,

Peter Huang
Microsoft Online Partner Support

Get Secure! - www.microsoft.com/security
This posting is provided "AS IS" with no warranties, and confers no rights.


Re: MatchCollection.Matches() Hangs when attempting to access data member by David

David
Fri Jul 02 08:27:26 CDT 2004

Peter,

The proposed solution works, but I would like to close the loop on this.
The solution is a little perplexing and a little disturbing.

These 2 expressions are almost identical.
[a-zA-Z] --- Matches upper/lower characters
\D --- Matches any nondigit

The difference is that \D also takes punctuation into account. I would
expect both of them to produce similar results with regard to hanging.

Given that \D is doing more, I would agree that the time to perform the
operation would be longer. This seems a little out of whack though...

Completion Time hh:mm:ss:fff
============
\D ------ 00:19:20:734
[a-zA-Z] ------ 00:00:00:171


You said,
"But with Matches method, the regex will do the search job until we access
to the matchcollection's properties returned by the Matches method."

I assume you mean "won't do the search until"? This would account for the
immediate return. From a programming standpoint, I would expect the match to
actually be performed at the time of the Matches() call. Doing the search upon
access to a member/method of the collection class seems misleading. I would
expect either the collection to be null if there are no objects, or the
collection to exist with Count set to zero, after the Matches() call.

You said,
"That's why It will take long to return the value, because at that time the
search job did not finished, we can not figure out the correct count value."

I am interpreting your message to mean that the Regex is threaded.
Is this correct?


Thanks for all your help. It is well appreciated.

Cheers,
Dave





Re: MatchCollection.Matches() Hangs when attempting to access data member by v-phuang

v-phuang
Sun Jul 04 22:01:10 CDT 2004

Hi David,

I am sorry for the mistyping. Yes, I meant that the regex will not do the
search job until we access the collection's properties. I agree with you that
this seems somewhat misleading, but I think the design is driven by
performance concerns, because the Matches operation does more work than
Match does. Also, I think that if there is no result we will get a match
collection with a Count property of zero.

Also, the Regex is synchronous: the match operation runs on the same
thread that calls Regex.Matches, and that is why we see a
period of "hang" while the regex is doing the search job.
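
The deferred behavior described above can be sketched as follows. The .NET documentation states that Matches populates its MatchCollection lazily and that reading Count forces the whole search to run, and Matches returns an empty collection, not null, when nothing matches. The FirstValue helper and the sample strings are assumptions for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

static class LazyMatchesDemo
{
    // Hypothetical helper: returns the first match value without forcing the
    // whole collection to be populated, or null when nothing matches.
    public static string FirstValue(Regex rx, string input)
    {
        foreach (Match m in rx.Matches(input))
        {
            return m.Value;   // leave after the first match
        }
        return null;
    }

    public static void Main()
    {
        Regex rx = new Regex(@"\d+");

        // Matches() itself returns quickly; no null check is needed on the result.
        MatchCollection none = rx.Matches("no digits here");
        Console.WriteLine(none != null);

        // Reading Count runs the search to the end of the input, on this thread.
        Console.WriteLine(none.Count);

        // Enumerating with foreach finds matches one at a time instead.
        Console.WriteLine(FirstValue(rx, "10 20 30"));
    }
}
```

If only the first few matches are needed, enumerating with foreach and leaving early avoids paying for the full search that Count triggers.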

As for your concern about the punctuation, I think we can just specify the
ones we want in the [] as we do with the [a-zA-Z].

Best regards,

Peter Huang
Microsoft Online Partner Support
