I am new to both .NET and regex. I have done the basic reading, to the
point where I thought I knew how to use regex to extract a date string, but
I ran into problems.


What is the best regex to look for month names, or date strings for that
matter?

from my testing, I could use
"((JAN)|(FEB)|(MAR)|(APR)|(MAY)|(JUN)|(JUL)|(AUG)|(SEP)|(OCT)|(NOV)|(DEC))"
not
'([ADFJMNOS][ACEOPU][BCGLNPRTVY])"
In other words, I have a syntax problem with the month pattern.

I am working towards handling the various date formats I deal with.
My objective is to get the entire date string and parse it into yyyy-mm-dd
or whatever the .NET conversion routine will take.
I will have to deal with many long strings of 64K to 200K. This is the
reason I am looking for a good regex, to minimize processing delays.

I know I have to deal with
yyyy-mm-dd ( and variants thereof with dot or slash as separator instead
of dash, single digit month or day)
yyyy-MMM-dd ( or just space instead of -)
MMM d, yy ( or yyyy)
and the tougher ones like
d MMM yyyy
d MMM yy

Re: regex syntax by Alvin

Alvin
Sun Aug 14 14:39:46 CDT 2005

have a look at regexlib.com for customized expressions

--
Regards,
Alvin Bruney
[Shameless Author Plug]
The Microsoft Office Web Components Black Book with .NET
available at www.lulu.com/owc, Amazon, B&H etc


Forth-coming VSTO.NET



Re: regex syntax by jg

jg
Thu Aug 18 00:29:23 CDT 2005

Thank you.

However, I had no luck accessing that content. All I got was the green
logos; I did not see anything else.




Re: regex syntax by Oliver

Oliver
Thu Aug 18 04:03:51 CDT 2005

jg wrote:

> I know I have to deal with
> yyyy-mm-dd ( and variants thereof with dot or slash as separator instead
> of dash, single digit month or day)
> yyyy-MMM-dd ( or just space instead of -)
> MMM d, yy ( or yyyy)
> and the tougher ones like
> d MMM yyyy
> d MMM yy

I have created a regex for you that works with all those samples. Here
it is:

(?<year>\d{4})[-\./\s](?<month>\d{1,2})[-\./\s](?<day>\d{1,2})$ |
(?<year>\d{4})[-\s](?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)[-\s](?<day>\d{1,2})$
|
(?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s(?<day>\d{1,2}),\s*?(?<year>\d{4}|\d{2})$
|
(?<day>\d{1,2})\s(?<month>JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC)\s(?<year>\d{4}|\d{2})$

I tried this with the following samples, constructed from the templates
you gave:

2005-03-08
2005.03.08
2005/03/08
2005 03 08
2005 3 08
2005 3 8
2005 03 8
2005-MAR-08
2005 MAR 08
2005 MAR 8
MAR 8, 2005
MAR 08, 2005
MAR 8, 05
MAR 08, 05
8 MAR 2005
8 MAR 05
08 MAR 2005
08 MAR 05

As you can see, the expression consists of four alternatives.
Each of these has a $ sign at the end, which you'll want to remove
before using the expression on your own long strings. The $ is only
needed to test the expression in Regulator with multiple samples.

I tried this with the IgnorePatternWhitespace and IgnoreCase options
switched on.
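
For readers who want to try this on a long input string, here is a minimal C# sketch of the expression above with the trailing $ signs removed. The class and method names (DateScanner, FindDates) are made up for illustration, and RegexOptions.Compiled is an assumption of mine rather than something tested in the thread; whether it pays off for 64K to 200K inputs is worth measuring.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original posts.
public static class DateScanner
{
    private const string Months = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC";

    // The four alternatives from the post, without the trailing $.
    // IgnorePatternWhitespace lets the pattern span several lines;
    // IgnoreCase makes "Mar" and "MAR" equivalent.
    // The regex is built once and reused for every input string.
    private static readonly Regex DatePattern = new Regex(
        @"(?<year>\d{4})[-./\s](?<month>\d{1,2})[-./\s](?<day>\d{1,2})
        | (?<year>\d{4})[-\s](?<month>" + Months + @")[-\s](?<day>\d{1,2})
        | (?<month>" + Months + @")\s(?<day>\d{1,2}),\s*?(?<year>\d{4}|\d{2})
        | (?<day>\d{1,2})\s(?<month>" + Months + @")\s(?<year>\d{4}|\d{2})",
        RegexOptions.IgnoreCase
            | RegexOptions.IgnorePatternWhitespace
            | RegexOptions.Compiled);

    // Returns one { year, month, day } triple per date found in the text.
    // .NET allows the same group name in several alternatives, so the
    // named groups can be read without knowing which alternative matched.
    public static List<string[]> FindDates(string text)
    {
        List<string[]> found = new List<string[]>();
        foreach (Match m in DatePattern.Matches(text))
        {
            found.Add(new[]
            {
                m.Groups["year"].Value,
                m.Groups["month"].Value,
                m.Groups["day"].Value
            });
        }
        return found;
    }
}
```

The captured parts still need to be turned into a date (two-digit years, month names) before being handed to a conversion routine; that step is left out here.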

Hope this helps!

(If you have any trouble with the regex, I could send you the saved
Regulator file. Just in case things get mangled in the message or
something.)


Oliver Sturm
--
omnibus ex nihilo ducendis sufficit unum
Spaces inserted to prevent google email destruction:
MSN oliver @ sturmnet.org Jabber sturm @ amessage.de
ICQ 27142619 http://www.sturmnet.org/blog

Re: regex syntax by jg

jg
Thu Aug 18 12:52:12 CDT 2005

that is absolutely wonderful and helpful. Thank you very much. Your efforts
are well appreciated.
Thank you very much again for testing and explaining.

I will try that out..




Re: regex syntax by jg

jg
Fri Aug 19 00:38:22 CDT 2005

Great, it works even after taking out the $ and the spaces around the |. I
did add \b before the entire expression to make sure the first part of the
date starts on a word boundary. This way I can avoid some (supposedly low
probability) false matches, such as strange catalogue numbers in dot or dash
notation.



Now all I have to do is make it work with January, February, ... (fully
spelled-out month names). I guess I can always add another 12 | parts to the
month expressions.
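
One thing to watch when adding \b: alternation has the lowest precedence in a regex, so \b placed in front of an ungrouped A|B|C|D only applies to A. Wrapping the alternatives in a non-capturing group makes the boundary apply to all of them. The sketch below shows this for two of the formats, together with one possible way to accept full month names by making the rest of each name optional. The names BoundedDatePattern and Dates are made up for illustration, and the trailing \b is my own addition.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original posts.
public static class BoundedDatePattern
{
    // Three-letter abbreviation with an optional remainder,
    // so both "MAR" and "MARCH" are accepted.
    private const string Month =
        "JAN(?:UARY)?|FEB(?:RUARY)?|MAR(?:CH)?|APR(?:IL)?|MAY|JUNE?|JULY?|" +
        "AUG(?:UST)?|SEP(?:TEMBER)?|OCT(?:OBER)?|NOV(?:EMBER)?|DEC(?:EMBER)?";

    // \b(?: ... | ... )\b : the group makes both boundaries apply to
    // every alternative, not just the first or the last one.
    public static readonly Regex Dates = new Regex(
        @"\b(?:" +
            @"(?<day>\d{1,2})\s(?<month>" + Month + @")\s(?<year>\d{4}|\d{2})" +
        "|" +
            @"(?<month>" + Month + @")\s(?<day>\d{1,2}),\s*(?<year>\d{4}|\d{2})" +
        @")\b",
        RegexOptions.IgnoreCase);
}
```

With this pattern, "8 March 2005" matches, while a string such as "x18 MAR 2005" does not, because there is no word boundary before the digits.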




Re: regex syntax by Oliver

Oliver
Fri Aug 19 03:41:04 CDT 2005

jg wrote:

> Great, it works even after taking out the $ and the space around the |.. I
> did add \b before the entire expression to make sure the first part of the
> date is on the word boundary. This way I can avoid some supposedly low
> probability errors like some strange catalogue dot or dash notations

Sure, I didn't know your exact circumstances, so you'd have to make
modifications to my sample to make it work for you completely.

> Now all I have to do is to make it work with January, February,... ( fully
> spelled month names). I guess I can always add another 12 | parts to the
> month expressions

Sure you can. If you find the whole thing growing too much, maybe you
could define the various parts you need (the month expression, the day
expression, the two digit year, the four digit year) as string constants
in your code and use a String.Format to put them together to form the
complete regular expression before you use it. That way it might be a
bit more maintainable - otherwise you'll have to make every change to
one of the parts in many places, increasing the probability of an error.
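
A rough sketch of what that could look like follows. The constant names and the DatePatternParts class are invented for illustration. One detail to keep in mind: String.Format treats { and } in the format string as placeholders, so a regex quantifier written directly in the format string would have to be doubled (for example \d{{4}}). Keeping every quantifier inside the arguments, as below, avoids that.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original posts.
public static class DatePatternParts
{
    private const string MonthName = "JAN|FEB|MAR|APR|MAY|JUN|JUL|AUG|SEP|OCT|NOV|DEC";
    private const string MonthNumber = @"\d{1,2}";
    private const string Day = @"\d{1,2}";
    private const string Year4 = @"\d{4}";
    private const string Year2 = @"\d{2}";

    // {0} month name, {1} month number, {2} day, {3} four-digit year,
    // {4} two-digit year. Braces inside the arguments are inserted as-is;
    // only braces in the format string itself are interpreted.
    public static readonly string Pattern = String.Format(
        @"(?<year>{3})[-./\s](?<month>{1})[-./\s](?<day>{2})" +
        @"|(?<year>{3})[-\s](?<month>{0})[-\s](?<day>{2})" +
        @"|(?<month>{0})\s(?<day>{2}),\s*(?<year>{3}|{4})" +
        @"|(?<day>{2})\s(?<month>{0})\s(?<year>{3}|{4})",
        MonthName, MonthNumber, Day, Year4, Year2);

    public static readonly Regex Dates =
        new Regex(Pattern, RegexOptions.IgnoreCase);
}
```

Changing the month list (for example to allow full month names) then only touches the MonthName constant.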




Re: regex syntax by jg

jg
Tue Aug 23 00:32:09 CDT 2005

thank you again. you are wonderfully helpful.

I did find the pattern string getting too huge. So I started to split date
pattern into 3 components before using them to compose the final pattern,
although I did not use the string format method.





Re: regex syntax by Oliver

Oliver
Tue Aug 23 04:14:11 CDT 2005

jg wrote:

> I did find the pattern string getting too huge. So I started to split date
> pattern into 3 components before using them to compose the final pattern,
> although I did not use the string format method.

Well, if you ask me, you should always use String.Format when putting
together strings from more than two parts. A String.Format call can
create an arbitrarily complicated string in one operation, while a
concatenation a + b + c takes two operations at least. Strings are
immutable in .NET, so a + b + c will end up allocating several new
strings before the final result is ready.

The argument against this is that the compiler might get rid of some of
the overhead for you, at least when a, b and c are static strings. But I
don't like to depend on that, especially when the String.Format call is
usually so much more readable:

"At " + time.ToString() + ", the user " + user + " had a problem
accessing the " + resource + " resource."

String.Format("At {0}, the user {1} had a problem accessing the {2}
resource.", time, user, resource);




Re: regex syntax by jg

jg
Tue Aug 23 23:09:31 CDT 2005

Now I see. pardon my ignorance

Thank you again. much appreciated.




Re: regex syntax by Jon

Jon
Wed Aug 24 01:18:32 CDT 2005

Oliver Sturm <oliver@sturmnet.org> wrote:
> Well, if you ask me, you should always use String.Format when putting
> together strings from more than two parts.

I disagree.

> A String.Format call can
> create an arbitrarily complicated string in one operation, while a
> concatenation a + b + c takes two operations at least.

What do you count as an operation? Bear in mind that String.Format has
to do a lot more work in terms of parsing etc - I very much doubt that
there are many cases where it's more efficient.

> Strings are
> immutable in .NET, so a + b + c will end up allocating several new
> strings before the final result is ready.

That's not true if a, b and c are already strings. a+b+c will simply
result in a call to String.Concat(a, b, c) which creates one string
without creating any intermediate ones. It's not like a+b+c is compiled
into (a+b)+c, evaluating a+b first.

string a = "a";
string b = "b";
string c = "c";

string x = a+b+c;

is compiled into:

IL_0000: ldstr "a"
IL_0005: stloc.0
IL_0006: ldstr "b"
IL_000b: stloc.1
IL_000c: ldstr "c"
IL_0011: stloc.2
IL_0012: ldloc.0
IL_0013: ldloc.1
IL_0014: ldloc.2
IL_0015: call string [mscorlib]System.String::Concat(string, string, string)
IL_001a: stloc.3
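
At the source level, the IL above corresponds to the second method in the small sketch below; both methods return the same string. The class and method names are made up for illustration.

```csharp
using System;

// Hypothetical demo class, not part of the original posts.
public static class ConcatDemo
{
    // What you write.
    public static string WithPlus(string a, string b, string c)
    {
        return a + b + c;
    }

    // What the IL listing shows the compiler emitting for it:
    // a single call that builds the result in one step.
    public static string WithConcat(string a, string b, string c)
    {
        return String.Concat(a, b, c);
    }
}
```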

> The argument against this is that the compiler might get rid of some of
> the overhead for you, at least when a, b and c are static strings. But I
> don't like to depend on that

You can depend on it in C# at least - it's in the specification, IIRC.

> especially when the String.Format call is
> usually so much better readable:
>
> "At " + time.ToString() + ", the user " + user + "had a problem
> accessing the " + resource + "resource."
>
> String.Format("At {0}, the user {1} had a problem accessing the {2}
> resource.", time, user, resource);

Sometimes String.Format is more readable; sometimes it's less readable.
In almost all cases, readability should be the key to determining which
to use.

--
Jon Skeet - <skeet@pobox.com>
http://www.pobox.com/~skeet
If replying to the group, please do not mail me too

Re: regex syntax by Oliver

Oliver
Wed Aug 24 03:46:53 CDT 2005

Jon Skeet [C# MVP] wrote:

>>Well, if you ask me, you should always use String.Format when putting
>>together strings from more than two parts.
>
> I disagree.

I guess I should have qualified my statement better. I might have added
conditions like "and at least one of the parts is not a string in itself".

>>The argument against this is that the compiler might get rid of some of
>>the overhead for you, at least when a, b and c are static strings. But I
>>don't like to depend on that
>
> You can depend on it in C# at least - it's in the specification, IIRC.

I would readily assume it even without reading the specs. I would make a
test if it were in any way important to me. Until then, I wouldn't
depend on it.

>>especially when the String.Format call is
>>usually so much better readable:
>>
>> "At " + time.ToString() + ", the user " + user + "had a problem
>>accessing the " + resource + "resource."
>>
>> String.Format("At {0}, the user {1} had a problem accessing the {2}
>>resource.", time, user, resource);
>
>
> Sometimes String.Format is more readable; sometimes it's less readable.
> In almost all cases, readability should be the key to determining which
> to use.

Right, that was my most important point as well. But apart from
concatenations of literal strings or variables/constants holding
strings, I can't imagine cases where the + concatenation would be more
readable (see above, IMO). Even in these cases I might tend to use
String.Format because during the course of development I find it much
easier to extend and change. I can always change it if the profiler says
it's a problem.




Re: regex syntax by Jon

Jon
Thu Aug 25 03:17:52 CDT 2005

Oliver Sturm <oliver@sturmnet.org> wrote:
> >>Well, if you ask me, you should always use String.Format when putting
> >>together strings from more than two parts.
> >
> > I disagree.
>
> I guess I should have qualified my statement better. I might have added
> conditions like "and at least one of the parts is not a string in itself".

Do you have evidence that String.Format doesn't itself convert the
arguments to intermediate strings? If it does, I can't see that using
it is saving any operations.

> > You can depend on it in C# at least - it's in the specification, IIRC.
>
> I would readily assume it even without reading the specs. I would make a
> test if it were in any way important to me. Until then, I wouldn't
> depend on it.

Well, take it from me - you *can* depend on it. (That's assuming that
by "static" you mean "constant".)

> > Sometimes String.Format is more readable; sometimes it's less readable.
> > In almost all cases, readability should be the key to determining which
> > to use.
>
> Right, that was my most important point as well. But apart from
> concatenations of literal strings or variables/constants holding
> strings, I can't imagine cases where the + concatenation would be more
> readable (see above, IMO). Even in these cases I might tend to use
> String.Format because during the course of development I find it much
> easier to extend and change. I can always change it if the profiler says
> it's a problem.

In cases with a single parameter you want at the end of the string, I
think it's more readable to have:

string x = "Age: "+age;

than:

string x = string.Format("Age: {0}", age);

It's very easy to change the former to the latter if you ever *do* want
to do anything more complicated.
