Email address validation!

-- Gilbert E. Detillieux E-mail: gedetil@cs.umanitoba.ca Dept. of Computer Science Web: http://cs.umanitoba.ca/~gedetil/ University of Manitoba Winnipeg MB CANADA R3T 2N2 For best service, contact cstech@cs.umanitoba.ca.

Trevor Cordes

13 Apr

10:59 a.m.

On 2021-04-11 Adam Thompson wrote:

...

TL;DR: you think you know how to validate an email address? You're wrong.

https://www.netmeister.org/blog/email.html

Here's my official 20-years-of-wisdom email validation regex that I use in many projects (UTF-8 encoding, PHP single-quote format):

(?:[\x21\x23-\x27\x2A-\x2B\x2D\x2F-\x39\x3F\x41-\x5A\x5E-\x60\x61-\x7A\x7B-\x7E\x{0080}-\x{FFFF}]+(?:\x2E[\x21\x23-\x27\x2A-\x2B\x2D\x2F-\x39\x3F\x41-\x5A\x5E-\x60\x61-\x7A\x7B-\x7E\x{0080}-\x{FFFF}]+)*|\x22(?:[\x21\x23-\x5B\x5D-\x7E\x{0080}-\x{FFFF}]|\x5C[\x20\x09\x21-\x7E\x{0080}-\x{FFFF}])*\x22)@[\x2Da-zA-Z0-9\x{0080}-\x{FFFF}]+(?:\.[\x2Da-zA-Z0-9\x{0080}-\x{FFFF}]+)+

(I actually wrote a script that closely matches the RFC5322 grammar and auto-gens the above craziness.)

Yes, the above regex makes some assumptions and errs on the side of allowing rather than blocking (to try to capture real customers while still blocking egregious fuzzers/spammers).
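To make the behaviour of the pattern easier to poke at, here is a rough Python sketch. It is not the original PHP code: the pattern is copied as posted, the `\x{HHHH}` escapes are rewritten to Python's `\uHHHH` form, and the helper name `looks_valid` and the choice to anchor with `fullmatch` are assumptions made for illustration.

```python
import re

# The pattern as posted (PCRE syntax), split across lines only for width.
PCRE_PATTERN = (
    r'(?:[\x21\x23-\x27\x2A-\x2B\x2D\x2F-\x39\x3F\x41-\x5A\x5E-\x60\x61-\x7A\x7B-\x7E\x{0080}-\x{FFFF}]+'
    r'(?:\x2E[\x21\x23-\x27\x2A-\x2B\x2D\x2F-\x39\x3F\x41-\x5A\x5E-\x60\x61-\x7A\x7B-\x7E\x{0080}-\x{FFFF}]+)*'
    r'|\x22(?:[\x21\x23-\x5B\x5D-\x7E\x{0080}-\x{FFFF}]|\x5C[\x20\x09\x21-\x7E\x{0080}-\x{FFFF}])*\x22)'
    r'@[\x2Da-zA-Z0-9\x{0080}-\x{FFFF}]+(?:\.[\x2Da-zA-Z0-9\x{0080}-\x{FFFF}]+)+'
)

# Python's re module has no \x{HHHH} escape, so rewrite each one as \uHHHH.
PY_PATTERN = re.sub(r'\\x\{([0-9A-Fa-f]{4})\}', r'\\u\1', PCRE_PATTERN)
EMAIL_RE = re.compile(PY_PATTERN)


def looks_valid(address: str) -> bool:
    """True if the whole string matches (full anchoring is an assumption)."""
    return EMAIL_RE.fullmatch(address) is not None


if __name__ == "__main__":
    for candidate in [
        "user@example.com",
        "user+tag@example.com",
        '"a..b"@example.com',
        "a..b@example.com",
        "user@localhost",
    ]:
        # ascii() keeps the output printable on any terminal encoding.
        print(ascii(candidate), looks_valid(candidate))
```

With full anchoring, the first three candidates match and the last two do not. In particular, consecutive dots only get through inside a quoted local part; if the pattern is applied unanchored, far more will pass, since any matching substring is enough.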

The page you reference is very good and had some things I haven't thought of in a while. However, I don't think he's actually written something useful for "validation" vis-à-vis 2021.

Any source routing (and UUCP bang) is evil and no one uses it anymore. If a customer of my site wants to try to use it, they deserve to be blocked. :-) A good rule of thumb on validation should be "if someone is smart enough to use a really weird email address feature, they are smart enough to know why it didn't work and why they should use something simpler". We can mildly tick off 0.01% of users (wizards); that's OK, as long as we keep the 99.99% of normal users from having to think. So his points 1-3 are ignorable.

Points 4-8 are the important bits I like to address, hence all the wacky utf8 craziness in my regex, with the big ones being special chars in the proper places, + feature, and "" rules. I think I capture some of the dot rules, but I do allow consecutive dots because --who cares--.

Point 8: I just noticed I only allow code points up to U+FFFF (so nothing that needs 4-byte UTF8), oops, as I wrote this before the db properly handled 3/4-byte. Time to update the regex! But if anyone is using poomoji as their local part...
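For reference, the `\x{0080}-\x{FFFF}` range in the regex covers every code point up to U+FFFF, which is at most three bytes in UTF-8; what it leaves out is the 4-byte range where the emoji live. A small Python sketch of that boundary, with the class names `BMP_ONLY` and `ALL_PLANES` made up for illustration (the widened range is one possible update, not necessarily the one the regex will get):

```python
import re


def utf8_len(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


# The non-ASCII range used in the posted regex, in Python escape syntax.
BMP_ONLY = re.compile(r'[\u0080-\uFFFF]+')
# A hypothetical widened range that also admits 4-byte characters.
ALL_PLANES = re.compile(r'[\u0080-\U0010FFFF]+')

if __name__ == "__main__":
    for ch in ["\u00e9", "\u20ac", "\U0001F4A9"]:
        print(
            ascii(ch),
            utf8_len(ch),
            BMP_ONLY.fullmatch(ch) is not None,
            ALL_PLANES.fullmatch(ch) is not None,
        )
```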

Point 9 is hard (impossible) to do in a regex like the one above. Just trim the local and domain parts and hope for the best. Anyone going over the limit will know why it's failing...
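Assuming point 9 is about the length limits, the check is easy enough to do outside the regex. A minimal Python sketch, using the RFC 5321 limits of 64 octets for the local part and 255 octets for the domain; the function name and the split on the last "@" are illustrative choices, not anything from the original code:

```python
MAX_LOCAL_OCTETS = 64    # RFC 5321 section 4.5.3.1.1
MAX_DOMAIN_OCTETS = 255  # RFC 5321 section 4.5.3.1.2


def within_length_limits(address: str) -> bool:
    """Check local-part and domain sizes in octets, not characters."""
    # Split on the LAST "@", since a quoted local part may contain one.
    local, sep, domain = address.rpartition("@")
    if not sep:
        return False
    local_len = len(local.encode("utf-8"))
    domain_len = len(domain.encode("utf-8"))
    return 0 < local_len <= MAX_LOCAL_OCTETS and 0 < domain_len <= MAX_DOMAIN_OCTETS


if __name__ == "__main__":
    print(within_length_limits("a" * 64 + "@example.com"))
    print(within_length_limits("a" * 65 + "@example.com"))
```

Counting encoded bytes rather than characters matters once non-ASCII local parts are allowed, since one character can take several octets.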

Point 10, you just have to do an MX lookup if you want to avoid the fuzzers... I disagree with this guy's point 10. If you want to register your email address at web sites before you register and set up your domain... tough!
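A rough sketch of where an MX check could sit in the flow, in Python. The standard library has no MX lookup, so the resolver is passed in as a callable; `lookup_mx` and `domain_accepts_mail` are hypothetical names, and in practice the callable might wrap something like dnspython's `dns.resolver.resolve(domain, "MX")`.

```python
from typing import Callable, Iterable


def domain_accepts_mail(
    domain: str,
    lookup_mx: Callable[[str], Iterable[str]],
) -> bool:
    """True if the caller-supplied resolver returns at least one MX host.

    lookup_mx is an assumption: it should return the MX host names for
    the domain, or an empty iterable when there are none.
    """
    return any(True for _ in lookup_mx(domain))


if __name__ == "__main__":
    # A fake resolver standing in for real DNS, for demonstration only.
    fake_dns = {"example.com": ["mx1.example.com"]}

    def fake_lookup(domain: str) -> Iterable[str]:
        return fake_dns.get(domain, [])

    print(domain_accepts_mail("example.com", fake_lookup))
    print(domain_accepts_mail("not-registered.example", fake_lookup))
```

Keeping the resolver injectable also makes it simple to add a timeout or a cache later without touching the validation logic.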

Point 11 is important, but it's pretty much already "free" with no special logic, because punycode just looks like a normal domain to most code.
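To illustrate why punycode comes for free: once an internationalized domain is converted to its ASCII form, it is just letters, digits, dots and hyphens. A small Python sketch using the built-in `idna` codec; the ASCII-only `DOMAIN_RE` below is a simplified stand-in for the domain half of the regex, not the original.

```python
import re

# Simplified ASCII-only version of the domain part, for illustration.
DOMAIN_RE = re.compile(r'[\x2Da-zA-Z0-9]+(?:\.[\x2Da-zA-Z0-9]+)+')


def to_ascii_domain(domain: str) -> str:
    """Convert an internationalized domain to its punycode (ASCII) form."""
    return domain.encode("idna").decode("ascii")


if __name__ == "__main__":
    ascii_domain = to_ascii_domain("b\u00fccher.example")
    print(ascii_domain)
    print(DOMAIN_RE.fullmatch(ascii_domain) is not None)
```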

Point 12, dotless domains... Tough! Bite me. And though there may be domains like that allowed by *domain* RFCs, do the *email* RFCs allow it?

Point 13 is half-evil. Everyone's validator will already allow a raw dotted-quad, but I'm not making special rules for [] syntax. Only uber-gurus would try this, and they will know why it fails.
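The last two points can be checked against the domain half of the regex on its own. The Python sketch below copies that half (with `\x{HHHH}` rewritten as `\uHHHH`) and anchors it with `fullmatch`, which is an assumption about how the pattern is applied; the helper name is made up for illustration.

```python
import re

# Domain half of the posted regex, in Python escape syntax.
DOMAIN_RE = re.compile(
    r'[\x2Da-zA-Z0-9\u0080-\uFFFF]+(?:\.[\x2Da-zA-Z0-9\u0080-\uFFFF]+)+'
)


def domain_ok(domain: str) -> bool:
    """True if the whole domain string matches the pattern."""
    return DOMAIN_RE.fullmatch(domain) is not None


if __name__ == "__main__":
    for domain in ["example.com", "localhost", "192.0.2.1", "[192.0.2.1]"]:
        print(domain, domain_ok(domain))
```

The trailing `+` on the dotted group is what forces at least one dot, so dotless domains fail. A bare dotted-quad passes simply because digits and dots are allowed (nothing checks that each octet is 0-255), while the bracketed literal fails because `[` and `]` are not in the character class.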

So I agree it's not as easy as it seems, and going down that rabbit hole will take a few evenings of work and you still won't arrive at "correct". But you can get "good enough" if you want to allow 99.99% of normal users to sign up at your site. I'm open to arguments and suggestions, as it's a bi-yearly ritual to try to improve that regex. Like adding 4-byte UTF8 support so my customers can be poo.


roundtable@muug.ca
