Regular Expression to find URLs in block of Text (Javascript)
I need a JavaScript regular expression that scans a block of plain text and returns the text with the URLs converted to links. This is what I have:
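findLinks: function(s) {
    var hlink = /\s(ht|f)tp:\/\/([^ \,\;\:\!\)\(\"\'\<\>\f\n\r\t\v])+/g;
    return (s.replace(hlink, function($0, $1, $2) {
        // Drop the leading whitespace character captured by \s
        s = $0.substring(1, $0.length);
        // Strip trailing periods, which usually belong to the sentence
        while (s.length > 0 && s.charAt(s.length - 1) == '.')
            s = s.substring(0, s.length - 1);
        return ' <a href="' + s + '">' + s + '</a>';
    }));
}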
The problem is that it only matches http://www.google.com and NOT google.com/adsense. How could I match both?
4 Answers
I use this as a reference all the time. This guy has 8 regexes you should know.
Here is what he uses to look for URLs
He also breaks down what each part does. Very useful for learning regexes and not just getting an answer that works for reasons you don’t understand.
Email validation with regex is no trivial matter. I think this is more for learning than for using in hardcore production environments. However the URL pattern has worked well for me. Obviously it’s going to need adjustments if your flavor of regex differs.
This is a non-trivial task. To match any URI that is valid according to the relevant RFCs you need a monumentally complex regular expression, and even then that won’t filter out URIs with invalid top-level domains (e.g. http://brussels.sprout/). So, you have to compromise. Determine what’s important to you (examples: are false positives or false negatives more acceptable? Do you want to limit top-level domains to only those that currently exist? Do you allow non-Latin characters in matched URIs?). You should decide what you need your regular expression to do and design it accordingly rather than blindly copying and pasting an example from the web.
You could make the protocol part optional:
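A minimal sketch of an optional-protocol pattern, tried against the question's google.com/adsense example. The name `linkify`, the rule that a bare host must end in a label of two or more letters, and the choice to prepend `http://` for scheme-less matches are all assumptions made for illustration; the pattern is deliberately simple and will produce false positives (for example on file names such as report.txt).

```javascript
// Hypothetical helper: link URLs whether or not they carry a scheme.
// Group 1 (the scheme) is optional; group 2 is a dotted host; group 3 a path.
var urlPattern = /\b((?:https?|ftp):\/\/)?((?:[a-z0-9-]+\.)+[a-z]{2,})(\/[^\s<>"']*)?/gi;

function linkify(text) {
    return text.replace(urlPattern, function (match, scheme) {
        // Trailing sentence punctuation is rarely part of the URL.
        var url = match.replace(/[.,;:!?]+$/, '');
        var rest = match.slice(url.length);
        // A bare host needs a scheme, or the browser treats the href as relative.
        var href = scheme ? url : 'http://' + url;
        return '<a href="' + href + '">' + url + '</a>' + rest;
    });
}

console.log(linkify('see google.com/adsense now'));
// see <a href="http://google.com/adsense">google.com/adsense</a> now
```

The output is not HTML-escaped, so this sketch is only suitable for text that is already safe to insert into a page.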
Try this (works with your sample text)
Javascript: find URLs in a document
How do I find URLs (e.g. www.domain.com) within a document and put them within anchors, like <a href="www.domain.com">www.domain.com</a>? Sample HTML:
Hey dude, check out this link www.google.com and www.yahoo.com!
2 Answers
Firstly, www.domain.com isn’t a URL, it’s a hostname, and <a href="www.domain.com"> won’t work: it’ll look for a .com file called www.domain relative to the current page.
It’s not possible to highlight hostnames in the general case because almost anything can be a hostname. You could try to highlight ‘www.something.dot.separated.words’, but it’s not really that reliable and there are many sites that don’t use the www. hostname prefix. I’d try to avoid that.
This is a very liberal pattern you could use as a starting point for detecting HTTP URLs. Depending on what sort of input you’ve got you may want to narrow down what it allows, and it may be worth detecting trailing characters like . or ! that would be valid parts of the URL but in practice generally aren’t.
(You could use a | to allow either the URL syntax or the www.hostname syntax, if you like.)
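As a rough sketch of both suggestions, the pattern below uses `|` to accept either the `http(s)://` form or a `www.` hostname, and then trims trailing punctuation from each match. The character class is the liberal one from this answer's `addLinks` code; the `www\.` branch, the set of trimmed characters, and the name `findUrls` are assumptions for illustration only.

```javascript
// Either a full http(s) URL or a bare hostname beginning with "www.".
var candidate = /\b(?:https?:\/\/|www\.)[^\s<>"`{}|\^\[\]\\]+/g;

function findUrls(text) {
    var found = [];
    var match;
    candidate.lastIndex = 0; // global regexps remember their position between calls
    while ((match = candidate.exec(text)) !== null) {
        // Drop trailing punctuation that is legal in a URL but is usually prose.
        found.push(match[0].replace(/[.,!?;:]+$/, ''));
    }
    return found;
}

console.log(findUrls('Hey dude, check out this link www.google.com and www.yahoo.com!'));
// [ 'www.google.com', 'www.yahoo.com' ]
```

Hostnames found through the `www.` branch still need a scheme such as `http://` prepended before they are used as an href.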
Anyhow, once you’ve settled on your preferred pattern you’ll need to find that pattern in text nodes on the page. Don’t run the regexp over innerHTML markup. You’ll end up completely ruining the page by trying to mark up every href="http://something" that’s already inside markup. You’ll also destroy any existing JavaScript references, events or form field values when you replace the innerHTML content.
In general, regexps simply cannot process HTML in any reliable way. So take advantage of the fact that the browser has already parsed the HTML into elements and text nodes, and just look at the text nodes. You’ll also want to avoid looking inside <a> elements, since marking up a URL as a link when it’s already in a link is silly (and invalid).
// Mark up `http://...` text in an element and its descendants as links.
//
function addLinks(element) {
    var urlpattern= /\bhttps?:\/\/[^\s<>"`{}|\^\[\]\\]+/g;
    findTextExceptInLinks(element, urlpattern, function(node, match) {
        // Split the text node so the matched URL sits in a node of its own,
        // then move that node inside a new <a> element.
        node.splitText(match.index+match[0].length);
        var a= document.createElement('a');
        a.href= match[0];
        a.appendChild(node.splitText(match.index));
        node.parentNode.insertBefore(a, node.nextSibling);
    });
}

// Find text in descendants of an element, in reverse document order
// pattern must be a regexp with global flag
//
function findTextExceptInLinks(element, pattern, callback) {
    for (var childi= element.childNodes.length; childi-->0;) {
        var child= element.childNodes[childi];
        if (child.nodeType===Node.ELEMENT_NODE) {
            if (child.tagName.toLowerCase()!=='a')
                findTextExceptInLinks(child, pattern, callback);
        } else if (child.nodeType===Node.TEXT_NODE) {
            var matches= [];
            var match;
            while (match= pattern.exec(child.data))
                matches.push(match);
            // Process matches last to first so earlier indexes stay valid
            for (var i= matches.length; i-->0;)
                callback.call(window, child, matches[i]);
        }
    }
}
regex to find url in a text
I looked into this issue last year and developed a solution that you may want to look at. See: URL Linkification (HTTP/FTP). This link is a test page for the JavaScript solution with many examples of difficult-to-linkify URLs.
My regex solution, written for both PHP and JavaScript, is not simple (but neither is the problem, as it turns out). For more information I would also recommend reading:
The comments following Jeff’s blog post are a must read if you want to do this right.
Note that this question gets asked a lot. Maybe do a search next time 🙂
Thanks for making this available, I found it very useful. Any chance you’ve come up with a similarly robust regex that finds URLs without the leading ‘http://’, like ‘www.example.com’?
You can’t do this perfectly with a regular expression. You may be interested in this blog post. There is a bit more information on Regex Guru, but even those look very fragile. You will need to have additional checks outside of your regular expression to catch the edge cases.
I think it would be more accurate to say that you can’t do this perfectly and you can’t do it with regex alone. FWIW, Stack Overflow’s WMD editor uses a similar solution to the one Jeff Atwood describes in your first link, using a combination of a regex and various checks. Like I said, it can’t be perfect but for lack of a better solution you might as well use something that will match 99.9% of the time.
Interesting stuff, but I’d say that the blanket comment "can’t do this" is a little strong. More like, "can do this 99% of the time" 🙂
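To make the "regex plus extra checks" idea concrete, here is a small sketch in that spirit. It is not the code from the linked blog post or from the WMD editor; the pattern, the single parenthesis check, and the names `loose`, `cleanMatch` and `extractUrls` are assumptions chosen for illustration.

```javascript
// Step 1: a permissive pattern that may swallow a wrapping "(" and ")".
// The final character class excludes punctuation that usually ends a sentence.
var loose = /\(?\bhttps?:\/\/[-A-Za-z0-9+&@#\/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#\/%=~_()|]/g;

// Step 2: a check that is awkward to express in the regex itself.
function cleanMatch(raw) {
    // "(http://example.com)" is a URL wrapped in prose parentheses,
    // so strip the opening one and its partner if present.
    if (raw.charAt(0) === '(') {
        raw = raw.slice(1);
        if (raw.charAt(raw.length - 1) === ')') {
            raw = raw.slice(0, -1);
        }
    }
    return raw;
}

function extractUrls(text) {
    return (text.match(loose) || []).map(cleanMatch);
}

console.log(extractUrls('See (http://example.com/a) and ' +
    'http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software) today.'));
// [ 'http://example.com/a',
//   'http://en.wikipedia.org/wiki/PC_Tools_(Central_Point_Software)' ]
```

Parentheses inside the URL survive because the check only fires when the match itself begins with an opening parenthesis. Other edge cases would need further checks of the same kind.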
Identifying URLs is tricky because they are often surrounded by punctuation marks and because users frequently do not use the full form of the URL. Many JavaScript functions exist for replacing URLs with hyperlinks, but I was unable to find one that works as well as the urlize filter in the Python-based web framework Django. I therefore ported Django’s urlize function to JavaScript: https://github.com/ljosa/urlize.js
It actually would not pick up the URL in your example because there is a colon right before the URL. But if we modify the example a little:
Note the second argument which, if true, inserts rel="nofollow" and the third argument which, if true, quotes characters that have special meaning in HTML.
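To show what those two flags mean in practice, here is a hypothetical stand-in written from scratch. It is NOT the urlize.js API or its implementation: the names `linkUrls` and `escapeHtml`, the argument order, and the simple URL pattern are assumptions made only to illustrate the nofollow and HTML-quoting behaviour.

```javascript
// Quote the characters that have special meaning in HTML.
function escapeHtml(s) {
    return s.replace(/&/g, '&amp;')
            .replace(/</g, '&lt;')
            .replace(/>/g, '&gt;')
            .replace(/"/g, '&quot;');
}

// Hypothetical illustration of the two flags, not urlize itself.
function linkUrls(text, nofollow, autoescape) {
    // A URL may contain punctuation but must not end with it.
    var pattern = /\bhttps?:\/\/[^\s<>"]+[^\s<>".,;:!?]/g;
    var out = '';
    var last = 0;
    var match;
    while ((match = pattern.exec(text)) !== null) {
        var before = text.slice(last, match.index);
        var url = match[0];
        out += autoescape ? escapeHtml(before) : before;
        // The href attribute value is always escaped.
        out += '<a href="' + escapeHtml(url) + '"' +
               (nofollow ? ' rel="nofollow"' : '') + '>' +
               (autoescape ? escapeHtml(url) : url) + '</a>';
        last = match.index + url.length;
    }
    var tail = text.slice(last);
    return out + (autoescape ? escapeHtml(tail) : tail);
}

console.log(linkUrls('a < b, see http://example.com/?x=1&y=2.', true, true));
// a &lt; b, see <a href="http://example.com/?x=1&amp;y=2" rel="nofollow">http://example.com/?x=1&amp;y=2</a>.
```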
I am using this regex (it’s translated from ABNF) 🙂
[a-zA-Z]([a-zA-Z]|6|\+|\-|\.)*:\/\/((([a-zA-Z]|9|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:)*@)?(\[((([0-9A-Fa-f]:)([0-9A-Fa-f]:[0-9A-Fa-f]|(252|238|19|28|8)\.(252|234|15|44|7)\.(252|243|18|92|8)\.(252|219|18|37|4))|::([0-9A-Fa-f]:)([0-9A-Fa-f]:[0-9A-Fa-f]|(252|249|11|92|9)\.(251|217|17|65|2)\.(251|238|12|44|7)\.(254|229|19|51|9))|([0-9A-Fa-f]). ([0-9A-Fa-f]:)([0-9A-Fa-f]:[0-9A-Fa-f]|(253|241|19|15|1)\.(254|235|15|38|2)\.(253|221|13|67|2)\.(253|217|19|36|5))|(([0-9A-Fa-f]:)[0-9A-Fa-f]). ([0-9A-Fa-f]:)([0-9A-Fa-f]:[0-9A-Fa-f]|(253|211|14|19|6)\.(253|238|15|11|5)\.(251|227|12|66|2)\.(255|246|19|47|1))|(([0-9A-Fa-f]:)[0-9A-Fa-f]). ([0-9A-Fa-f]:)([0-9A-Fa-f]:[0-9A-Fa-f]|(253|229|17|85|2)\.(253|221|12|85|2)\.(255|217|15|67|4)\.(255|248|17|13|5))|(([0-9A-Fa-f]:)[0-9A-Fa-f]). [0-9A-Fa-f]:([0-9A-Fa-f]:[0-9A-Fa-f]|(251|224|17|35|8)\.(252|247|13|63|6)\.(255|239|18|11|2)\.(254|245|17|17|7))|(([0-9A-Fa-f]:)[0-9A-Fa-f]). ([0-9A-Fa-f]:[0-9A-Fa-f]|(253|211|18|26|4)\.(255|235|16|13|6)\.(252|212|11|58|9)\.(252|249|11|34|5))|(([0-9A-Fa-f]:)[0-9A-Fa-f]). [0-9A-Fa-f]|(([0-9A-Fa-f]:)[0-9A-Fa-f]). 
)|v[0-9A-Fa-f]\.(([a-zA-Z]|6|-|\.|_|~)|[!$&'\(\)\*\+,;=]|:))\]|(255|242|18|28|8)\.(254|224|16|21|1)\.(253|211|17|64|8)\.(252|236|18|72|3)|(([a-zA-Z]|1|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=])*)(:8*)?(((\/(([a-zA-Z]|8|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)*)*|\/((([a-zA-Z]|3|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)(\/(([a-zA-Z]|8|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)*)*)?|(([a-zA-Z]|3|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)(\/(([a-zA-Z]|6|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)*)*|(([a-zA-Z]|5|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|@)(\/(([a-zA-Z]|4|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)*)*))?\/?(\?((([a-zA-Z]|2|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)|\/|\?)*)?(\#((([a-zA-Z]|1|-|\.|_|~)|%[0-9A-Fa-f][0-9A-Fa-f]|[!$&'\(\)\*\+,;=]|:|@)|\/|\?)*)?