HTML code to find text

Find word in HTML

It works well on words that are not inside HTML tags. What I want is to ignore those that are inside HTML tags. Example: find("spain")
Input:

The rain in Spain stays mainly in the plain. 
The rain in Spain stays mainly in the plain. 

Why did you come to the conclusion that you need regular expressions? You state the problem, and we’ll take care of the solution.

4 Answers

To account for HTML tags and attributes that could match, you are going to need to parse that HTML one way or another. The easiest way is to add it to the DOM (or just to a new element):

    var container = document.createElement("div");
    container.style.display = "none";
    document.body.appendChild(container); // this step is optional
    container.innerHTML = where;

Once parsed, you can now iterate the nodes using DOM methods and find just the text nodes and search on those. Use a recursive function to walk the nodes:

    function wrapWord(el, word) {
        // The "g" flag makes match() return every occurrence, so a second
        // match in the same text node is wrapped as well.
        var expr = new RegExp(word, "gi");
        // Copy childNodes so that nodes inserted below are not visited again.
        var nodes = [].slice.call(el.childNodes, 0);
        for (var i = 0; i < nodes.length; i++) {
            var node = nodes[i];
            if (node.nodeType == 3) { // textNode
                var matches = node.nodeValue.match(expr);
                if (matches) {
                    var parts = node.nodeValue.split(expr);
                    for (var n = 0; n < parts.length; n++) {
                        if (n) {
                            // Wrap the match that separated parts[n - 1] and parts[n].
                            var span = el.insertBefore(document.createElement("span"), node);
                            span.appendChild(document.createTextNode(matches[n - 1]));
                        }
                        if (parts[n]) {
                            el.insertBefore(document.createTextNode(parts[n]), node);
                        }
                    }
                    el.removeChild(node);
                }
            } else {
                wrapWord(node, word);
            }
        }
    }

Hi Gilly. Thanks. It looks good, the only problem I’m having with this is that once I have the HTML string I want, I have to replace everything that’s in the body (as the body’s content I want to search) with the changed string.

gilly, be careful, if you throw another "spain" in there, look what happens to your example: jsfiddle.net/vol7ron/J8JJm/1

@Francisc: Don’t worry about HTML strings at all, just pass document.body to the recursive function. E.g., call wrapWord(document.body, "spain"). The text will be replaced in place.

You won’t be able to process HTML in any reliable way using regex. Instead, parse the HTML into a DOM tree and iterate the Text nodes checking their data for content.

If you are using JavaScript in a web browser, the parsing will already have been done for you. See this question for example wrap-word-in-span code. It’s much trickier if you need to match phrases that might be split across different elements.
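To illustrate why phrases split across elements are trickier, here is a minimal sketch that is not taken from the linked answer. It assumes the text nodes have already been collected in document order (for example with a recursive walk over childNodes) and only requires each node to expose nodeValue. The name findPhraseAcrossNodes and the shape of the returned objects are hypothetical, and lower-casing is assumed not to change the length of the text, which holds for ASCII.

```javascript
// Hypothetical helper: locate a phrase that may span several text nodes.
// `textNodes` is an array of objects exposing `nodeValue`, in document order.
function findPhraseAcrossNodes(textNodes, phrase) {
    // Record where each node's text starts in the concatenated string.
    var starts = [];
    var combined = "";
    for (var i = 0; i < textNodes.length; i++) {
        starts.push(combined.length);
        combined += textNodes[i].nodeValue;
    }

    var hits = [];
    var haystack = combined.toLowerCase();
    var needle = phrase.toLowerCase();
    if (needle === "") return hits;

    var from = haystack.indexOf(needle);
    while (from !== -1) {
        var to = from + needle.length; // exclusive end of the match
        var touched = [];
        for (var k = 0; k < textNodes.length; k++) {
            var nodeStart = starts[k];
            var nodeEnd = nodeStart + textNodes[k].nodeValue.length;
            // Keep non-empty nodes whose range overlaps [from, to).
            if (nodeEnd > nodeStart && nodeStart < to && nodeEnd > from) {
                touched.push(k);
            }
        }
        hits.push({ start: from, end: to, nodeIndexes: touched });
        from = haystack.indexOf(needle, to);
    }
    return hits;
}
```

Each hit lists the indexes of the text nodes that the match touches, so a caller can decide how to wrap or highlight a match that crosses element boundaries.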

Thank you. Is there any way to replace on-page content, or do I have to get the body element, pass it to the function, then replace all of it with the parsed string? That seems bad.

The example code in the linked answer replaces the matched content with a wrapped version of that match. If you want, you could replace it with something else by changing the code inside the function() { ... } passed to findText, for example by using replaceChild instead of insertBefore. Indeed, you don’t want to be touching HTML strings.

    function find(what:String, where:String) {
        what = what
            // 1. Escape characters that have a special meaning in a regular expression.
            .replace(/(\[|\\|\^|\$|\.|\||\?|\*|\+|\(|\)|\{|\})/g, "\\$1")
            // 2. Let every unknown character match a named entity, a numeric entity
            //    or any single non-whitespace character. The "+" quantifiers are an
            //    assumption.
            .replace(/[^a-zA-Z0-9\s:;'"~[\]\{\}\-_+=(),.<>*\/!@#$%^&|\\?]/g, "(?:&[0-9A-Za-z]+;|&#[0-9]+;|[^\\s<])")
            // 3-5. Replace <, > and " by their HTML entities.
            .replace(/</g, "&lt;?").replace(/>/g, "&gt;?").replace(/"/g, "(?:\"|&quot;?)")
            // 6. Let white-space also match the &nbsp; entity.
            .replace(/\s/g, "(?:\\s|&nbsp;?)");
        // Only match text that follows a closing ">" or the start of the string.
        what = "(>[^<]*|^[^<]*)(" + what + ")";
        var regexp:RegExp = new RegExp(what, 'gi');
        // $1 is the text preceding the match, $2 is the match itself.
        return where.replace(regexp, '$1$2');
    }

  1. The first replace call adds a backslash before characters that have a special meaning in a regular expression, to prevent errors or unexpected results.
  2. The second replace call replaces every unknown character in the search query with (?:&[0-9A-Za-z]+;|&#[0-9]+;|[^\s<]). This expression has three parts: first, it tries to match a named HTML entity; second, it tries to match a numeric HTML entity; finally, it matches any non-whitespace character (in case the author of the HTML document didn't encode the characters properly).
  3. The third, fourth and fifth replace calls replace <, > and " with the corresponding HTML entities, so that the search query will not search through tags.
  4. The sixth replace call replaces white-space with (?:\s|&nbsp;?), which matches white-space characters and the &nbsp; HTML entity.
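The first and last steps of the list can be sketched in isolation. The following is a simplified illustration, not the answer's function: escapeRegExp and buildTextPattern are hypothetical names, the code uses plain JavaScript rather than the typed syntax above, and it deliberately leaves out the entity handling for unknown characters.

```javascript
// Hypothetical helper: escape every character that is special in a regex.
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: build a case-insensitive pattern in which each
// whitespace character of the query also matches the &nbsp; entity.
function buildTextPattern(query) {
    var source = escapeRegExp(query).replace(/\s/g, "(?:\\s|&nbsp;)");
    return new RegExp(source, "gi");
}
```

With these helpers, buildTextPattern("a.b c") matches "a.b&nbsp;c" but not "axb c", because the dot in the query is treated as a literal character.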

The only shortcoming of this function is that unknown special characters (such as €) match any HTML entity or character (following the example, not only &euro; and € are valid matches, but also £ and @).

This proposed solution suits most cases. It can be inaccurate in complex situations, which is probably no worse than a DOM iteration (which is very susceptible to memory leaks and requires more computing power).

When you work with HTML elements which have Event listeners assigned through DOM, you should iterate through all (child) elements, and apply this function to every Text node.

Find words in HTML page with JavaScript

To find the element that word exists in, you’d have to traverse the entire tree looking in just the text nodes, applying the same test as above. Once you find the word in a text node, return the parent of that node.

    var word = "foo",
        queue = [document.body],
        curr;
    while (curr = queue.pop()) {
        if (!curr.textContent.match(word)) continue;
        for (var i = 0; i < curr.childNodes.length; ++i) {
            switch (curr.childNodes[i].nodeType) {
                case Node.TEXT_NODE: // 3
                    if (curr.childNodes[i].textContent.match(word)) {
                        console.log("Found!");
                        console.log(curr);
                        // you might want to end your search here.
                    }
                    break;
                case Node.ELEMENT_NODE: // 1
                    queue.push(curr.childNodes[i]);
                    break;
            }
        }
    }

This works in Firefox; no promises for IE.

What it does is start with the body element and check whether the word exists anywhere inside that element. If it doesn’t, then that’s it, and the search stops there. If it is in the body element, it loops through all the immediate children of the body. If it finds a text node, it checks whether the word is in that text node. If it finds an element, it pushes that onto the queue. It keeps going until you’ve either found the word or there are no more elements to search.
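The same walk can be packaged as a function that returns the matching parent elements instead of logging them. This is a sketch under assumptions: findParentsOfText is a hypothetical name, the comparison uses indexOf rather than match so that characters such as . or ? in the word are taken literally, and the numeric node types 1 and 3 stand in for Node.ELEMENT_NODE and Node.TEXT_NODE.

```javascript
// Hypothetical helper: return every element that directly contains a text
// node with `word` in it. Works on any tree exposing nodeType, childNodes
// and textContent, such as document.body in a browser.
function findParentsOfText(root, word) {
    var needle = word.toLowerCase();
    var found = [];
    var stack = [root];
    var curr;
    while ((curr = stack.pop())) {
        // Prune: skip subtrees whose combined text lacks the word.
        if (curr.textContent.toLowerCase().indexOf(needle) === -1) continue;
        for (var i = 0; i < curr.childNodes.length; i++) {
            var child = curr.childNodes[i];
            if (child.nodeType === 3) { // text node
                if (child.textContent.toLowerCase().indexOf(needle) !== -1 &&
                        found.indexOf(curr) === -1) {
                    found.push(curr);
                }
            } else if (child.nodeType === 1) { // element node
                stack.push(child);
            }
        }
    }
    return found;
}
```

In a browser this could be called as findParentsOfText(document.body, "foo"); returning an array leaves it to the caller whether to stop at the first match or process all of them.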

The innerText won’t include any tag names, just the value of the text nodes, so you’ll be safe there.

Heads up, @nickf: I think you forgot that the innerText property is not supported in FF and some other browsers. You might want to substitute it with ‘textContent’ in those cases. Still, +1 😉

You can iterate through DOM elements, looking for a substring within them. Neither fast nor elegant, but for small HTML might work well enough.

I’d try something recursive, like: (code not tested)

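    function findText(node, text) {
        // Leaf node: report it only if its own text contains the search string.
        if (node.childNodes.length == 0) {
            return (node.textContent || "").indexOf(text) == -1 ? [] : [node];
        }
        var matchingNodes = new Array();
        for (var i = 0; i < node.childNodes.length; i++) {
            // concat() returns a new array, so the result has to be assigned back.
            matchingNodes = matchingNodes.concat(findText(node.childNodes[i], text));
        }
        return matchingNodes;
    }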

You can try using XPath; it’s fast and accurate.

Also, if XPath is a bit too complicated, you can try a JavaScript library like jQuery, which hides the boilerplate code and makes it easier to express what you want to find.

Also, as of IE8 and Firefox 3.5, the Selectors API is implemented. All you need to do is use CSS to express what to search for.

You can probably read the body of the document tree and perform simple string tests on it fast enough without having to go far beyond that. It depends a bit on the HTML you are working with, though: how much control do you have over the pages? If you are working within a site you control, you can probably focus your search on the parts of the page likely to differ from page to page. If you are working with other people’s pages, you’ve got a tougher job on your hands, simply because you don’t necessarily know what content you need to test against.

Again, if you are going to search the same page multiple times and your data set is large, it may be worth creating some kind of index in memory, whereas if you are only going to search for a few words or use smaller documents, it’s probably not worth the time and complexity to build that.
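As a rough illustration of the in-memory index idea, the sketch below maps each lower-cased word to the positions of the text chunks that contain it. The names buildWordIndex and lookupWord are hypothetical, the chunks are assumed to be strings already extracted from the page (for example the values of its text nodes), and the word splitting only handles ASCII letters and digits.

```javascript
// Hypothetical helper: build a word -> chunk-position map once, then reuse it.
function buildWordIndex(chunks) {
    var index = new Map();
    chunks.forEach(function (chunk, position) {
        // Split on anything that is not an ASCII letter or digit.
        var words = chunk.toLowerCase().split(/[^a-z0-9]+/);
        words.forEach(function (word) {
            if (word === "") return;
            if (!index.has(word)) index.set(word, []);
            var positions = index.get(word);
            // Avoid recording the same chunk twice for a repeated word.
            if (positions[positions.length - 1] !== position) {
                positions.push(position);
            }
        });
    });
    return index;
}

// Hypothetical helper: return the positions of the chunks containing `word`.
function lookupWord(index, word) {
    return index.get(word.toLowerCase()) || [];
}
```

Building the index costs one pass over the text, after which each lookup is a single map access. That trade only pays off when the same page is searched many times.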

Probably the best thing to do is to get some sample documents that you feel will be representative and just do a whole lot of prototyping based around the approaches people have offered here.

    // "form" is assumed to reference the search form element.
    form.addEventListener("submit", (e) => {
        e.preventDefault();
        var keyword = document.getElementById("search_input");
        let words = keyword.value;
        var word = words,
            queue = [document.body],
            curr;
        while (curr = queue.pop()) {
            // Compare in upper case to make the search case-insensitive.
            if (!curr.textContent.toUpperCase().match(word.toUpperCase())) continue;
            for (var i = 0; i < curr.childNodes.length; ++i) {
                switch (curr.childNodes[i].nodeType) {
                    case Node.TEXT_NODE: // 3
                        if (curr.childNodes[i].textContent.toUpperCase().match(word.toUpperCase())) {
                            console.log("Found!");
                            console.log(curr);
                            curr.scrollIntoView();
                        }
                        break;
                    case Node.ELEMENT_NODE: // 1
                        queue.push(curr.childNodes[i]);
                        break;
                }
            }
        }
    });

Site design / logo © 2023 Stack Exchange Inc; user contributions licensed under CC BY-SA . rev 2023.7.27.43548

Search for Text in Loaded HTML document

Page when Contents are displayed

I have a web page on which my sidebar links cause an ‘external’ HTML document to be loaded into a content div.
However, after it is successfully loaded and displayed, the loaded HTML content does not appear in the Page Source.
Regardless, I now need to do a client-side text search of that ‘external’ HTML document using a JavaScript function.
My webpage looks like the following: the Search textbox and button are ‘outside’ of the Content Div (bordered in red).
And, at the time that one of the link’s HTML documents is appearing on-screen, the page source looks like:

Notice that the ‘loaded’ HTML document is not showing. I have found a JavaScript function findInPage() which looks promising, but it is not finding the ‘loaded’ HTML document and its text.

    // =====================================
    function findInPage() {
        var str = document.getElementById("ButtonForm").elements["txtSearch"].value;
        var n = 0;
        var txt, i, found;
        if (str == "") return false;
        // Find next occurrence of the given string on the page, wrap around to the
        // start of the page if necessary.
        if (window.find) {
            // Look for match starting at the current point. If not found, rewind
            // back to the first match.
            if (!window.find(str)) {
                while (window.find(str, false, true)) n++;
            } else {
                n++;
            }
            // If not found in either direction, give message.
            if (n == 0) alert("Not found.");
        } else if (window.document.body.createTextRange) {
            txt = window.document.body.createTextRange();
            // Find the nth match from the top of the page.
            found = true;
            i = 0;
            while (found === true && i <= n) {
                found = txt.findText(str);
                if (found) {
                    txt.moveStart("character", 1);
                    txt.moveEnd("textedit");
                }
                i++;
            }
            // If found, mark it and scroll it into view.
            if (found) {
                txt.moveStart("character", -1);
                txt.findText(str);
                txt.select();
                txt.scrollIntoView();
                n++;
            } else {
                // Otherwise, start over at the top of the page and find first match.
                if (n > 0) {
                    n = 0;
                    findInPage(str);
                }
                // Not found anywhere, give message.
                else alert("Not found.");
            }
        }
        return false;
    }

Is there some way to modify the function and/or use a different function such that it can find the ‘loaded’ HTML document and search it for the entered Text?
