Parsing HTML in Google Apps Script

Contents

How to parse an HTML string in Google Apps Script without using XmlService?
Method 1: RegExp and split
Method 2: DOMParser API
Method 3: jQuery Parse HTML
Method 4: Regular Expression and Replace
Parsing (Scraping) with Google Apps Script
Parsing HTML
tanaikech / submit.md

How to parse an HTML string in Google Apps Script without using XmlService?

Parsing an HTML string in Google Apps Script can be challenging, especially when XmlService is not an option, for example because the markup is not well-formed XML. This article explores several ways to parse an HTML string in Google Apps Script without XmlService.

Method 1: RegExp and split

Here is a step-by-step guide to parsing an HTML string in Google Apps Script without XmlService, using only RegExp and split:

Create a variable that holds the HTML string:

var html = "<div><p>Hello world!</p></div>";

Create a regular expression that matches HTML tags:

var regex = /<[^>]+>/g;

This regular expression matches any string that starts with a "<" character, ends with a ">" character, and contains no other ">" characters in between.

Use the split() method to split the HTML string into an array of strings, using the regular expression as the delimiter:

var tags = html.split(regex);

This splits the HTML string into an array of strings, where each string is a piece of text between two HTML tags.

Filter out the empty strings:

var text = tags.filter(function(str) { return str.trim().length > 0; });

This removes any empty strings from the array, leaving only the text between the HTML tags.

Join the remaining strings:

var result = text.join("");

This joins all the text strings into a single string, which is the parsed HTML without any tags.

Here is the complete code example:

var html = "<div><p>Hello world!</p></div>";
var regex = /<[^>]+>/g;
var tags = html.split(regex);
var text = tags.filter(function(str) { return str.trim().length > 0; });
var result = text.join("");

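This example works for simple markup, including nested tags and attributes, but it is not a real parser. A ">" character inside an attribute value, HTML comments, and the contents of script or style elements are handled incorrectly, and HTML entities such as &amp; are not decoded.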
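To make the behavior of the split approach concrete, here is a minimal sketch that wraps the steps in one function and runs it on two inputs. The function name extractTextPieces and both sample strings are illustrative assumptions, not part of the original example. Unlike the code above, it trims each piece before filtering.

```javascript
// Split an HTML string on its tags and return the non-empty, trimmed text pieces.
// extractTextPieces is a hypothetical helper name used only for this sketch.
function extractTextPieces(html) {
  var regex = /<[^>]+>/g;
  return html
    .split(regex)
    .map(function (piece) { return piece.trim(); })
    .filter(function (piece) { return piece.length > 0; });
}

// Nested tags and attributes are handled as expected.
var nested = '<div class="box"><p>Hello <b>world</b>!</p></div>';
var nestedPieces = extractTextPieces(nested); // ["Hello", "world", "!"]

// A ">" inside an attribute value ends the tag match too early,
// so part of the attribute leaks into the text.
var tricky = '<a title="a > b">link</a>';
var trickyPieces = extractTextPieces(tricky); // ['b">link']
```

The second input shows why this technique is best kept for markup you control or have inspected beforehand.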

Method 2: DOMParser API

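DOMParser is a browser API. It is not defined in server-side Apps Script code (.gs files), so this method applies only to client-side JavaScript running in an HtmlService page such as a dialog, sidebar, or web app. There, you can follow these steps, starting with a DOMParser instance: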

const parser = new DOMParser();

Use the parseFromString method of the DOMParser API to parse the HTML string. This method takes two arguments: the HTML string to parse and the MIME type of the document being parsed.

const htmlString = "<p>Hello, world!</p>";
const mimeType = "text/html";
const parsedHtml = parser.parseFromString(htmlString, mimeType);

You can now access the parsed HTML as a DOM tree. For example, to get the text content of the p element, you can use the textContent property.

const pElement = parsedHtml.querySelector("p"); const textContent = pElement.textContent; // "Hello, world!"

Here’s the complete code example:

// Client-side code (runs in the browser, e.g. inside an HtmlService page)
const parser = new DOMParser();
const htmlString = "<p>Hello, world!</p>";
const mimeType = "text/html";
const parsedHtml = parser.parseFromString(htmlString, mimeType);
const pElement = parsedHtml.querySelector("p");
const textContent = pElement.textContent; // "Hello, world!"

You can use this method to parse an HTML string on the client side of an Apps Script project without XmlService. It does not work in server-side functions, where DOMParser does not exist.

Method 3: jQuery Parse HTML

jQuery needs a browser DOM, so $.parseHTML cannot run in server-side Apps Script code. The Libraries dialog adds Apps Script libraries by script ID and cannot load a browser library such as jQuery. In client-side JavaScript inside an HtmlService page, you can follow these steps:

Load jQuery in the page with a script tag, for example from https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js.
Create a variable to hold your HTML string. For example:

var htmlString = "<p>Hello World!</p>";

Parse the string with $.parseHTML:

var parsedHTML = $.parseHTML(htmlString);

You can now manipulate the parsed HTML as DOM elements using jQuery or vanilla JavaScript. For example:

$(parsedHTML).filter('p').text('Hello Google Apps Script!');

This changes the text inside the <p> tag to "Hello Google Apps Script!". Because the <p> element sits at the top level of the parsed result, filter() is used here; find() only searches descendants.

Here’s the full code example:

// Client-side code for an HtmlService page that has already loaded jQuery, e.g. with
// <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.5.1/jquery.min.js"></script>
// Fetching jQuery with UrlFetchApp and eval() on the server is expected to fail,
// because server-side Apps Script has no window or document.

// Parse the HTML string
function parseHTML() {
  // HTML string to parse
  var htmlString = "<p>Hello World!</p>";
  // Parse the HTML string into an array of DOM nodes
  var parsedHTML = $.parseHTML(htmlString);
  // Manipulate the parsed HTML (the <p> is a top-level node, so use filter, not find)
  $(parsedHTML).filter('p').text('Hello Google Apps Script!');
  // Log the inner HTML of the manipulated element
  console.log($(parsedHTML).html());
}

Method 4: Regular Expression and Replace

To parse an HTML string in Google Apps Script without using XmlService, you can use Regular Expression and Replace. Here are the steps to do so:

First, create a regular expression pattern to match the HTML tags in the string. Here is an example pattern:

var pattern = /<[^>]*>/g;

This pattern matches any string that starts with < and ends with >, and has any number of characters in between that are not >.

Next, use the replace function to remove the matched tags:

var htmlString = "<p>Hello, world!</p>";
var plainText = htmlString.replace(pattern, "");

This code removes all the HTML tags from the htmlString variable and assigns the resulting plain text to the plainText variable.

If you want to preserve some of the text inside the HTML tags, you can modify the regular expression pattern to capture those parts. Here is an example pattern that captures the text inside <p> tags:

var pattern = /<p>([^<]*)<\/p>/g;

This pattern matches any string that starts with <p> and ends with </p>, and captures any number of characters in between that are not <.

You can then use the replace function with a callback function to replace each match with the captured text. Here is an example:

var htmlString = "<p>Hello, world!</p>";
var plainText = htmlString.replace(pattern, function(match, text) { return text; });

This code removes all the <p> tags from the htmlString variable and assigns the resulting plain text to the plainText variable.

These are the basic steps to parse an HTML string in Google Apps Script without using XmlService using Regular Expression and Replace.
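Capture groups are also useful for pulling structured data out of a string rather than just stripping tags. The following is a minimal sketch, assuming anchors written with double-quoted href attributes; the function name extractLinks, the pattern, and the sample markup are illustrative assumptions rather than part of the original method.

```javascript
// Collect the href and visible text of every <a href="..."> element in a string.
// extractLinks is a hypothetical helper; it only handles double-quoted href values.
function extractLinks(html) {
  var pattern = /<a\b[^>]*\bhref="([^"]*)"[^>]*>([\s\S]*?)<\/a>/gi;
  var links = [];
  var match;
  // exec() with the g flag advances through the string one match at a time.
  while ((match = pattern.exec(html)) !== null) {
    links.push({
      href: match[1],
      // Strip any tags nested inside the link text.
      text: match[2].replace(/<[^>]*>/g, "").trim()
    });
  }
  return links;
}

var sample = '<ul><li><a href="/a">First</a></li>' +
  '<li><a class="x" href="/b"><b>Second</b> item</a></li></ul>';
var links = extractLinks(sample);
// links[0] is { href: "/a", text: "First" }
// links[1] is { href: "/b", text: "Second item" }
```

The same caveats as for the other regex methods apply: unusual quoting or a ">" inside an attribute value will defeat the pattern.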


Parsing (Scraping) with Google Apps Script

The script below automatically finds and exports information about job postings on a freelance marketplace.

The domain name is read from a Google Sheets spreadsheet (cell A1 of Sheet1 in the code). The search results are written to the same sheet.

The getBlock function finds a fragment of HTML (a block) inside a tag, usually located by a unique attribute value of that tag, and returns the block as a string (without the tag itself).
The deleteBlock function does the opposite: it removes the matched fragment from the block and returns the rest of the block as a string.
Unlike the first two functions, getOpenTag does not remove the matched tag but returns it as a string. It returns only the first (opening) part of the tag, though, not the whole element.

function scraper() {
  const ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Sheet1');
  const urlDomain = ss.getRange(1, 1).getValue();
  let count = 0;
  for (let page = 1; page < 5; page++) {
    // The first page has no page number in its URL
    const url = (page == 1) ? urlDomain : urlDomain + page + '/';
    const response = UrlFetchApp.fetch(url);
    ss.getRange(2, 1).setValue(response.getResponseCode());
    const html = response.getContentText();
    let p = 0;
    while (true) {
      const out = getBlock(html, 'div', html.indexOf('class="JobSearchCard-primary"', p));
      const block = out[0];
      p = out[1] + 1;
      if (p == 0) break; // getBlock returned -1: no more cards on this page

      const title1 = getBlock(block, 'div', 0)[0];
      const title = getBlock(title1, 'a', 0)[0];
      let link = getOpenTag(title1, 'a', 0);
      link = getAttrName(link, 'href', 0);
      const formula = '=HYPERLINK("https://www.freelancer.com' + link + '", "' + title + '")';
      ss.getRange(3 + 3 * count, 2).setValue(formula);

      let price = getBlock(block, 'div', block.indexOf('class="JobSearchCard-primary-price'))[0];
      if (price.includes('span')) price = deleteBlock(price, 'span', price.indexOf('span'));
      ss.getRange(3 + 3 * count + 1, 2).setValue(price).setHorizontalAlignment('right');

      const description = getBlock(block, 'p', block.indexOf('class="JobSearchCard-primary-description"'))[0];
      ss.getRange(3 + 3 * count, 1, 3).mergeVertically().setValue(description)
        .setBorder(true, true, true, true, null, null, '#000000', SpreadsheetApp.BorderStyle.SOLID)
        .setVerticalAlignment('middle')
        .setWrapStrategy(SpreadsheetApp.WrapStrategy.WRAP);
      ss.getRange(3 + 3 * count, 2, 3)
        .setBorder(true, true, true, true, null, null, '#000000', SpreadsheetApp.BorderStyle.SOLID);

      let cat = getBlock(block, 'div', block.indexOf('class="JobSearchCard-primary-tags"'))[0];
      // The split delimiter was lost in the source text; '</a>' is an assumption
      cat = cat.split('</a>').map(item => item.split('>')[1]);
      cat.pop();
      cat = cat.join(', ');
      ss.getRange(3 + 3 * count + 2, 2).setValue(cat);
      count++;
    }
  }
}

// Return the value of attribute `attr`, searching from index i
function getAttrName(html, attr, i) {
  let idxStart = html.indexOf(attr + '=', i);
  if (idxStart == -1) return "Can't find attr " + attr + ' !';
  idxStart = html.indexOf('"', idxStart) + 1;
  let idxEnd = html.indexOf('"', idxStart);
  return html.slice(idxStart, idxEnd).trim();
}

// Return the opening part of the tag, e.g. '<a href="...">'
// (the start of this function was garbled in the source and is reconstructed
// on the pattern of getBlock and deleteBlock)
function getOpenTag(html, tag, idxStart) {
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  // where we are?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return "Can't find openTag " + openTag + ' !';
  }
  let idxEnd = html.indexOf('>', idxStart) + 1;
  if (idxEnd == 0) return "Can't find closing bracket '>' for openTag!";
  return html.slice(idxStart, idxEnd).trim();
}

function deleteBlock(html, tag, idxStart) {
  // delete opening & closing tag and info between them
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  let closeTag = '</' + tag + '>';
  let lenCloseTag = closeTag.length;
  let countCloseTags = 0; // nesting depth of inner tags with the same name
  let iMax = html.length;
  let idxEnd = 0;
  // where we are?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return ["Can't find openTag " + openTag + ' !', -1];
  }
  // begin loop after openTag
  let i = html.indexOf('>', idxStart) + 1;
  while (i < iMax) {
    if (html[i] === '<') {
      if (html.slice(i, i + lenCloseTag) === closeTag) {
        if (countCloseTags === 0) {
          idxEnd = i + lenCloseTag;
          break;
        } else {
          countCloseTags -= 1;
        }
      } else if (html.slice(i, i + lenOpenTag) === openTag) {
        countCloseTags += 1;
      }
    }
    i++;
  }
  return (html.slice(0, idxStart) + html.slice(idxEnd, iMax)).trim();
}

function getBlock(html, tag, idxStart) {
  // Returns [inner HTML of the block, index of its last character]
  let openTag = '<' + tag;
  let lenOpenTag = openTag.length;
  let closeTag = '</' + tag + '>';
  let lenCloseTag = closeTag.length;
  let countCloseTags = 0; // nesting depth of inner tags with the same name
  let iMax = html.length;
  let idxEnd = 0;
  // where we are?
  if (html.slice(idxStart, idxStart + lenOpenTag) != openTag) {
    idxStart = html.lastIndexOf(openTag, idxStart);
    if (idxStart == -1) return ["Can't find openTag " + openTag + ' !', -1];
  }
  // change start - will start after openTag!
  idxStart = html.indexOf('>', idxStart) + 1;
  let i = idxStart;
  while (i < iMax) {
    if (html[i] === '<') {
      if (html.slice(i, i + lenCloseTag) === closeTag) {
        if (countCloseTags === 0) {
          idxEnd = i - 1;
          break;
        } else {
          countCloseTags -= 1;
        }
      } else if (html.slice(i, i + lenOpenTag) === openTag) {
        countCloseTags += 1;
      }
    }
    i++;
  }
  return [html.slice(idxStart, idxEnd + 1).trim(), idxEnd];
}
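The core idea behind getBlock is to walk the string after the opening tag and keep a counter of nested tags with the same name, so that the matching closing tag is found rather than the first one. The following is a simplified, self-contained sketch of that idea, not the author's code: the function name innerBlock, its null return value, and the sample markup are assumptions made for illustration.

```javascript
// Return the inner HTML of the first <tag> element found at or after `from`,
// or null when no complete element is found.
// Like the script above, it matches tags by prefix, so "a" would also match "<abbr".
function innerBlock(html, tag, from) {
  var openTag = "<" + tag;
  var closeTag = "</" + tag + ">";
  var start = html.indexOf(openTag, from || 0);
  if (start === -1) return null;
  var contentStart = html.indexOf(">", start) + 1;
  if (contentStart === 0) return null;
  var depth = 0; // number of nested same-name tags currently open
  for (var i = contentStart; i < html.length; i++) {
    if (html[i] !== "<") continue;
    if (html.startsWith(closeTag, i)) {
      if (depth === 0) return html.slice(contentStart, i).trim();
      depth -= 1;
    } else if (html.startsWith(openTag, i)) {
      depth += 1;
    }
  }
  return null;
}

var card = '<div class="card"><div class="title"><a href="/job/1">Fix a bug</a></div>' +
  '<div class="price">$30</div></div><div class="card">next</div>';
var firstCard = innerBlock(card, "div", 0);
var title = innerBlock(firstCard, "a", 0); // "Fix a bug"
```

Here firstCard contains both inner div elements, because the counter skips their closing tags before reaching the one that closes the outer card.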


Parsing HTML

The XML Service can be used to parse HTML. But it can be a bit cumbersome to navigate through the DOM tree.

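In the examples below we will see how to make that easier with helper functions such as getElementById(), getElementsByClassName(), and getElementsByTagName(). These are custom functions built on top of the XML Service, not methods that it provides.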

For example, with a few lines of code, you could grab the menu of a Wikipedia page to display it through an Apps Script web app.

var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText();
var doc = XmlService.parse(html);
var root = doc.getRootElement();
var menu = getElementsByClassName(root, 'vertical-navbox nowraplinks')[0];
var output = XmlService.getRawFormat().format(menu);

1. We fetch the HTML through UrlFetch
2. We use the XMLService to parse this HTML
3. Then we can use a specific function to grab the element we want in the DOM tree (like getElementsByClassName)
4. And we convert this element back to HTML
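The helper functions used in step 3 are not shown in this text. The following is a minimal sketch of how getElementsByClassName and getElementsByTagName could be written, assuming elements that expose the XmlService Element methods getChildren(), getAttribute(), and getName(). The shared helper collectElements is a hypothetical name, and this is not necessarily how the original helpers were implemented.

```javascript
// Collect `element` and every descendant that passes `test`, in document order.
// collectElements is a hypothetical helper introduced for this sketch.
function collectElements(element, test, found) {
  found = found || [];
  if (test(element)) found.push(element);
  element.getChildren().forEach(function (child) {
    collectElements(child, test, found);
  });
  return found;
}

function getElementsByClassName(element, classToFind) {
  return collectElements(element, function (el) {
    var attr = el.getAttribute("class"); // null when the attribute is absent
    if (attr === null) return false;
    var value = attr.getValue();
    // Match either the full attribute value or any single class in it.
    return value === classToFind || value.split(/\s+/).indexOf(classToFind) !== -1;
  });
}

function getElementsByTagName(element, tagName) {
  return collectElements(element, function (el) {
    return el.getName() === tagName;
  });
}
```

Matching the full attribute value keeps calls such as getElementsByClassName(root, 'vertical-navbox nowraplinks') working, while the per-class check also allows a search for a single class name.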

Or we could get all the links / anchors available in this menu and display them

var html = UrlFetchApp.fetch('http://en.wikipedia.org/wiki/Document_Object_Model').getContentText();
var doc = XmlService.parse(html);
var root = doc.getRootElement();
var menu = getElementsByClassName(root, 'vertical-navbox nowraplinks')[0];
var linksInMenu = getElementsByTagName(menu, 'a');
var output = '';
for (var i = 0; i < linksInMenu.length; i++) {
  output += XmlService.getRawFormat().format(linksInMenu[i]) + '<br>';
}


tanaikech / submit.md

This is a sample script for parsing HTML using Google Apps Script. When HTML data is converted to a Google Document, the HTML is parsed during the conversion, including its paragraphs, lists, and tables. I thought this behavior could be used for parsing HTML with Google Apps Script, and came up with this method.

In the Sheets API, HTML data can be put into a Spreadsheet with PasteDataRequest. Unfortunately, in that case I couldn't distinguish between the body and the tables.

The flow of this method is as follows. In this sample script, the tables are retrieved from the HTML.

Retrieve the HTML data using UrlFetchApp.fetch().
Create a new Google Document by converting the HTML data with the Drive API.
- This is a temporary file.
Retrieve all tables using the Document service of Google Apps Script.
Delete the temporary file.

Before you run this script, please enable the Drive API at Advanced Google Services. The script uses Drive API v2 names (Drive.Files.insert and the title property).

function parseTablesFromHTML(url) {
  var html = UrlFetchApp.fetch(url);
  // Convert the HTML to a temporary Google Document
  var docId = Drive.Files.insert(
    { title: "temporalDocument", mimeType: MimeType.GOOGLE_DOCS },
    html.getBlob()
  ).id;
  var tables = DocumentApp.openById(docId)
    .getBody()
    .getTables();
  // Read every table cell by cell into a 2D array
  var res = tables.map(function(table) {
    var values = [];
    for (var row = 0; row < table.getNumRows(); row++) {
      var temp = [];
      var cols = table.getRow(row);
      for (var col = 0; col < cols.getNumCells(); col++) {
        temp.push(cols.getCell(col).getText());
      }
      values.push(temp);
    }
    return values;
  });
  // Delete the temporary document
  Drive.Files.remove(docId);
  return res;
}

// Please run this function.
function run() {
  var url = "###"; // Set the URL of the page to parse here.
  var res = parseTablesFromHTML(url);
  Logger.log(res);
}

As a test case, when you set https://gist.github.com/tanaikech/f52e391b68473cbf6d4ab16108dcfbbb to url and run the script, the following result can be retrieved.

[
  [
    ["head1_1", "head1_2", "head1_3\n"],
    ["value1_a1", "value1_b1", "value1_c1"],
    ["value1_a2", "value1_b2", "value1_c2"]
  ],
  [
    ["head2_1", "head2_2", "head2_3\n"],
    ["value2_a1", "value2_b1", "value2_c1"],
    ["value2_a2", "value2_b2", "value2_c2"]
  ]
]
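The result is an array of tables, each a 2D array of cell texts, and some cells carry a trailing "\n". As a minimal sketch of one way to post-process it, the function below treats the first row of each table as a header and turns the remaining rows into objects. The function name tablesToObjects and the header-row assumption are illustrative and not part of the original script.

```javascript
// Convert each table (2D array) into an array of records keyed by its header row.
// Assumes the first row of every table is a header; cell texts are trimmed.
function tablesToObjects(tables) {
  return tables.map(function (table) {
    if (table.length === 0) return [];
    var header = table[0].map(function (cell) { return cell.trim(); });
    return table.slice(1).map(function (row) {
      var record = {};
      header.forEach(function (key, index) {
        record[key] = (row[index] || "").trim();
      });
      return record;
    });
  });
}

// First table of the test result shown above.
var sampleTables = [
  [
    ["head1_1", "head1_2", "head1_3\n"],
    ["value1_a1", "value1_b1", "value1_c1"],
    ["value1_a2", "value1_b2", "value1_c2"]
  ]
];
var records = tablesToObjects(sampleTables);
// records[0][0] is { head1_1: "value1_a1", head1_2: "value1_b1", head1_3: "value1_c1" }
```

Trimming the header cells removes the trailing newline, so the third key is "head1_3" rather than "head1_3\n".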

Using this method, all paragraphs and lists can also be retrieved.
This method can also be used with other languages.
