Vba html to string

Issue: I want to grab the text from an .html file (with all of its tags) and have it converted to normal, readable text (without the tags), as it would appear in a browser. I’m currently using the following:

Function HtmlToText(sHTML) As String
    Dim oDoc As HTMLDocument
    Set oDoc = New HTMLDocument
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText
End Function

I’m passing in a string which includes all of the tags, and I am getting a normal text string back out. The issue is that it is removing all of the formatting. Changes in em size, bold tags, italic tags, and underline tags are all removed and the text is just rendered plain. Is there any way to take text from an html file and get it pasted into word so that it looks the same as it looks in the browser (i.e. retaining font size changes, bold, italic, underline, etc.)? If I can get this to work, it will save hundreds of hours of work. Your help is truly appreciated.

Additional context for those that want it: We have folders with thousands of html files in them. Some of these files (several dozen) are selected to be used and the file names are marked in a spreadsheet. The content of the selected files then needs to be put into a single word document with an informational header placed above the content from each html file. This was an entirely manual process — someone would go to the folder, find each individual file, open it, copy the relevant text, type in the informational header, and then paste the text from the html file into the word doc. This was being constantly done over and over, so I wanted to see if we could automate the process. Once the files were selected and recorded in the excel doc, I wanted Word VBA to build the final doc by using the information that was in excel. The good news is that I’m 99% there. I just really need some help with the last part. So far, I can select the excel file from word, and word will create all the headers, find the correct files, get the html text, convert it to plain text, and insert it into Word. I just need it to retain the html formatting when it gets copied over.
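
One approach that may fit this workflow, sketched under assumptions: rather than converting the HTML to a string, let Word import the file itself with InsertFile, which runs Word's own HTML converter and generally keeps bold, italic, underline, and font-size changes. The procedure name, its parameters, and the path in the usage line are hypothetical, and how closely the result matches the browser depends on Word's converter.

```vba
' Sketch for Word VBA. InsertHtmlWithHeader, htmlPath and headerText are
' hypothetical names, not taken from the original post.
Sub InsertHtmlWithHeader(ByVal htmlPath As String, ByVal headerText As String)
    ' Move the cursor to the end of the document.
    Selection.EndKey Unit:=wdStory

    ' Type the informational header on its own paragraph.
    Selection.TypeText Text:=headerText
    Selection.TypeParagraph

    ' Let Word convert the HTML file in place, keeping its formatting.
    Selection.InsertFile FileName:=htmlPath, ConfirmConversions:=False
    Selection.TypeParagraph
End Sub
```

A call such as InsertHtmlWithHeader "C:\pages\example.html", "Header for example.html" would then replace the "get HTML text, convert, insert" steps for one file.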

VBA Excel: Parsing websites, HTML pages, and files

The user-defined function GetHTML1 (VBA Excel) extracts (parses) the text content of a web page from its URL using the "msxml2.xmlhttp" object:

Parsing websites (WinHttp.WinHttpRequest.5.1)
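
A minimal sketch of what a WinHttp-based variant could look like, assuming late binding and a synchronous GET request. The function name GetHTML2 is hypothetical and chosen only to parallel GetHTML1.

```vba
' Hypothetical helper: fetch the text of a page with WinHttp.WinHttpRequest.5.1.
Function GetHTML2(ByVal myURL As String) As String
    On Error GoTo Failed
    With CreateObject("WinHttp.WinHttpRequest.5.1")
        .Open "GET", myURL, False   ' False = synchronous request
        .Send
        GetHTML2 = .ResponseText
    End With
    Exit Function
Failed:
    GetHTML2 = ""                   ' return an empty string on any error
End Function
```

Returning an empty string on failure keeps the calling code simple, at the cost of hiding the reason for the failure.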

Parsing files (ADODB.Stream)
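
A sketch of reading a local HTML file into a string with ADODB.Stream, assuming late binding and a UTF-8 file by default. The function name GetFileText and its parameters are hypothetical.

```vba
' Hypothetical helper: read a text file in a given encoding via ADODB.Stream.
Function GetFileText(ByVal myPath As String, _
                     Optional ByVal myCharset As String = "utf-8") As String
    With CreateObject("ADODB.Stream")
        .Type = 2                  ' 2 = adTypeText
        .Charset = myCharset       ' set the encoding before opening the stream
        .Open
        .LoadFromFile myPath
        GetFileText = .ReadText    ' with no argument, reads the whole stream
        .Close
    End With
End Function
```

Passing a different charset name (for example "windows-1251") is the usual reason to prefer this over plain Open ... For Input.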

Examples of writing text to a variable

Extracting data from HTML

Besides extracting the text content of websites, HTML pages, or files, the term "parsing" also covers searching for and extracting specific data from the retrieved text. For an example of extracting email addresses from text assigned to a variable, see the last section of the article "Regular expressions (objects, properties, methods)".

Parsing the contents of tags

Parsing the contents of an Id
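
A sketch of reading an element by its id with a late-bound HTMLFile object. The procedure name, the sample markup, and the id "price" are made up for illustration.

```vba
' Hypothetical example: read the text of the element with a given id.
Sub ParseById()
    Dim myFile As Object, myElem As Object

    Set myFile = CreateObject("HTMLFile")
    ' Sample markup; in practice this would be the HTML fetched earlier.
    myFile.body.innerHTML = "<div id=""price""><b>42</b> USD</div>"

    Set myElem = myFile.getElementById("price")
    ' getElementById returns Nothing when no element has that id.
    If Not myElem Is Nothing Then Debug.Print myElem.innerText
End Sub
```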

The examples presented here may require additional libraries. I currently have the following ones enabled (the last six may be relevant to this topic):

  • Visual Basic For Applications
  • Microsoft Excel 16.0 Object Library
  • OLE Automation
  • Microsoft Office 16.0 Object Library
  • Microsoft Forms 2.0 Object Library
  • Ref Edit Control
  • Microsoft Scripting Runtime
  • Microsoft Word 16.0 Object Library
  • Microsoft Windows Common Controls 6.0 (SP6)
  • Microsoft ActiveX Data Objects 6.1 Library
  • Microsoft ActiveX Data Objects Recordset 6.0 Library
  • Microsoft HTML Object Library
  • Microsoft Internet Controls
  • Microsoft Shell Controls And Automation
  • Microsoft XML, v6.0

All of the examples work with this set of libraries. Testing was done in VBA Excel 2016.

6 comments on "VBA Excel: Parsing websites, HTML pages, and files"

Good day, colleagues.
The task is as follows: in a WebBrowser control, I run a query on a search page and, once I have visually confirmed that the result is what I need, I start parsing the content. The question is how to get the text content of the WebBrowser control at the moment when the required information is displayed in it.
Thanks in advance.

Good day!
When I try to get, for example, the first occurrence, it still finds the next one for some reason.

Sub Primer1()
    Dim myHtml As String, myFile As Object, myTag As Object, myTxt As String
    'Load the contents of the HTML page into myHtml using the GetHTML1 function
    myHtml = GetHTML1("https://bik-info.ru?040702802")
    'Create an HTMLFile object
    Set myFile = CreateObject("HTMLFile")
    'Write the text from myHtml into myFile
    myFile.body.innerHTML = myHtml
    'Assign to myTag the collection of tags whose name is
    'passed as the argument to getElementsByTagName
    Set myTag = myFile.getElementsByTagName("strong")
    'Choose which tag in the collection (zero-based index) to read
    myTxt = myTag(0).innerText
    Debug.Print myTxt
    'Long text may not fit in a MsgBox; in that case use the Immediate window
    'Debug.Print myTxt
End Sub

Function GetHTML1(ByVal myURL As String) As String
    On Error Resume Next
    With CreateObject("msxml2.xmlhttp")
        .Open "GET", myURL, False
        .send
        Do: DoEvents: Loop Until .readyState = 4
        GetHTML1 = .responseText
    End With
End Function

Hello, Sergey!
On the HTML page you specified, the strong tag marks only two identical dates. I assume the first date is displayed. If you replace strong with b, the $ sign is displayed, since it is the first item with that tag.

Convert html to plain text in VBA

A very simple way to extract text is to scan the HTML character by character, and accumulate characters outside of angle brackets into a new string.

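Function StripTags(ByVal html As String) As String
    Dim text As String
    Dim accumulating As Boolean
    Dim n As Long   'Long rather than Integer, so inputs over 32,767 characters do not overflow
    Dim c As String

    text = ""
    accumulating = True

    n = 1
    Do While n <= Len(html)
        c = Mid(html, n, 1)
        If c = "<" Then
            'Entering a tag: stop collecting characters
            accumulating = False
        ElseIf c = ">" Then
            'Leaving a tag: resume collecting characters
            accumulating = True
        Else
            If accumulating Then
                text = text & c
            End If
        End If
        n = n + 1
    Loop

    StripTags = text
End Function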

This can leave lots of extraneous whitespace, but it will help in removing the tags.

Solution 3

Tim’s solution was great, worked like a charm.

I’d like to contribute: use this code to add the "Microsoft HTML Object Library" at run time:

It worked on Windows XP and Windows 7.
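
As a sketch of one way to add that reference at run time (not the poster's own code): the VBProject.References collection has an AddFromGuid method. The procedure name is hypothetical, the GUID and version 4.0 are assumed to identify the Microsoft HTML Object Library, ThisWorkbook assumes Excel, and "Trust access to the VBA project object model" must be enabled in the Trust Center.

```vba
' Hypothetical helper: add the Microsoft HTML Object Library reference from code.
Sub AddHtmlLibraryReference()
    ' Assumed type-library GUID for the Microsoft HTML Object Library (MSHTML).
    Const MSHTML_GUID As String = "{3050F1C5-98B5-11CF-BB82-00AA00BDCE0B}"
    Dim ref As Object

    ' Do nothing if the reference is already present.
    For Each ref In ThisWorkbook.VBProject.References
        If StrComp(ref.GUID, MSHTML_GUID, vbTextCompare) = 0 Then Exit Sub
    Next ref

    ' Major and minor version 4.0 are assumed here.
    ThisWorkbook.VBProject.References.AddFromGuid MSHTML_GUID, 4, 0
End Sub
```

Late binding with CreateObject("HTMLFile") avoids the reference altogether, at the cost of losing IntelliSense and compile-time type checks.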

Solution 4

Tim’s answer is excellent. However, a minor adjustment can be added to avoid one foreseeable error response.

Function HtmlToText(sHTML) As String
    Dim oDoc As HTMLDocument
    'Return an empty string for Null input instead of raising an error
    If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
    End If
    Set oDoc = New HTMLDocument
    oDoc.body.innerHTML = sHTML
    HtmlToText = oDoc.body.innerText
End Function

Solution 5

Yes! I managed to solve my problem as well. Thanks, everybody!

In my case, I had this sort of input:

<p>Lorem ipsum dolor sit amet.</p>

<p>Ut enim ad minim veniam.</p>

<p>Duis aute irure dolor in reprehenderit.</p>

And I did not want the result to be all jammed together without line breaks.

So I first split my input on every <p> tag into an array ‘paragraphs’, then for each element I used Tim’s answer to get the text out of the HTML (very sweet answer btw).

In addition, I concatenated each cleaned ‘paragraph’ with the line-break character Chr(10) for VBA/Excel.

The final code is:

Public Function HtmlToText(ByVal sHTML As String) As String
    Dim oDoc As HTMLDocument
    Dim result As String
    Dim paragraphs() As String
    Dim paragraph As Variant

    If IsNull(sHTML) Then
        HtmlToText = ""
        Exit Function
    End If

    result = ""
    'Split the input on each opening paragraph tag
    paragraphs = Split(sHTML, "<p>")

    For Each paragraph In paragraphs
        Set oDoc = New HTMLDocument
        oDoc.body.innerHTML = paragraph
        'Separate the cleaned paragraphs with blank lines
        result = result & Chr(10) & Chr(10) & oDoc.body.innerText
    Next paragraph

    HtmlToText = result
End Function
