PHP: new DOMDocument and UTF-8

Introduction

The DOM extension lets you work with XML documents through the DOM API in PHP.

Note:

The DOM extension uses UTF-8 encoding. Use mb_convert_encoding(), UConverter::transcode(), or iconv() to work with other encodings.

User Contributed Notes 2 notes

Be careful when using this for partial HTML. This will only take complete HTML documents with at least an HTML element and a BODY element. If you are working on partial HTML and you fill in the missing elements around it and don’t specify in META elements the character encoding then it will be treated as ISO-8859-1 and will mangle UTF-8 strings. Example:

$body = getHtmlBody();
$doc = new DOMDocument();
$doc->loadHTML("<html><body>" . $body . "</body></html>");
// $doc will treat your HTML as ISO-8859-1.
// This is correct but may not be what you want if your source is UTF-8.

$body = getHtmlBody();
$doc = new DOMDocument();
$doc->loadHTML("<html><head><meta http-equiv='Content-Type' content='text/html; charset=UTF-8'></head><body>" . $body . "</body></html>");
// $doc will treat your HTML correctly as UTF-8.

To load HTML as UTF-8, it is not enough that the input string itself is encoded in UTF-8; the encoding also has to be declared to the parser.
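A minimal sketch of that idea, assuming the input is an HTML fragment already encoded in UTF-8. The helper name loadUtf8Fragment() is hypothetical and not part of the DOM extension; it only wraps the fragment in a document whose META element declares the charset.

```php
<?php
// Hypothetical helper (not part of the DOM extension): wraps a UTF-8 HTML
// fragment in a full document that declares its charset, then parses it.
function loadUtf8Fragment(string $fragment): DOMDocument
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"></head>'
        . '<body>' . $fragment . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the previous setting
    return $doc;
}

$doc  = loadUtf8Fragment('<p>Привет, мир</p>');
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, PHP_EOL;
```

The text read back from the document should match the input string byte for byte, because the parser was told the source is UTF-8.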


    DOMDocument::__construct

    The constructor arguments are useful if you want to build a new document using createElement, appendChild, etc.

    By contrast, these arguments are overridden as soon as you load a document from source by calling load() or loadXML().

    * If the source contains an XML declaration specifying an encoding, that encoding is used.
    * If the XML declaration does not specify an encoding, or if the source does not contain a declaration at all, UTF-8 is assumed.

    This behaviour applies no matter what you declared when you called new DOMDocument().
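A short sketch of this behaviour, assuming an in-memory XML string; the variable names are illustrative only. The encoding passed to the constructor is visible on the empty document, but it is replaced by the encoding declared in the loaded source.

```php
<?php
// The constructor arguments apply to the empty document only.
$doc = new DOMDocument('1.0', 'ISO-8859-1');
$before = $doc->encoding;

// Loading a source replaces them with whatever the XML declaration says.
$doc->loadXML('<?xml version="1.0" encoding="UTF-8"?><root/>');
$after = $doc->encoding;

echo $before, ' -> ', $after, PHP_EOL; // prints "ISO-8859-1 -> UTF-8"
```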

    Be careful when using the encoding parameter in the constructor.
    It does not mean that all data is automatically encoded for you in the supplied encoding. You need to do that yourself once you choose an encoding other than the default UTF-8. See the note on DOM Functions on how to properly work with other encodings.

    The constructor example clearly shows that version and encoding only end up in the XML header.

    Make sure that php_domxml.dll on Windows is removed before using the DOMDocument class, as the two cannot coexist.

    Not sure if this is what you meant when you said "The constructor example clearly shows that version and encoding only end up in the XML header", but you can also affect other parameters in the generated XML header by accessing the DOMDocument's properties, for example:

    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->xmlStandalone = false;
    echo $dom->saveXML();

    domdocument::domdocument() expects at least 1 parameter
    In my case this was caused by Zend Optimizer. After uninstalling it everything works well, although it runs more slowly :-(.

    Comment:
    item 1: 2008-10-03 17:10:58, gkrong said:
    "Warning: domdocument::domdocument() expects at least 1 parameter"
    If you use PHP 5 on Windows, you don't need to load php_domxml.dll in your php.ini file, so you can comment out the php_domxml.dll line there.
    You only need to comment it out; do not delete the php_domxml.dll file in the ext directory.

    If you get the error message "domdocument::domdocument() expects parameter 2 to be long, string given" for a code sample like this:

    $dom = new DOMDocument('1.0', 'UTF-8');
    $dom->xmlStandalone = false;
    echo $dom->saveXML();

    which is obviously correct if you compare it with the constructor signature:

    __construct ([ string $version [, string $encoding ]] )

    make sure you're not overwriting this DOM library with another one (e.g. extension=php_domxml.dll in php.ini). XAMPP, for example, ships its standard version with php_domxml.dll enabled, which leads to this error message.

    To expand on bholbrook's comment: if you receive "Warning: domdocument::domdocument() expects at least 1 parameter", it is due to the old domxml extension, which you need to disable.

    domxml overwrites DOMDocument::__construct with an alias to domxml_open_mem, so this code:
    $doc = new DOMDocument();
    ...essentially does this:
    $dom = domxml_open_mem();
    ...which is why PHP complains about expecting at least 1 parameter (it expects a string of XML).

    In this post http://softontherocks.blogspot.com/2014/11/descargar-el-contenido-de-una-url_11.html I found a simple way to get the content of a URL with DOMDocument, loadHTMLFile and saveHTML().

    function getURLContent($url) {
        $doc = new DOMDocument;
        $doc->preserveWhiteSpace = FALSE;
        @$doc->loadHTMLFile($url); // @ suppresses warnings about malformed markup
        return $doc->saveHTML();
    }


      DOMDocument::loadHTML

      The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. This function may also be called statically to load and create a DOMDocument object. The static invocation may be used when no DOMDocument properties need to be set prior to loading.

      Parameters

      Since Libxml 2.6.0, the options parameter can also be used to specify additional Libxml parameters.
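As an illustration of the options parameter, the sketch below passes two libxml constants that PHP exposes, LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD, so that the parser does not add implied html/body elements or a default doctype. The input string is an assumption made for the example, and the exact serialized output can vary between libxml versions.

```php
<?php
$doc = new DOMDocument();

// Bitwise-OR the libxml constants to combine several parser options.
$doc->loadHTML('<p>Hello</p>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// With these options the fragment is serialized without a wrapping
// <html>/<body> pair and without a DOCTYPE line.
$html = trim($doc->saveHTML());
echo $html, PHP_EOL;
```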

      Return Values

      Returns true on success or false on failure. If called statically, returns a DOMDocument object or false on failure.

      Errors/Exceptions

      If an empty string is passed as source, a warning is generated. This warning is not generated by libxml, so it cannot be handled with libxml's error handling functions.

      Prior to PHP 8.0.0 this method could be called statically, but doing so raised an E_DEPRECATED error. As of PHP 8.0.0, calling this method statically throws an Error exception.

      Although malformed HTML usually loads successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml's error handling functions can be used to handle these errors.

      Examples

      Example #1 Creating a document
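A minimal sketch of creating a document from an HTML string; the markup used here is an assumption chosen for illustration.

```php
<?php
$doc = new DOMDocument();

// Parse an HTML string; the markup does not need to be well-formed XML.
$doc->loadHTML('<html><body>Test<br></body></html>');

// Serialize the parsed document back to an HTML string.
$output = $doc->saveHTML();
echo $output;
```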

      See Also

      • DOMDocument::loadHTMLFile(): Load HTML from a file
      • DOMDocument::saveHTML(): Dumps the internal document into a string using HTML formatting
      • DOMDocument::saveHTMLFile(): Dumps the internal document into a file using HTML formatting

      User Contributed Notes 19 notes

      You can also load HTML as UTF-8 using this simple hack:

      $doc = new DOMDocument();
      $doc->loadHTML('<?xml encoding="UTF-8">' . $html);

      // dirty fix
      foreach ($doc->childNodes as $item)
          if ($item->nodeType == XML_PI_NODE)
              $doc->removeChild($item); // remove hack
      $doc->encoding = 'UTF-8'; // insert proper

      DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.

      This isn't well documented here. The solution is to use a separate apparatus for dealing with just these errors.

      Call libxml_use_internal_errors(true) before calling loadHTML. This prevents the errors from bubbling up to your default error handler, and you can then retrieve them (if you wish) using the other libxml error functions.
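A sketch of that approach using libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors(). The markup is an assumption chosen to be imperfect; which errors are reported, if any, depends on the libxml version in use.

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML('<html><body><p>Unclosed paragraph<section>HTML5 tag</section></body></html>');

// Each entry is a LibXMLError object with level, code, line and message.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

libxml_clear_errors();                  // empty the internal error buffer
libxml_use_internal_errors($previous);  // restore the previous setting
```

Restoring the previous setting keeps the change local, which matters when the code runs inside a larger application that relies on the default error handling.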

      When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, if you expect "Cạnh tranh", you will get "Cáº¡nh tranh" instead. I suggest using mb_convert_encoding before loading a UTF-8 page:
      $pageDom = new DOMDocument();
      $searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', 'UTF-8');
      @$pageDom->loadHTML($searchPage);
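The 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated as of PHP 8.2. One possible alternative, sketched here under the assumption that the mbstring extension is available, is mb_encode_numericentity(), which turns every non-ASCII character into a numeric entity before parsing. The conversion map and sample string are choices made for this example.

```php
<?php
$htmlUtf8 = '<p>Cạnh tranh</p>';

// Conversion map: [first code point, last code point, offset, mask].
// This one covers everything from U+0080 upward.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($htmlUtf8, $convmap, 'UTF-8');
echo $encoded, PHP_EOL; // prints "<p>C&#7841;nh tranh</p>"

// The entity-encoded string is pure ASCII, so the parser's guess about
// the source encoding no longer matters.
$doc = new DOMDocument();
$doc->loadHTML($encoded);
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
```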

      Pay attention when loading HTML that uses a charset other than ISO-8859-1. Since this method does not actively try to figure out which encoding the HTML you are loading uses (as most browsers do), you have to specify it in the HTML head. If, for instance, your HTML is in UTF-8, make sure you have a meta tag in the head section:

      <meta http-equiv="content-type" content="text/html; charset=utf-8">

      If you do not specify the charset like this, all high-ASCII bytes will be HTML-encoded. It is not enough to set the DOM document you are loading the HTML into to UTF-8.

      Warning: This does not function well with HTML5 elements such as SVG. Most of the advice on the Web is to turn off errors in order to have it work with HTML5.

      If we load HTML5 tags such as <section>, we get the following error:

      DOMDocument::loadHTML(): Tag section invalid in Entity

      We can disable standard libxml errors (and enable user error handling) using libxml_use_internal_errors(true); before loadHTML();

      This is quite useful in phpunit custom assertions as given in following example (if using phpunit test cases):

      // Create a DOMDocument
      $dom = new DOMDocument();

      // fix html5/svg errors
      libxml_use_internal_errors(true);

      // Load html
      $dom->loadHTML("<section></section>");
      $htmlNodes = $dom->getElementsByTagName('section');

      if ($htmlNodes->length == 0) {
          $this->assertFalse(TRUE);
      } else {
          $this->assertTrue(TRUE);
      }

      Remember: If you use an HTML5 doctype and a meta element like so

      <meta charset="utf-8">

      your HTML code will be interpreted as ISO-8859-something and non-ASCII characters will be converted into HTML entities. However, the HTML4-like version will work (as was pointed out 10 years ago by "bigtree at 29a"):

      <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

      It should be noted that when any text is provided within the body tag outside of a containing element, DOMDocument will encapsulate that text in a paragraph tag (<p></p>).

      For those of you who want to get an element from an external URL by its class, I have 2 useful functions. In this example we get the elements with class 'r' (search result headers) from a Google search:

      1. Check the URL (whether it exists and is reachable)
      # URL check
      function url_check($url) {
          $headers = @get_headers($url);
          // true-ish only when the first header line reports a 2xx status
          return is_array($headers) ? preg_match('/^HTTP\/\d+\.\d+\s+2\d\d\s+.*$/', $headers[0]) : false;
      }

      2. Clean the element you want to get (remove all tags, tabs, newlines, etc.)
      # Function to clean a string
      function clean($text) {
          // strip tags, collapse whitespace, replace ';' with '-', trim, then decode entities
          $clean = html_entity_decode(trim(str_replace(';', '-', preg_replace('/\s+/S', ' ', strip_tags($text)))));
          return $clean;
      }

      After that, we can output the search result headers as follows:
      $searchstring = 'djceejay';
      $url = 'http://www.google.de/webhp#q=' . $searchstring;
      if (url_check($url)) {
          $doc = new DOMDocument;
          $doc->validateOnParse = true;
          $doc->loadHTML(file_get_contents($url));
          // DOMDocument has no getElementByClass() method, so query by class with DOMXPath
          $xpath = new DOMXPath($doc);
          $nodes = $xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' r ')]");
          foreach ($nodes as $node) {
              echo clean($node->textContent) . "\n";
          }
      } else {
          echo 'URL not reachable!'; // shown when the URL cannot be fetched
      }

      Be aware that this function doesn't actually understand HTML: it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup, but it has no idea which element contexts are allowed.

      For example, with input like this where the first element isn't closed:

      <span>hello <div>world</div>

      loadHTML will change it to this, which is well-formed but invalid:

      <span>hello <div>world</div></span>
