PHP: parsing HTTP headers

Contents

kopiro / parse-headers.php
Parsing Response Headers in PHP
File- functions
How the benchmarks were performed
cURL
Update based on your comments
Conclusion
gonejack / $http_response_header_parser.php

kopiro / parse-headers.php


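$headers = [];
// Take everything before the blank line that ends the header block, one header per line.
foreach (explode("\r\n", substr($response, 0, strpos($response, "\r\n\r\n"))) as $i => $line) {
    // Skip the status line.
    if ($i === 0) continue;
    // A limit of 2 keeps values that themselves contain ": " intact.
    list($key, $value) = explode(': ', $line, 2);
    $headers[$key] = $value;
}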

This code doesn’t handle the case of multiple header lines. For example, multiple Set-Cookie lines.

Added support for repeated headers and wrapped it in a function:

function parse_headers($headers)
{
    $results = [];
    foreach (array_filter(explode("\r\n", trim($headers))) as $line) {
        list($key, $value) = explode(':', $line, 2);
        $key = trim($key);
        $value = trim($value);
        if (isset($results[$key])) {
            // Repeated header: collect all values in an array
            if (is_array($results[$key])) $results[$key][] = $value;
            else $results[$key] = [$results[$key], $value];
        } else {
            $results[$key] = $value;
        }
    }
    return $results;
}

$headers = parse_headers(substr($response, 0, strpos($response, "\r\n\r\n")));

Source

Parsing Response Headers in PHP

When using file_get_contents to perform HTTP requests, the server response headers are stored in a reserved variable after each successful request; we can iterate over this when we need to access individual response headers.

When you use the built-in PHP file- functions to perform HTTP requests, the response headers are automatically made available in a special variable, $http_response_header. This is very useful with the built-in file- functions, but it does not work with the cURL library. In this tutorial, you will learn how to parse the response headers regardless of whether you are using cURL or file- functions such as file_get_contents.

It is a bit strange that the headers are stored as an indexed array, instead of a more user-friendly associative array; but this is just a small inconvenience, since we can easily convert to an associative array on our own.

To parse the response headers and create an associative array, we can use this solution for the built-in file- functions:

$response_headers = [];
$status_message = array_shift($http_response_header);
foreach ($http_response_header as $value) {
    $matches = explode(':', $value, 2);
    // Skip lines that are not "name: value" pairs
    if (count($matches) === 2) {
        $response_headers[$matches[0]] = trim($matches[1]);
    }
}

And this one when using the cURL library:

// Define the $response_headers array for later use
$response_headers = [];

// Get the first line (the status line)
$line = strtok($headers, "\r\n");
$status_code = trim($line);

// Parse the rest of the string, saving it into the array
while (($line = strtok("\r\n")) !== false) {
    $matches = explode(':', $line, 2);
    // Skip lines that are not "name: value" pairs
    if (count($matches) === 2) {
        $response_headers[$matches[0]] = trim($matches[1]);
    }
}

Doing this makes it possible to easily check if a given header exists, simply by using isset on the array key:

if (isset($response_headers["content-type"])) {
    echo 'The "content-type" header was found, and the content is: ';
    echo $response_headers["content-type"];
    exit();
}
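One caveat with the isset check: array keys are case-sensitive, while servers may send the same header as Content-Type or content-type. The following is a minimal sketch, assuming the headers have already been parsed into an associative array; normalize_header_names is a hypothetical helper name, not part of the original tutorial.

```php
<?php
// Hypothetical helper (not from the article): lowercase every header name
// so that lookups do not depend on how the server capitalized it.
function normalize_header_names(array $headers): array
{
    return array_change_key_case($headers, CASE_LOWER);
}

// Sample data standing in for a parsed response.
$response_headers = [
    'Content-Type'   => 'text/html; charset=UTF-8',
    'Content-Length' => '1024',
];

$normalized = normalize_header_names($response_headers);

if (isset($normalized['content-type'])) {
    echo $normalized['content-type'], "\n";
}
```

If two headers differ only in case, the later one overwrites the earlier one, so this sketch is only suitable when each header appears once.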

File- functions

As mentioned earlier, to obtain the response headers using PHP's built-in file- functions, you can iterate over the $http_response_header variable; PHP automatically makes the response headers available to you as an indexed array in this variable after you have performed an HTTP request.

The functions used to make HTTP requests using the file-functions commonly include file_get_contents, stream_context_create, and stream_get_contents — how to use them is covered in other tutorials.

The first element in the $http_response_header array is always the HTTP status line (for example, HTTP/1.1 200 OK); even when reading the raw headers, the status line always comes first. It may be useful to store it in a separate variable:

$status_message = array_shift($http_response_header);

The array_shift function serves two purposes here:

It returns the first element in the array.
It removes the first element from the array.
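The value returned by array_shift is the whole status line, not just the numeric code. A minimal sketch of splitting it further, assuming a line shaped like "HTTP/1.1 404 Not Found"; parse_status_line and the sample array $lines are hypothetical names introduced only for this illustration.

```php
<?php
// Hypothetical helper: split a status line such as "HTTP/1.1 404 Not Found"
// into protocol version, numeric code and reason phrase.
// Returns null when the line does not look like a status line.
function parse_status_line(string $status_line): ?array
{
    $pattern = '~^HTTP/(\d+(?:\.\d+)?)\s+(\d{3})(?:\s+(.*))?$~';
    if (!preg_match($pattern, trim($status_line), $m)) {
        return null;
    }
    return [
        'version' => $m[1],
        'code'    => (int) $m[2],
        'reason'  => $m[3] ?? '', // the reason phrase may be absent
    ];
}

// Sample data standing in for $http_response_header.
$lines = ['HTTP/1.1 404 Not Found', 'Content-Type: text/html'];

$status = parse_status_line(array_shift($lines));
echo $status['code'], ' ', $status['reason'], "\n";
```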

Note. While it is easier to use a regular expression, it is about 10% to 68% faster to use either a combination of stripos and substr or explode — not that it matters much in practice — but nevertheless, I think we should stick to what is fastest.

All the following approaches are fairly straightforward to use, so the choice should probably come down to whichever is the most efficient.

Solution 1: Using the explode function is about 68% faster than using preg_match:

$headers = array();
$status_message = array_shift($http_response_header);
foreach ($http_response_header as $value) {
    $matches = explode(':', $value, 2);
    // Skip lines that are not "name: value" pairs
    if (count($matches) === 2) {
        $headers[$matches[0]] = trim($matches[1]);
    }
}

Solution 2: This is how to use stripos and substr, which is about 10% faster than using a regular expression:

$headers = array();
$status_message = array_shift($http_response_header);
foreach ($http_response_header as $value) {
    $pos = stripos($value, ':');
    if ($pos === false) continue; // not a "name: value" line
    $key = substr($value, 0, $pos);
    $headers[$key] = trim(substr($value, $pos + 1));
}
print_r($headers);

Solution 3: If you for some reason prefer to use a regular expression, feel free to do so; the speed difference will be insignificant for most websites. Keep in mind, though, that the more concurrent users you have, the more you will benefit from even minor optimizations. Here is how to use preg_match to do the same:

$headers = array();
$status_message = array_shift($http_response_header);
foreach ($http_response_header as $value) {
    // Group 1 is the header name, group 2 the value
    if (preg_match('/^([^:]+):([^\n]+)/', $value, $matches)) {
        $headers[$matches[1]] = trim($matches[2]);
    }
}
print_r($headers);

Of course we could also get rid of the trim function when using regular expressions; I only left it there to get a fair benchmark result.

How the benchmarks were performed

$start_time = microtime(true);
$repeat = 0;
// Remove the status line once, before the timed loop
$status_message = array_shift($http_response_header);
while ($repeat < 1000000) {
    $headers = array();
    foreach ($http_response_header as $value) {
        $matches = explode(':', $value, 2);
        if (count($matches) === 2) {
            $headers[$matches[0]] = trim($matches[1]);
        }
    }
    ++$repeat;
}
$end_time = microtime(true);
echo $end_time - $start_time . "\n\n";
var_dump($headers);
exit();
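To compare the three solutions under the same conditions, the timing loop can be wrapped in a small harness. This is only a sketch: time_parser and the $parsers array are hypothetical names, the sample header lines are made up, and the absolute numbers will vary by PHP version and machine, so no expected timings are shown.

```php
<?php
// Hypothetical harness: run a parser repeatedly and return elapsed seconds.
function time_parser(callable $parser, array $lines, int $iterations): float
{
    $start = hrtime(true); // nanoseconds, monotonic clock
    for ($i = 0; $i < $iterations; $i++) {
        $parser($lines);
    }
    return (hrtime(true) - $start) / 1e9;
}

// The three approaches from the text, each as a closure over header lines.
$parsers = [
    'explode' => function (array $lines): array {
        $headers = [];
        foreach ($lines as $line) {
            $parts = explode(':', $line, 2);
            if (count($parts) === 2) {
                $headers[$parts[0]] = trim($parts[1]);
            }
        }
        return $headers;
    },
    'strpos' => function (array $lines): array {
        $headers = [];
        foreach ($lines as $line) {
            $pos = strpos($line, ':');
            if ($pos !== false) {
                $headers[substr($line, 0, $pos)] = trim(substr($line, $pos + 1));
            }
        }
        return $headers;
    },
    'preg_match' => function (array $lines): array {
        $headers = [];
        foreach ($lines as $line) {
            if (preg_match('/^([^:]+):(.*)$/', $line, $m)) {
                $headers[$m[1]] = trim($m[2]);
            }
        }
        return $headers;
    },
];

// Made-up sample lines (status line already removed).
$lines = [
    'Date: Thu, 14 Oct 2010 09:46:18 GMT',
    'Content-Length: 13844',
    'Content-Type: image/png',
];

foreach ($parsers as $name => $parser) {
    printf("%-10s %.4f s\n", $name, time_parser($parser, $lines, 100000));
}
```

Keeping the input array untouched between runs matters: every iteration then parses the same set of lines, so the timings are comparable.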

cURL

Obtaining the response headers with cURL is more difficult, since they are not immediately available, unlike when you use the built-in file- functions. Instead, we have to manually extract the headers from the response (which only contains them if the CURLOPT_HEADER option was enabled). This can be done with the curl_getinfo function after performing a request:

$header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
$headers = substr($response, 0, $header_size);
$body = substr($response, $header_size);
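For context, here is a sketch of the request that would produce $response and $ch. The function names fetch_with_headers and split_response are hypothetical, and the demonstration at the bottom uses a hand-written response string so that it runs without network access.

```php
<?php
// Hypothetical helper: cut a raw response into header block and body.
function split_response(string $response, int $header_size): array
{
    return [substr($response, 0, $header_size), substr($response, $header_size)];
}

// Hypothetical wrapper around the cURL calls; not invoked in the demo below.
function fetch_with_headers(string $url): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
    curl_setopt($ch, CURLOPT_HEADER, true);         // include the header block in the returned string
    $response = curl_exec($ch);
    if ($response === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    $header_size = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    return split_response($response, $header_size);
}

// Offline demonstration with a made-up response.
$raw_headers = "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n";
$raw = $raw_headers . 'hello';

[$headers, $body] = split_response($raw, strlen($raw_headers));
echo $body, "\n";
```

Using the reported header size rather than searching for the first blank line avoids cutting at the wrong place when the response contains more than one header block, for example after a redirect.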

The headers are then stored as a string in the $headers variable. In order to create an associative array from the string, you can iterate over each line in the string, saving its contents into the array as you go:

// Define the $response_headers array for later use
$response_headers = [];

// Get the first line (the status line)
$line = strtok($headers, "\r\n");
$status_code = trim($line);

// Parse the rest of the string, saving it into the array
while (($line = strtok("\r\n")) !== false) {
    $matches = explode(':', $line, 2);
    // Skip lines that are not "name: value" pairs
    if (count($matches) === 2) {
        $response_headers[$matches[0]] = trim($matches[1]);
    }
}

Update based on your comments

I have been testing various situations when parsing raw cURL headers, and I put together the following test. It is still a work in progress, but it may be useful to follow even in its current early state. I am busy with work, so I have not had time to finish it yet. HTTP headers used to be allowed to fold onto the next line, although this is now discouraged. I do not know if this is a problem when relying on PHP's built-in $http_response_header, but it could be when using cURL; I have not yet tested that. According to php.net, header lines longer than 1024 characters will be ignored. The following test code is for cURL, and should account for folded lines as well as headers with no name (headers that just start with ":"). I'll probably return to fine-tune this later if needed.

// C1. Normal headers
$headers_str = "200 Ok\r\n"
    // C2. Normal headers
    . "test: haaa\r\n"
    . "test2: haaa2\r\n"
    // C3. Folding of header lines; Note. If I understand the spec correctly, folded lines start with a single blank space
    . "date: Thu, 23 Sep \n 2021 06:25:14 GMT\r\n"
    // C4. Malformed header that results in an empty name
    . ": test\r\n";

// Define the $response_headers array for later use
$response_headers = [];

// C1. Get the first line (The Status Code)
$line = strtok($headers_str, "\r\n");
$status_code = trim($line);
$last_header = null;

// C2. Parse the string, saving it into an array instead
while (($line = strtok("\r\n")) !== false) {
    // C3. If the header is folded over multiple lines
    if ($line[0] !== ' ') {
        if ((false !== ($matches = explode(':', $line, 2)))
            // C4. Ignore if name is empty
            && ($matches[0] !== '')) {
            $response_headers["{$matches[0]}"] = trim($matches[1]);
            $last_header = $matches[0];
        }
    } else {
        if ($last_header !== null) {
            $response_headers["$last_header"] .= $line;
        }
    }
}

echo "\n$status_code\n\n";
print_r($response_headers);

Conclusion

It does not seem to matter much in practice whether we use string functions or regular expressions, at least not for simple tasks such as this. But I still recommend using what is known to be the fastest option. In this case, without the trim function, using preg_match is only about 10% slower than stripos and substr, but a massive 68% slower than using explode. While this may look like a lot (and under some circumstances it is), we should keep in mind that benchmark tests often execute a script more than a million times in order to get a clearer picture. This means that we would need hundreds or thousands of concurrent users before we notice any meaningful difference.

However, I will personally always pick the solution that I know to be faster, especially when the solutions are so easy to work with. preg_match is only going to be easier to read if the developer understands regular expressions; in practice this should not matter, and a good developer should be willing to learn both.

Source

gonejack / $http_response_header_parser.php

/**
 * Parse a set of HTTP headers
 *
 * @param array The php headers to be parsed
 * @param [string] The name of the header to be retrieved
 * @return A header value if a header is passed;
 *         An array with all the headers otherwise
 */
function parseHeaders(array $headers, $header = null)
{
    $output = array();

    if ('HTTP' === substr($headers[0], 0, 4)) {
        // Limit of 3 keeps multi-word status texts such as "Not Found" intact
        list(, $output['status'], $output['status_text']) = explode(' ', $headers[0], 3);
        unset($headers[0]);
    }

    foreach ($headers as $v) {
        // Limit of 2 keeps values that contain colons (dates, URLs) intact
        $h = preg_split('/:\s*/', $v, 2);
        $output[strtolower($h[0])] = $h[1];
    }

    if (null !== $header) {
        if (isset($output[strtolower($header)])) {
            return $output[strtolower($header)];
        }

        return;
    }

    return $output;
}

$http_response_header is an array like this:

Array
(
    [0] => HTTP/1.1 200 OK
    [1] => Date: Thu, 14 Oct 2010 09:46:18 GMT
    [2] => Server: Apache/2.2
    [3] => Last-Modified: Sat, 07 Feb 2009 16:31:04 GMT
    [4] => ETag: "340011c-3614-46256a9e66200"
    [5] => Accept-Ranges: bytes
    [6] => Content-Length: 13844
    [7] => Vary: User-Agent
    [8] => Expires: Thu, 15 Apr 2020 20:00:00 GMT
    [9] => Connection: close
    [10] => Content-Type: image/png
)

This is useful when you want to know what happened after calling file_get_contents(), but it is not that easy to get the status code or status text, since the array is just a raw list of HTTP header lines. So here is a parser (found on the internet) that makes it easier to get a specific header.
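One situation a parser like this does not cover: as far as I know, when file_get_contents follows redirects, the array can contain the header lines of every response in the chain, each starting with its own status line. A minimal sketch that keeps only the last response before parsing; last_response_headers is a hypothetical helper, and the sample array is made up.

```php
<?php
// Hypothetical helper: keep only the header lines of the final response,
// assuming each response block starts with a line beginning with "HTTP/".
function last_response_headers(array $lines): array
{
    $lines = array_values($lines); // make sure keys are 0..n-1
    $start = 0;
    foreach ($lines as $i => $line) {
        if (strncmp($line, 'HTTP/', 5) === 0) {
            $start = $i; // remember the most recent status line
        }
    }
    return array_slice($lines, $start);
}

// Made-up sample: a redirect followed by the final response.
$all_lines = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: /new',
    'HTTP/1.1 200 OK',
    'Content-Type: text/html',
];

$final = last_response_headers($all_lines);
print_r($final);
```

The result starts with a status line again, so it can be passed to a parser that expects the status line at index 0.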

Source