Parsing JavaScript with Python

Fast JavaScript parser for Python.

PiotrDabkowski/pyjsparser


Fast JavaScript parser, a manual translation of esprima.js to Python. It takes about 1 second to parse the whole angular.js library, i.e. roughly 100k characters per second, which makes it one of the fastest and most comprehensible JavaScript parsers for Python out there.

Supports the whole of ECMAScript 5.1 and parts of ECMAScript 6. The generated AST follows the Esprima format, so the Esprima AST documentation applies.

>>> from pyjsparser import parse
>>> parse('var $ = "Hello!"')
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": {"type": "Identifier", "name": "$"},
          "init": {"type": "Literal", "value": "Hello!", "raw": '"Hello!"'}
        }
      ],
      "kind": "var"
    }
  ]
}
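Since parse() returns plain Python dicts and lists, the resulting AST can be traversed with a small recursive helper. A minimal sketch, using the AST from the example above written out as a literal (no pyjsparser call involved, so the shape shown is an assumption based on that example):

```python
def walk(node):
    """Yield every dict node in a pyjsparser-style AST (nested dicts/lists)."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            for sub in walk(value):
                yield sub
    elif isinstance(node, list):
        for item in node:
            for sub in walk(item):
                yield sub

# The AST shown above for: var $ = "Hello!"
ast = {
    "type": "Program",
    "body": [{
        "type": "VariableDeclaration",
        "declarations": [{
            "type": "VariableDeclarator",
            "id": {"type": "Identifier", "name": "$"},
            "init": {"type": "Literal", "value": "Hello!", "raw": '"Hello!"'},
        }],
        "kind": "var",
    }],
}

names = [n["name"] for n in walk(ast) if n.get("type") == "Identifier"]
literals = [n["value"] for n in walk(ast) if n.get("type") == "Literal"]
```

The same walker works on the output of a real parse() call, since the node shapes are identical.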


Source

How do I parse an HTML page with JavaScript in Python 3?

To extract static data from HTML or JavaScript text, you can use the corresponding parsers, such as BeautifulSoup or slimit. Example: how to search by a keyword with Beautiful Soup when that word is inside a script tag.

To get information from a web page whose elements are generated dynamically by JavaScript, you can use a web browser. To drive different browsers from Python, selenium webdriver helps: there is an example that shows the GUI. There are other libraries as well, for example marionette (Firefox) and pyppeteer (Chrome, a puppeteer API for Python); there is an example of taking a screenshot of a web page using these libraries. To get the HTML of a page without showing a GUI, you can run "headless" Google Chrome, also via selenium:

from selenium import webdriver  # $ pip install selenium

options = webdriver.ChromeOptions()
options.add_argument('--headless')
# get chromedriver from
# https://sites.google.com/a/chromium.org/chromedriver/downloads
browser = webdriver.Chrome(chrome_options=options)
browser.get('https://ru.stackoverflow.com/q/749943')
# ... other actions
generated_html = browser.page_source
browser.quit()

This interface lets you automate user actions (key presses, button clicks, finding elements on the page by various criteria, etc.). It is useful to split the analysis into two parts: first download the dynamically generated information from the network with the browser and save it (it may contain redundant data), and then analyze the now-static content in detail to extract only the necessary parts (possibly offline, in another process, with the same BeautifulSoup). For example, to find links to related questions on the saved page:

from bs4 import BeautifulSoup

soup = BeautifulSoup(generated_html, 'html.parser')
h = soup.find(id='h-related')
related = [a['href'] for a in h.find_all('a', 'question-hyperlink')]
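The save-then-analyze split described above can be sketched with the standard library alone. The file name, the sample HTML (which stands in for the browser's page_source), and the class name are hypothetical, chosen to echo the BeautifulSoup example:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes of <a class="question-hyperlink"> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'a' and 'question-hyperlink' in (attrs.get('class') or ''):
            self.links.append(attrs.get('href'))

# Phase 1 (with the browser): save the generated page once.
# Here a tiny hand-written snippet stands in for browser.page_source.
generated_html = '<div id="h-related"><a class="question-hyperlink" href="/q/1">Q1</a></div>'
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(generated_html)

# Phase 2 (offline, possibly another process): analyze the saved copy
collector = LinkCollector()
with open('page.html', encoding='utf-8') as f:
    collector.feed(f.read())
```

After this, collector.links holds the extracted hrefs; the point is that phase 2 needs no network and no browser.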

If the site provides an API (official, or discovered by watching the network requests its JavaScript makes: there is an example for fifa.com), that can be preferable to pulling information out of the UI elements of a web page: see the example using the Stack Exchange API.

You will often encounter REST or GraphQL APIs, which are convenient to use with requests or with specialized libraries (there are code examples for the GitHub API).
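A sketch of that requests-style workflow. The response body is inlined here instead of a live requests.get call (shown only in a comment), and the sample fields and values are illustrative, not real API output:

```python
import json

# In real code, something like:
#   resp = requests.get('https://api.github.com/repos/PiotrDabkowski/pyjsparser')
#   body = resp.text
# Here we inline a trimmed, hand-written sample of such a JSON response:
body = '{"full_name": "PiotrDabkowski/pyjsparser", "stargazers_count": 200, "language": "Python"}'

# Parse the JSON body into plain Python objects and pick out fields
repo = json.loads(body)
summary = "{0} ({1})".format(repo["full_name"], repo["language"])
```

Compared with scraping rendered HTML, the API response is stable structured data, so no browser and no HTML parser are needed.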

Source

JavaScript parser in Python [closed]


There is a JavaScript parser at least in C and Java (Mozilla), in JavaScript (Mozilla again) and in Ruby. Is there one currently out there for Python? I don’t need a JavaScript interpreter, per se, just a parser that’s up to ECMA-262 standards. A quick Google search revealed no immediate answers, so I’m asking the SO community.

5 Answers

Nowadays, there is at least one better tool, called slimit:

SlimIt is a JavaScript minifier written in Python. It compiles JavaScript into more compact code so that it downloads and runs faster.

SlimIt also provides a library that includes a JavaScript parser, lexer, pretty printer and a tree visitor.

Imagine we have a JavaScript $.ajax(...) call like the one embedded in the data string below, and we need to get the email, phone and name values from its data object.

The idea is to instantiate a slimit parser, visit all nodes, filter out the assignments, and put them into a dictionary:

from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

data = """
$.ajax({
  type: "POST",
  url: 'http://www.example.com',
  data: {
    email: 'abc@g.com',
    phone: '9999999999',
    name: 'XYZ'
  }
});
"""

parser = Parser()
tree = parser.parse(data)
# visit every node, keep the assignments, and map "left: right" pairs
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print(fields)

ANTLR, ANother Tool for Language Recognition, is a language tool that provides a framework for constructing recognizers, interpreters, compilers, and translators from grammatical descriptions containing actions in a variety of target languages.

The ANTLR site provides many grammars, including one for JavaScript.

As it happens, there is a Python API available — so you can call the lexer (recognizer) generated from the grammar directly from Python (good luck).

I have translated esprima.js to Python:

>>> from pyjsparser import parse
>>> parse('var $ = "Hello!"')
{
  "type": "Program",
  "body": [
    {
      "type": "VariableDeclaration",
      "declarations": [
        {
          "type": "VariableDeclarator",
          "id": {"type": "Identifier", "name": "$"},
          "init": {"type": "Literal", "value": "Hello!", "raw": '"Hello!"'}
        }
      ],
      "kind": "var"
    }
  ]
}

It’s a manual translation, so it's very fast: it takes about 1 second to parse the angular.js file (about 100k characters per second). It supports the whole of ECMAScript 5.1 and parts of version 6, for example arrow functions, const, and let.

If you need support for all the newest ES6 features, you can translate esprima on the fly with Js2Py:

import js2py

esprima = js2py.require("esprima@4.0.1")
tree = esprima.parse("a = () => {return 11};")
# tree is an Esprima AST: a Program whose ExpressionStatement holds an
# AssignmentExpression assigning an ArrowFunctionExpression; the arrow
# function's body is a BlockStatement containing a ReturnStatement

Tested, and it works pretty well. You can use it to, for example, reconstruct some JSON data from the scripts of pages captured by crawlers.
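That reconstruction step can be sketched as follows. The node below mirrors the Esprima/pyjsparser layout of an ObjectExpression, but it is written out by hand as an assumption, not produced by an actual parse call:

```python
def literal_object(node):
    """Rebuild a plain dict from an Esprima-style ObjectExpression node
    whose keys are identifiers/strings and whose values are literals."""
    out = {}
    for prop in node["properties"]:
        # identifier keys carry "name"; string-literal keys carry "value"
        key = prop["key"].get("name", prop["key"].get("value"))
        out[key] = prop["value"]["value"]
    return out

# Hand-written node in the shape a parser gives for:
#   {email: 'abc@g.com', phone: '9999999999'}
obj = {
    "type": "ObjectExpression",
    "properties": [
        {"type": "Property",
         "key": {"type": "Identifier", "name": "email"},
         "value": {"type": "Literal", "value": "abc@g.com"}},
        {"type": "Property",
         "key": {"type": "Identifier", "name": "phone"},
         "value": {"type": "Literal", "value": "9999999999"}},
    ],
}
data = literal_object(obj)
```

The resulting dict can then be dumped with json.dumps, which is the "reconstruct JSON from crawled scripts" use case mentioned above.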

As pib mentioned, pynarcissus is a JavaScript tokenizer written in Python. It has some rough edges, but so far it has been working well for what I want to accomplish.

Update: I took another crack at pynarcissus, and below is a working direction for using PyNarcissus in a visitor-pattern-like system. Unfortunately my current client bought the next iteration of my experiments and has decided not to make it public source. A cleaner version of the code below is available as a gist.

from pynarcissus import jsparser
from collections import defaultdict


class Visitor(object):
    CHILD_ATTRS = ['thenPart', 'elsePart', 'expression', 'body', 'initializer']

    def __init__(self, filepath):
        self.filepath = filepath
        # List of functions by line # and set of names
        self.functions = defaultdict(set)
        with open(filepath) as myFile:
            self.source = myFile.read()
        self.root = jsparser.parse(self.source, self.filepath)
        self.visit(self.root)

    def look4Childen(self, node):
        for attr in self.CHILD_ATTRS:
            child = getattr(node, attr, None)
            if child:
                self.visit(child)

    def visit_NOOP(self, node):
        pass

    def visit_FUNCTION(self, node):
        # Named functions
        if node.type == "FUNCTION" and getattr(node, "name", None):
            print(str(node.lineno) + " | function " + node.name + " | " +
                  self.source[node.start:node.end])

    def visit_IDENTIFIER(self, node):
        # Anonymous functions declared with var name = function() {};
        try:
            if (node.type == "IDENTIFIER" and hasattr(node, "initializer")
                    and node.initializer.type == "FUNCTION"):
                print(str(node.lineno) + " | function " + node.name + " | " +
                      self.source[node.start:node.initializer.end])
        except Exception:
            pass

    def visit_PROPERTY_INIT(self, node):
        # Anonymous functions declared as a property of an object
        try:
            if node.type == "PROPERTY_INIT" and node[1].type == "FUNCTION":
                print(str(node.lineno) + " | function " + node[0].value + " | " +
                      self.source[node.start:node[1].end])
        except Exception:
            pass

    def visit(self, root):
        call = lambda n: getattr(self, "visit_%s" % n.type, self.visit_NOOP)(n)
        call(root)
        self.look4Childen(root)
        for node in root:
            self.visit(node)


filepath = r"C:\Users\dward\Dropbox\juggernaut2\juggernaut\parser\test\data\jasmine.js"
outerspace = Visitor(filepath)

Source
