Commit 20730487 authored by Lukas Rieder

updated README and documentation

parent 67b7193a
@@ -2,7 +2,7 @@
Bindings for lexborisov's [myhtml](https://github.com/lexborisov/myhtml).
* Available as a hex package: `{:myhtmlex, "~> 0.1.0"}`
* Available as a hex package: `{:myhtmlex, "~> 0.2.0"}`
* [Documentation](https://hexdocs.pm/myhtmlex/Myhtmlex.html)
## Example
@@ -10,42 +10,85 @@ Bindings for lexborisov's [myhtml](https://github.com/lexborisov/myhtml).
iex> Myhtmlex.decode("<h1>Hello world</h1>")
{"html", [], [{"head", [], []}, {"body", [], [{"h1", [], ["Hello world"]}]}]}
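The decoded result above is a nested `{tag, attrs, children}` tuple in which text nodes are plain binaries. As a minimal sketch of how such a tree could be consumed, the hypothetical `TreeWalk` module below (not part of myhtmlex) collects all text depth-first. It assumes the default output shape shown in the example; other format flags may produce node shapes this sketch does not handle.

```elixir
defmodule TreeWalk do
  # Hypothetical helper, not part of myhtmlex.
  # A text node is a plain binary: return it as-is.
  def text(node) when is_binary(node), do: node

  # An element is {tag, attrs, children}: concatenate the text of all children.
  def text({_tag, _attrs, children}) when is_list(children) do
    Enum.map_join(children, &text/1)
  end
end

# The literal below is copied from the example output of Myhtmlex.decode/1.
tree = {"html", [], [{"head", [], []}, {"body", [], [{"h1", [], ["Hello world"]}]}]}
IO.puts(TreeWalk.text(tree))
# prints "Hello world"
```

Because the sketch works on the literal tree, it runs without the library installed.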
## Thoughts
Benchmark results (Nif calling mode) on various file sizes on a 2.5 GHz Core i7:
I need a fast html-parsing library in Erlang/Elixir.
So falling back to c, and to myhtml especially, is a natural move.
Settings:
duration: 1.0 s
But Erlang interoperability is a tricky mine-field.
This increase in parsing speed does not come for free.
## FileSizesBench
[15:28:42] 1/3: github_trending_js.html 341k
[15:28:46] 2/3: w3c_html5.html 131k
[15:28:48] 3/3: wikipedia_hyperlink.html 97k
The current implementation can be considered a proof-of-concept.
The myhtml code is called as a dirty-nif and executed **inside the Erlang-VM**.
Thus completely giving up the safety of the Erlang-VM. I am not saying that myhtml is unsafe, but
the slightest Segfault brings down the whole Erlang-VM.
So, I consider this mode of operation unsafe, and **not recommended for production use**.
Finished in 7.52 seconds
The other option, that I have on my roadmap, is to call into a C-Node.
A separate OS-process that receives calls from Erlang and returns to the calling process.
## FileSizesBench
benchmark name                 iterations   average time
wikipedia_hyperlink.html 97k         1000   1385.86 µs/op
w3c_html5.html 131k                  1000   2179.30 µs/op
github_trending_js.html 341k          500   5686.21 µs/op
Another option is to call into a Port driver.
A separate OS-process that communicates via stdin/stdout.
## Configuration
So to recap, I want a **fast** and **safe** html-parsing library for Erlang/Elixir.
The module you are calling into is always `Myhtmlex` and depending on your application configuration,
it chooses between the underlying implementations `Myhtmlex.Safe` (default) and `Myhtmlex.Nif`.
Not quite there, yet.
Erlang interoperability is a tricky mine-field.
You can call into C directly using native implemented functions (Nif). But this comes with the risk,
that if anything goes wrong within the C implementation, your whole VM will crash.
No more supervisor cushions from here on, just violent crashes.
## Development
That is why the default mode of operation keeps your VM safe and happy.
If you need ultimate parsing speed, or you can simply tolerate VM-level crashes, read on.
### Call into C-Node (default)
This is the default mode of operation.
If your application cannot tolerate VM-level crashes, this option allows you to gain the best of both worlds.
The added overhead is client/server communications, and a worker OS-process that runs next to your VM under VM supervision.
You do not have to do anything to start the worker process, everything is taken care of within the library.
If you are not running in distributed mode, your VM will automatically be assigned a `sname`.
The worker OS-process stays alive as long as it is under VM-supervision. If your VM goes down, the OS-process will die by itself.
If the worker OS-process dies for some reason, your VM stays unaffected and will attempt to restart it seamlessly.
### Call into Nif
If your application is aiming for ultimate parsing speed, and in the worst case can tolerate VM-level crashes, you can call directly into the Nif.
1. Require myhtmlex without runtime
in your `mix.exs`
def deps do
[
{:myhtmlex, ">= 0.0.0", runtime: false}
]
end
2. Configure the mode to `Myhtmlex.Nif`
e.g. in `config/config.exs`
config :myhtmlex, mode: Myhtmlex.Nif
3. Bonus: You can [open up in-memory references to parsed trees](https://hexdocs.pm/myhtmlex/Myhtmlex.html#open/1), without parsing + mapping erlang terms in one go
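The bonus step above can be sketched as follows. This is a usage fragment, not a definitive recipe: it assumes the dependency and `config :myhtmlex, mode: Myhtmlex.Nif` from steps 1 and 2 are in place and that the NIF library has been compiled and loaded. It uses only `Myhtmlex.open/1` and `Myhtmlex.decode_tree/1`, both of which are documented as Nif only.

```elixir
# Assumes config/config.exs contains: config :myhtmlex, mode: Myhtmlex.Nif
# Parse once and keep an in-memory reference to the myhtml tree.
ref = Myhtmlex.open("<h1>Hello world</h1>")

# Map the referenced tree to Erlang terms only when needed.
# Per the docs, the result has the same shape as the output of decode/1.
tree = Myhtmlex.decode_tree(ref)
IO.inspect(tree)
```

Splitting parsing from term mapping this way is only available in Nif mode; in the default C-Node mode, `Myhtmlex.decode/1` remains the entry point.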
## Contribution / Bug Reports
* Please make sure you do `git submodule update` after a checkout/pull
* If you have problems building the project, please consider adding a Dockerfile to `build-tests/` to replicate the build error
* The project aims to be fully tested
## Status
## Roadmap
Currently under development.
The exposed functions on `Myhtmlex` are not subject to change.
This project is under active development.
* [x] Parse an HTML document into a tree
* [ ] Expose node-retrieval functions
* [ ] Investigate safety and calling options
* [x] Parse an HTML document into a tree
* [x] Investigate safety and calling options
* [x] Call as dirty-nif
* [x] Call as C-Node (check branch `c-node`)
* [ ] Call as Port driver
@@ -10,7 +10,7 @@ defmodule Myhtmlex do
iex> Myhtmlex.decode("<h1>Hello world</h1>")
{"html", [], [{"head", [], []}, {"body", [], [{"h1", [], ["Hello world"]}]}]}
Benchmark results on various file sizes on a 2.5 GHz Core i7:
Benchmark results (Nif calling mode) on various file sizes on a 2.5 GHz Core i7:
Settings:
duration: 1.0 s
@@ -28,29 +28,52 @@ defmodule Myhtmlex do
w3c_html5.html 131k 1000 2179.30 µs/op
github_trending_js.html 341k 500 5686.21 µs/op
## Thoughts
## Configuration
I need a fast html-parsing library in Erlang/Elixir.
So falling back to c, and to myhtml especially, is a natural move.
The module you are calling into is always `Myhtmlex` and depending on your application configuration,
it chooses between the underlying implementations `Myhtmlex.Safe` (default) and `Myhtmlex.Nif`.
But Erlang interoperability is a tricky mine-field.
This increase in parsing speed does not come for free.
Erlang interoperability is a tricky mine-field.
You can call into C directly using native implemented functions (Nif). But this comes with the risk,
that if anything goes wrong within the C implementation, your whole VM will crash.
No more supervisor cushions from here on, just violent crashes.
The current implementation can be considered a proof-of-concept.
The myhtml code is called as a dirty-nif and executed **inside the Erlang-VM**.
Thus completely giving up the safety of the Erlang-VM. I am not saying that myhtml is unsafe, but
the slightest Segfault brings down the whole Erlang-VM.
So, I consider this mode of operation unsafe, and **not recommended for production use**.
That is why the default mode of operation keeps your VM safe and happy.
If you need ultimate parsing speed, or you can simply tolerate VM-level crashes, read on.
The other option, that I have on my roadmap, is to call into a C-Node.
A separate OS-process that receives calls from Erlang and returns to the calling process.
### Call into C-Node (default)
Another option is to call into a Port driver.
A separate OS-process that communicates via stdin/stdout.
This is the default mode of operation.
If your application cannot tolerate VM-level crashes, this option allows you to gain the best of both worlds.
The added overhead is client/server communications, and a worker OS-process that runs next to your VM under VM supervision.
So to recap, I want a **fast** and **safe** html-parsing library for Erlang/Elixir.
You do not have to do anything to start the worker process, everything is taken care of within the library.
If you are not running in distributed mode, your VM will automatically be assigned a `sname`.
Not quite there, yet.
The worker OS-process stays alive as long as it is under VM-supervision. If your VM goes down, the OS-process will die by itself.
If the worker OS-process dies for some reason, your VM stays unaffected and will attempt to restart it seamlessly.
### Call into Nif
If your application is aiming for ultimate parsing speed, and in the worst case can tolerate VM-level crashes, you can call directly into the Nif.
1. Require myhtmlex without runtime
in your `mix.exs`
def deps do
[
{:myhtmlex, ">= 0.0.0", runtime: false}
]
end
2. Configure the mode to `Myhtmlex.Nif`
e.g. in `config/config.exs`
config :myhtmlex, mode: Myhtmlex.Nif
3. Bonus: You can [open up in-memory references to parsed trees](https://hexdocs.pm/myhtmlex/Myhtmlex.html#open/1), without parsing + mapping erlang terms in one go
"""
@type tag() :: String.t | atom()
@@ -127,7 +150,7 @@ defmodule Myhtmlex do
end
@doc """
Returns a reference to an internally parsed myhtml_tree_t.
Returns a reference to an internally parsed myhtml_tree_t. (Nif only!)
"""
@spec open(String.t) :: reference()
def open(bin) do
@@ -135,7 +158,7 @@ defmodule Myhtmlex do
end
@doc """
Returns a tree representation from the given reference. See `decode/1` for example output.
Returns a tree representation from the given reference. See `decode/1` for example output. (Nif only!)
"""
@spec decode_tree(reference()) :: tree()
def decode_tree(ref) do
@@ -143,7 +166,7 @@ defmodule Myhtmlex do
end
@doc """
Returns a tree representation from the given reference. See `decode/2` for options and example output.
Returns a tree representation from the given reference. See `decode/2` for options and example output. (Nif only!)
"""
@spec decode_tree(reference(), format: [format_flag()]) :: tree()
def decode_tree(ref, format: flags) do
......
@@ -9,23 +9,18 @@ defmodule Myhtmlex.Nif do
:ok = :erlang.load_nif(path, 0)
end
@doc false
def decode(bin)
def decode(_), do: exit(:nif_library_not_loaded)
@doc false
def decode(bin, flags)
def decode(_, _), do: exit(:nif_library_not_loaded)
@doc false
def open(bin)
def open(_), do: exit(:nif_library_not_loaded)
@doc false
def decode_tree(tree)
def decode_tree(_), do: exit(:nif_library_not_loaded)
@doc false
def decode_tree(tree, flags)
def decode_tree(_, _), do: exit(:nif_library_not_loaded)
end
......
defmodule Myhtmlex.Safe do
@moduledoc """
Safely decode html using a C-Node. Any problem with myhtml and the c-binding will not affect the Erlang VM.
"""
@moduledoc false
use Application
app = Mix.Project.config[:app]
@doc false
def start(_type, _args) do
import Supervisor.Spec
unless Node.alive? do
@@ -20,12 +17,10 @@ defmodule Myhtmlex.Safe do
Supervisor.start_link(children, strategy: :one_for_one, name: Myhtmlex.Safe.Supervisor)
end
@doc false
def decode(bin) do
decode(bin, [])
end
@doc false
def decode(bin, flags) do
{:ok, res} = Nodex.Cnode.call(__MODULE__, {:decode, bin, flags})
res
......