
Ignite Realtime Blog: Certificate Manager plugin for Openfire release 1.1.1

news.movim.eu / PlanetJabber • 20 July, 2023

The Ignite Realtime community is happy to announce a new release of the Certificate Manager plugin for Openfire.

This plugin allows you to automate TLS certificate management tasks. This is particularly helpful when your certificates are short-lived, like the ones issued by Let’s Encrypt.

This release is a maintenance release. It adds translations. More details are available in the changelog.

Your instance of Openfire should automatically display the availability of the update in the next few hours. Alternatively, you can download the new release of the plugin at the Certificate Manager plugin archive page.

If you have any questions, please stop by our community forum or our live groupchat.

For other release announcements and news, follow us on Twitter and Mastodon.


Ignite Realtime Blog: JmxWeb plugin for Openfire 0.9.1 release

news.movim.eu / PlanetJabber • 20 July, 2023

The Ignite Realtime community is happy to announce a new release of the JmxWeb plugin for Openfire.

This plugin provides a web-based platform for managing and monitoring Openfire via JMX.

This release is a maintenance release. It adds translations and fixes one bug. More details are available in the changelog.

Your instance of Openfire should automatically display the availability of the update in the next few hours. Alternatively, you can download the new release of the plugin at the JmxWeb plugin archive page.

If you have any questions, please stop by our community forum or our live groupchat.

For other release announcements and news, follow us on Twitter and Mastodon.


Ignite Realtime Blog: Push Notification Openfire plugin 0.9.2 released

news.movim.eu / PlanetJabber • 20 July, 2023

The Ignite Realtime community is happy to announce a new release of the Push Notification plugin for Openfire.

This plugin enables clients to register for push notifications.

This release is a maintenance release. It adds translations and a configuration page. More details are available in the changelog.

Your instance of Openfire should automatically display the availability of the update in the next few hours. Alternatively, you can download the new release of the plugin at the Push Notification plugin archive page.

If you have any questions, please stop by our community forum or our live groupchat.

For other release announcements and news, follow us on Twitter and Mastodon.


Ignite Realtime Blog: Search Openfire plugin 0.7.4 release!

news.movim.eu / PlanetJabber • 20 July, 2023

The Ignite Realtime community is happy to announce a new release of the Search plugin for Openfire.

This plugin adds features to Openfire that make it easier for users to find each other.

This release is a maintenance release. It adds translations. More details are available in the changelog.

Your instance of Openfire should automatically display the availability of the update in the next few hours. Alternatively, you can download the new release of the plugin at the Search plugin archive page.

If you have any questions, please stop by our community forum or our live groupchat.

For other release announcements and news, follow us on Twitter and Mastodon.


Ignite Realtime Blog: Candy plugin for Openfire 2.2.0 Release 4 now available!

news.movim.eu / PlanetJabber • 20 July, 2023

The Ignite Realtime community is happy to announce a new release of the Openfire plugin for Candy.

Candy is a third-party chat client. The Openfire plugin makes deploying it a one-click affair!

This release is a maintenance release. It adds translations and updates dependencies on third-party libraries. More details are available in the changelog.

Your instance of Openfire should automatically display the availability of the update in the next few hours. Alternatively, you can download the new release of the plugin at the Candy plugin archive page.

If you have any questions, please stop by our community forum or our live groupchat.

For other release announcements and news, follow us on Twitter and Mastodon.


Erlang Solutions: How IoT is Revolutionising Supply Chain Management

news.movim.eu / PlanetJabber • 20 July, 2023 • 5 minutes

As global supply chains continue to face significant disruptions, many businesses are turning to IoT to access greater visibility, reactivity, and streamlined operations.

Unforeseen geopolitical conflicts, economic pressures due to inflation and severe climate change events have all contributed to an uncertain and costly supply chain environment for companies worldwide in 2023.

To soften some of these impacts, and to work towards a more intelligent, forward-thinking form of supply chain management, industry leaders continue to turn to the benefits offered by the Internet of Things, or IoT.
By embracing IoT, your company can transform a scattered supply chain into a fully connected network. In doing so, you’ll be able to access a wide range of benefits like increased visibility and superior inventory management, whilst preparing your company’s foundations for the future of distribution.

The Future of Distribution: Recent IoT Impacts on the Supply Chain

Whilst it certainly represents the future of supply chain logistics, IoT adoption across multiple different industries has already happened. A recent survey by PwC found that in 2023, 46% of companies have already invested in IoT to the point where it’s fully adopted by their supply chain, second only to cloud-based data platforms.

Adoption to win future investment in supply chains.

https://www.pwc.com/us/en/services/consulting/business-transformation/digital-supply-chain-survey/supply-chain-tech.html

When predicting the future of supply chain technology back in 2021, Gartner claimed that 50% of organisations will have invested in solutions that support AI and advanced analytical capabilities, like IoT, by 2024. They also predicted that by 2025, 50% of organisations will have created a technology leadership role that reports directly to their chief supply chain officer (CSCO).

The existence and normalisation of a CSCO role itself evidences that supply chain management now plays a vital role in c-suite level decision-making for many global businesses. By predicting that CSCOs will soon be naturally reinforced by a senior tech leader in half of all organisations, Gartner has also shown that effective supply chain management today must be intertwined with new technologies like IoT.

To better understand the accuracy and importance of this prediction, it’s vital to explore the role IoT plays in supply chain management at present.

The Role of IoT in Supply Chain Management

IoT can be applied to practically every stage of a supply chain. In fact, due to its communicative nature, it’s advisable to apply IoT across an entire supply chain to embrace the benefits of a fully connected supply network.

The first, and perhaps most well-known, role of IoT within the supply chain is its capacity to provide real-time location tracking. This is often used to allow customers to track packages en route to their destination, but internally this feature also ensures companies have total visibility over all stages of distribution.

Increasing visibility means IoT can contribute to more accurate arrival time estimations. This also means businesses can quickly react to any unexpected issues that arise within their supply chain.

In doing so, IoT can help companies achieve greater risk mitigation, whilst simultaneously providing insights that can support contingency planning. One unique example of a company benefitting from IoT risk mitigation — within both supply chain management as well as customer experience — is Volvo. They now use IoT to track vehicle delivery as well as to provide stolen vehicle tracking for customers.

Monitoring can also extend to items in storage, which is particularly important for companies shipping perishable goods. In these instances, IoT allows for visibility and control over the environmental conditions of stored packages and equipment. Finally, individual shipments can also be located, speeding up the process of sourcing, identifying and managing goods when held in warehouses or distribution centres.

Many leading companies have already embraced IoT storage monitoring; for example, Ericsson recently implemented digital asset-tracking solutions in their new 5G smart factories to track critical asset locations.

The Benefits of Utilising IoT in Supply Chain Management

The following represent a handful of the key benefits your company could access by investing in IoT across your supply chain.

• New, Visible Opportunities

Many of the roles of IoT listed above contribute towards increased visibility over your supply chain. This level of visibility doesn’t just improve resilience and streamline operations; it can also provide insights that reveal entirely new opportunities.

These could include opening the door for automation, smart packaging that enables customers to interact directly with products, or unearthing potential improvements like route remapping that can further optimise your overall chain.

• Improved Communication Internally and With Customers

The data analysis capabilities offered by IoT allow your teams to better communicate with each other, as each team can access detailed information on the current nature of your supply chain.

This extends to the communication you can offer customers, enhancing their overall customer experience thanks to clear delivery times, the ability to provide alternative arrangements and quick resolutions to problems or disruptions.

• Meeting Regulations and Sustainability Requirements

IoT can provide a digital footprint of your supply chain, which is easier to optimise and can ensure you provide accurate reporting to meet ever-changing regulations.

Being able to optimise and streamline your supply chain can also cut unnecessary emissions, which can help your company work towards more sustainable operations. Gartner’s study found that over half of today’s customers will only do business with companies that practise environmental and social sustainability, and sustainable supply chain management will only grow in importance in the coming years.

• A Cost-Effective Solution

Technology adoption, particularly of new or emerging technologies companywide, is often an expensive undertaking.
However, IoT represents a proven solution and a relatively affordable technology to implement (with future innovations likely to lower costs further), making it the ideal choice for budget-minded decision-makers.

How to Implement and Scale Supply Chain IoT

Effective IoT supply chain investment must be scalable, and accessing the above benefits demands that decision-makers solicit support from experts in the space.

Optimising an entire chain requires a reliable, proven MQTT Messaging Engine like EMQ X.

By using EMQ X, your business can connect over 50 million different devices, with the potential to handle tens of millions of concurrent clients at any one time. This makes EMQ X massively scalable, which is why it’s already the IoT supply chain management solution of choice for hundreds of leading companies worldwide.

Our Erlang Solutions IoT specialists have worked closely with EMQ X and have over 20 years of experience building real-time distributed systems. In addition to consulting on projects at any stage, we also offer regular health checks, EMQ X support services and monitoring to ensure your system remains reliable.

If you’d like to learn more about how to access Erlang Solutions supply chain optimisations through EMQ X, make sure to contact our team today.

The post How IoT is Revolutionising Supply Chain Management appeared first on Erlang Solutions.


Isode: Icon-PEP 2.0 – New Capabilities

news.movim.eu / PlanetJabber • 18 July, 2023

Icon-PEP enables the use of IP applications over HF networks, using the STANAG 5066 Link Layer as an interface.

Listed below are the changes brought in with 2.0.

Web Management

A web interface is provided which includes:

Full configuration of Icon-PEP
TLS (HTTPS) access and configuration, including bootstrap with a self-signed certificate and identity management
Control interface to enable or disable Icon-PEP
Monitoring to include:
- Access to all logging metrics
- Monitoring GRE traffic with peered routers
- Monitoring IP Client traffic to STANAG 5066
- Monitoring DNS traffic
- Monitoring TCP traffic with details of HTTP queries and responses

Profiler Enhancement

OAuth support added to control access to monitoring and configuration.

NAT Mode

A NAT (Network Address Translation) mode is introduced which supports Mobile Unit mobility for traffic initiated by the Mobile Unit. Inbound IP or SLEP (TCP) traffic will have its address mapped so that traffic on the shore side appears to come from the local node. This avoids the need for complex IP routing to support traffic to Mobile Units not using fixed IP routing.

Other Features

Product Activation, including control of the number of Units
Filtering (previously IP client only) extended to SLEP/TCP


Isode: Cobalt 1.4 – New Capabilities

news.movim.eu / PlanetJabber • 18 July, 2023 • 1 minute

Cobalt provides a web interface for provisioning users and roles in an LDAP directory. It enables the easy deployment of XMPP, Email and Military Messaging systems.

Listed below are the changes brought in with 1.4.

HSM Support

Cobalt is Isode’s tool for managing PKCS#11 Hardware Security Modules (HSM) which may be used to provide improved server security by protecting PKI private keys.

Cobalt provides a generic capability to initialize HSMs and view keys
- Multiple HSMs can be configured and one set to active
- Tested with Nitrokey, Yubikey, SoftHSM and Gemalto networked HSM
Enables key pair generation and Certificate Signing Request (CSR) interaction with Certificate Authority (CA)
Support for S/MIME signing and encryption
- User identities for email
- Organization and Role identities for military messaging
Server identities that can be used for TLS with Isode servers

Isode Servers

A new tab for Isode servers is added that:

Enables HSM identities to be provisioned
Enables a password to be set, which is needed for Isode servers that bind to directory to obtain authorization, authentication and other information
Facilitates adding Isode servers to a special directory access control group, that enables passwords (usually SCRAM hashed) to be read, to enable SCRAM and other SASL mechanisms to be used by the application

Profiler Enhancement

Extend the SIC rule so that multiple SICs or SIC patterns can be set in a single rule


Erlang Solutions: Re-implement our first blog scrapper with Crawly 0.15.0

news.movim.eu / PlanetJabber • 25 April, 2023 • 14 minutes

It has been almost four years since my first article about scraping with Elixir and Crawly was published. Since then, many changes have occurred, the most significant being the redesign of the Erlang Solutions blog. As a result, the 2019 tutorial is no longer functional.

This situation provided an excellent opportunity to update the original work and re-implement the Crawler using the new version of Crawly. By doing so, the tutorial will showcase several new features added to Crawly over the years and, more importantly, provide a functional version to the community. Hopefully, this updated tutorial will be beneficial to all.

First of all, why is it broken now?

This situation is to be expected! When a website gets a new design, usually everything is redone: the new layout results in new HTML, which makes all the old CSS/XPath selectors obsolete, not to mention the new URL schemes. As a result, the XPath/CSS selectors that worked before refer to nothing after the redesign, so we have to start from the very beginning. What a shame!

But of course, the web is made for more than just crawling. The web is made for people, not robots, so let’s adapt our robots!

Our experience from a large-scale scraping platform is that a successful business usually runs at least one complete redesign every two years. Smaller updates occur even more often, and even those can break your web scrapers.

Getting started

Usually, I recommend starting by following the Quickstart guide from Crawly’s documentation pages. However, this time I have something else in mind. I want to show you the Crawly standalone version.

Make it simple. In some cases, you need data that can be extracted from a relatively simple source. In these situations, it might be quite beneficial to avoid bootstrapping all the Elixir stuff (new project, config, libs, dependencies). The idea is to deliver data that other applications can consume without any setup.

Of course, the approach will have some limitations and only work for simple projects at this stage. Some may get inspired by this article and improve it so that the following readers will be amazed by new possibilities. In any case, let’s get straight to it now!

Bootstrapping 2.0

As promised, here is the simplified version of the setup (compare it with the previous setup described here):

1. Create a directory for your project: mkdir erlang_solutions_blog
2. Create a subdirectory that will contain the code of your spiders: mkdir erlang_solutions_blog/spiders
3. Now, knowing that we want to extract the following fields: title, author, publishing_date, url, article_body, let’s define the following configuration for your project (erlang_solutions_blog/crawly.config):


[{crawly, [
   {closespider_itemcount, 100},
   {closespider_timeout, 5},
   {concurrent_requests_per_domain, 15},

   {middlewares, [
           'Elixir.Crawly.Middlewares.DomainFilter',
           'Elixir.Crawly.Middlewares.UniqueRequest',
           'Elixir.Crawly.Middlewares.RobotsTxt',
           {'Elixir.Crawly.Middlewares.UserAgent', [
               {user_agents, [
                   <<"Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0">>,
                   <<"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36">>
                   ]
               }]
           }
       ]
   },

   {pipelines, [
           {'Elixir.Crawly.Pipelines.Validate', [{fields, [title, author, publishing_date, url, article_body]}]},
           {'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, title}]},
           {'Elixir.Crawly.Pipelines.JSONEncoder'},
           {'Elixir.Crawly.Pipelines.WriteToFile', [{folder, <<"/tmp">>}, {extension, <<"jl">>}]}
       ]
   }]
}].

You have probably noticed that this looks like an Erlang configuration file, and that is indeed the case. I would say that it’s not the perfect solution, and one possible improvement is to simplify it so that the project can be configured more easily. If you have ideas, write to me on GitHub Discussions (https://github.com/elixir-crawly/crawly/discussions).

4. The basic configuration is now done, so we can start the Crawly application like this:

docker run --name crawly \
  -d -p 4001:4001 -v $(pwd)/spiders:/app/spiders \
  -v $(pwd)/crawly.config:/app/config/crawly.config \
  oltarasenko/crawly:0.15.0

Notes:

4001 is the default HTTP port used for spider management, so we need to forward traffic to it.
The spiders directory is where Crawly expects to find the spider files that will be added to the application later on.
Finally, the ugly configuration file is also mounted inside the Crawly container.

Now you can see the Crawly management user interface at localhost:4001

Crawly Management Tool
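For readers more used to regular Mix projects, the settings from crawly.config can also be written in Elixir's own config syntax. The following is a sketch of a roughly equivalent config/config.exs, assuming a Mix project with Crawly as a dependency. It is shown only for comparison and is not needed for the standalone Docker setup used in this article.

```elixir
# config/config.exs
# Sketch for a regular Mix project; the standalone Docker image in this
# article reads the Erlang-style crawly.config instead.
import Config

config :crawly,
  closespider_itemcount: 100,
  closespider_timeout: 5,
  concurrent_requests_per_domain: 15,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    {Crawly.Middlewares.UserAgent,
     user_agents: [
       "Mozilla/5.0 (Macintosh; Intel Mac OS X x.y; rv:42.0) Gecko/20100101 Firefox/42.0",
       "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
     ]}
  ],
  pipelines: [
    {Crawly.Pipelines.Validate,
     fields: [:title, :author, :publishing_date, :url, :article_body]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "jl"}
  ]
```

The Erlang atoms such as 'Elixir.Crawly.Pipelines.Validate' and the Elixir module names such as Crawly.Pipelines.Validate refer to the same modules; only the notation differs.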

Working on a new spider

Now, let’s define the spider itself. Let’s start with the following boilerplate code (put it into erlang_solutions_blog/spiders/esl.ex ):

defmodule ESLSpider do
 use Crawly.Spider

 @impl Crawly.Spider
 def init() do
   [start_urls: ["https://www.erlang-solutions.com/"]]
 end

 @impl Crawly.Spider
 def base_url(), do: "https://www.erlang-solutions.com"

 @impl Crawly.Spider
 def parse_item(response) do
   %{items: [], requests: []}
 end
end

This code defines an “ESLSpider” module that uses the “Crawly.Spider” behavior.

The behavior requires three functions to be implemented:

init(), base_url(), and parse_item(response).

The “init()” function returns a keyword list containing a single key-value pair. The key is “start_urls” and the value is a list containing a single URL string: “https://www.erlang-solutions.com/”. This means that the spider will start crawling from this URL.

The “base_url()” function returns a string representing the base URL for the spider, used to filter out requests that go outside the erlang-solutions.com website.

The “parse_item(response)” function takes a response object as an argument and returns a map containing two keys: “items” and “requests”.
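To make the role of base_url() more concrete, here is a minimal sketch of the idea of filtering requests by site. It assumes a simple host comparison; the module DomainFilterSketch and its functions are hypothetical names made up for this illustration and are not Crawly's actual implementation.

```elixir
defmodule DomainFilterSketch do
  @moduledoc """
  Hypothetical illustration only: keeps a URL when its host matches the
  host of the spider's base URL. Not Crawly's actual implementation.
  """

  # Returns true when `url` points at the same host as `base_url`.
  def same_site?(url, base_url) do
    URI.parse(url).host == URI.parse(base_url).host
  end

  # Drops every URL whose host differs from the base URL's host.
  def filter(urls, base_url) do
    Enum.filter(urls, &same_site?(&1, base_url))
  end
end

base_url = "https://www.erlang-solutions.com"

urls = [
  "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/",
  "https://github.com/elixir-crawly/crawly/discussions"
]

kept = DomainFilterSketch.filter(urls, base_url)
IO.inspect(kept)
```

In this sketch only the first URL survives, because the second one points at a different host.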

Once the code is saved, we can run it via the web interface (you will need to either restart the Docker container or click the Reload spiders button in the web interface).

New Crawly Management UI

Once the job is started, you can review the Scheduled Requests, Logs, or Extracted Items.

Parsing the page

Now we need to find CSS selectors to extract the needed data. The same approach is already described at https://www.erlang-solutions.com/blog/web-scraping-with-elixir/ in the section on extracting the data. I think one of the best ways to find relevant CSS selectors is simply to use Google Chrome’s Inspect option.

So let’s connect to the Crawly Shell and fetch data using the fetcher, extracting this title:

docker exec -it crawly /app/bin/crawly remote

1> response = Crawly.fetch("https://www.erlang-solutions.com/blog/web-scraping-with-elixir/")
2> document = Floki.parse_document!(response.body)
3> title_tag = Floki.find(document, ".page-title-sm")
[{"h1", [{"class", "page-title-sm mb-sm"}], ["Web scraping with Elixir"]}]
4> title = Floki.text(title_tag)
"Web scraping with Elixir"

We are going to extract all items this way. In the end, we came up with the following map of selectors representing the expected item:

item =
 %{
   url: response.request_url,
   title: Floki.find(document, ".page-title-sm") |> Floki.text(),
   article_body: Floki.find(document, ".default-content") |> Floki.text(),
   author: Floki.find(document, ".post-info__author") |> Floki.text(),
   publishing_date: Floki.find(document, ".header-inner .post-info .post-info__item span") |> Floki.text()
  }

requests = Enum.map(
 Floki.find(document, ".link-to-all") |> Floki.attribute("href"),
 fn url -> Crawly.Utils.request_from_url(url) end
)

At the end of it, we came up with the following code representing the spider:

defmodule ESLSpider do
 use Crawly.Spider

 @impl Crawly.Spider
 def init() do
   [
     start_urls: [
       "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/",
       "https://www.erlang-solutions.com/blog/which-companies-are-using-elixir-and-why-mytopdogstatus/"
     ]
   ]
 end

 @impl Crawly.Spider
 def base_url(), do: "https://www.erlang-solutions.com"

 @impl Crawly.Spider
 def parse_item(response) do
   {:ok, document} = Floki.parse_document(response.body)

   requests = Enum.map(
     Floki.find(document, ".link-to-all") |> Floki.attribute("href"),
     fn url -> Crawly.Utils.request_from_url(url) end
     )

   item = %{
     url: response.request_url,
     title: Floki.find(document, ".page-title-sm") |> Floki.text(),
     article_body: Floki.find(document, ".default-content") |> Floki.text(),
     author: Floki.find(document, ".post-info__author") |> Floki.text(),
     publishing_date: Floki.find(document, ".header-inner .post-info .post-info__item span") |> Floki.text()
   }
   %{items: [item], requests: requests}
 end
end

That’s all, folks! Thanks for reading!

Well, not really. Let’s schedule this version of the spider again, and let’s see the results:

Scraping results

As you can see, the spider could only extract 34 items. This is quite interesting, as it’s pretty clear that the Erlang Solutions blog contains way more items. So why do we have only this amount? Can anything be done to improve it?

Debugging your spider

Some intelligent developers write everything just once, and everything works. Other people like me have to spend time debugging the code.

In my case, I start with exploring logs. There is something there I don’t like:

08:23:37.417 [info] Dropping item: %{article_body: "Scalable and Reliable Real-time MQTT Messaging Engine for IoT in the 5G Era.We work with proven, world leading technologies that provide a highly scalable, highly available distributed message broker for all major IoT protocols, as well as M2M and mobile applications.Available virtually everywhere with real-time system monitoring and management ability, it can handle tens of millions of concurrent clients.Today, more than 5,000 enterprise users are trusting EMQ X to connect more than 50 million devices.As well as being trusted experts in EMQ x, we also have 20 years of experience building reliable, fault-tolerant, real-time distributed systems. Our experts are able to guide you through any stage of the project to ensure your system can scale with confidence.
Whether you’re hunting for a suspected bug, or doing due diligence to future proof your system, we’re here to help. Our world-leading team will deep dive into your system providing an in-depth report of recommendations. This gives you full visibility on the vulnerabilities of your system and how to improve it. Connected devices play an increasingly vital role in major infrastructure and the daily lives of the end user. To provide our clients with peace of mind, our support agreements ensure an expert is on hand to minimise the length and damage in the event of a disruption. Catching a disruption before it occurs is always cheaper and less time consuming. WombatOAM is specifically designed for the monitoring and maintenance of BEAM-based systems (including EMQ x). This provides you with powerful visibility and custom alerts to stop issues before they occur. Because it’s written in Erlang!With it’s Erlang/OTP design, EMQ X fuses some of the best qualities of Erlang. A single node broker can sustain one million concurrent connections…but a single EMQ X cluster – which contains multiple nodes – can support tens of millions of concurrent connections. Inside this cluster, routing and broker nodes are deployed independently to increase the routing efficiency. Control channels and data channels are also separated – significantly improving the performance of message forwarding. EMQ X works on a soft real-time basis. No matter how many simultaneous requests are going through the system, the latency is guaranteed.Here’s how EMQ X can help with your IoT messaging needs?Erlang Solutions exists to build transformative solutions for the world’s most ambitious companies, by providing user-focused consultancy, high tech capabilities and diverse communities. Let’s talk about how we can help you.", author: "", publishing_date: "", title: "", url: "https://www.erlang-solutions.com/capabilities/emqx/"}. Reason: missing required fields

The line above indicates that the spider has dropped an item that is not an article at all but a general page. We want to exclude such URLs from our bot’s route.
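To see why this particular item was rejected, here is a minimal sketch of a required-fields check. It assumes that a field counts as missing when it is absent, nil, or an empty string; the module ValidateSketch is a hypothetical name for this illustration and is not the code of Crawly.Pipelines.Validate.

```elixir
defmodule ValidateSketch do
  @moduledoc """
  Hypothetical illustration of a required-fields check; not the code of
  Crawly.Pipelines.Validate.
  """

  # Returns the required fields that are absent, nil, or an empty string.
  def missing_fields(item, required) do
    Enum.filter(required, fn field -> Map.get(item, field) in [nil, ""] end)
  end
end

# The fields listed for the Validate pipeline in crawly.config.
required = [:title, :author, :publishing_date, :url, :article_body]

# Same shape as the dropped item in the log (article body shortened here).
general_page = %{
  url: "https://www.erlang-solutions.com/capabilities/emqx/",
  title: "",
  author: "",
  publishing_date: "",
  article_body: "Scalable and Reliable Real-time MQTT Messaging Engine ..."
}

missing = ValidateSketch.missing_fields(general_page, required)
IO.inspect(missing)
```

The general page has a body and a URL but no title, author, or publishing date, because the article selectors match nothing on it.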

Try to avoid creating unnecessary loads on a website when doing crawling activities.

The following lines can achieve this:

requests =
 Floki.find(document, ".link-to-all") |> Floki.attribute("href")
 |> Enum.filter(fn url -> String.contains?(url, "/blog/") end)
 |> Enum.map(&Crawly.Utils.request_from_url/1)

Now, we can re-run the spider and see that we’re not hitting non-blog pages anymore (don’t forget to reload the spider’s code)!

This optimised our crawler, but it was not enough to extract more items. (Among other things, it’s interesting to note that we can only reach 35 articles by following the “Keep reading” links, which suggests some possible improvements to the cross-linking inside the blog itself.)

Improving the extraction coverage

When looking at the possibility of extracting more items, we should try finding a better source of links. One good way to do it is by exploring the blog’s homepage, potentially with JavaScript turned off. Here is what I can see:

Sometimes you need to switch JavaScript off to see more.

As you can see, there are 14 Pages (only 12 of which are working), and every page contains nine articles. So we expect ~100–108 articles in total.

So let’s try to use this pagination as a source of new links! I have updated the init() function so that it refers to the blog’s index, and also parse_item so that it can use the information found there:

@impl Crawly.Spider
 def init() do
   [
     start_urls: [
       "https://www.erlang-solutions.com/blog/page/2/?pg=2",
       "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/",
       "https://www.erlang-solutions.com/blog/which-companies-are-using-elixir-and-why-mytopdogstatus/"
     ]
   ]
 end

@impl Crawly.Spider
def parse_item(response) do
 {:ok, document} = Floki.parse_document(response.body)

 case String.contains?(response.request_url, "/blog/page/") do
   false -> parse_article_page(document, response.request_url)
   true -> parse_index_page(document, response.request_url)
 end
end

defp parse_index_page(document, _url) do
 index_pages =
   document
   |> Floki.find(".page a")
   |> Floki.attribute("href")
   |> Enum.map(&Crawly.Utils.request_from_url/1)

 blog_posts =
   Floki.find(document, ".grid-card__content a.btn-link")
   |> Floki.attribute("href")
   |> Enum.filter(fn url -> String.contains?(url, "/blog/") end)
   |> Enum.map(&Crawly.Utils.request_from_url/1)

   %{items: [], requests: index_pages ++ blog_posts }
end

defp parse_article_page(document, url) do
 requests =
   Floki.find(document, ".link-to-all")
   |> Floki.attribute("href")
   |> Enum.filter(fn url -> String.contains?(url, "/blog/") end)
   |> Enum.map(&Crawly.Utils.request_from_url/1)

 item = %{
   url: url,
   title: Floki.find(document, ".page-title-sm") |> Floki.text(),
   article_body: Floki.find(document, ".default-content") |> Floki.text(),
   author: Floki.find(document, ".post-info__author") |> Floki.text(),
   publishing_date: Floki.find(document, ".header-inner .post-info .post-info__item span") |> Floki.text()
 }
 %{items: [item], requests: requests}
end
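The routing decision in parse_item can be tried out in isolation. The sketch below mirrors the String.contains?/2 check from the code above and applies it to the three start URLs; the module PageKindSketch and its kind/1 function are hypothetical names used only for this illustration.

```elixir
defmodule PageKindSketch do
  # Mirrors the String.contains?/2 check used in parse_item/1;
  # the module and function names are hypothetical.
  def kind(url) do
    if String.contains?(url, "/blog/page/"), do: :index, else: :article
  end
end

start_urls = [
  "https://www.erlang-solutions.com/blog/page/2/?pg=2",
  "https://www.erlang-solutions.com/blog/web-scraping-with-elixir/",
  "https://www.erlang-solutions.com/blog/which-companies-are-using-elixir-and-why-mytopdogstatus/"
]

kinds = Enum.map(start_urls, &PageKindSketch.kind/1)
IO.inspect(kinds)
```

Only the paginated index URL is treated as an index page; the other two are parsed as articles.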

Running it again

Now, finally, after adding all fixes, let’s reload the code and re-run the spider:

So as you can see, we have extracted 114 items, which looks quite close to what we expected!

Conclusion

Honestly speaking — running an open-source project is a complex thing. We have spent almost four years building Crawly and progressed quite a bit with the possibilities. Adding some bugs as well.

The example above shows how to run a scraper with Elixir and Floki, along with the somewhat more involved process of debugging and fixing that sometimes arises in practice.

We want to thank Erlang Solutions for supporting the development and allocating help when needed!

The post Re-implement our first blog scrapper with Crawly 0.15.0 appeared first on Erlang Solutions.