CakePHP 4 Web Scraping Using PHP Simple HTML DOM Parser

This tutorial explains how to use PHP Simple HTML DOM Parser to extract specific content from one website and publish it on another. Web scraping is the process of extracting data or content from a website. The extracted data can be published on another website or saved to a database. Unlike screen scraping, which captures the content as an image, web scraping reads the underlying HTML code to retrieve the data or content from the website.


Download the PHP Simple HTML DOM Parser
- Main website: https://simplehtmldom.sourceforge.io/
- Download: https://sourceforge.net/projects/simplehtmldom/files/
- Requires PHP 5+

Extract The Source File
The latest PHP Simple HTML DOM Parser version at the time of writing is 1.9.1. The .zip archive contains several files and folders, but only simple_html_dom.php is needed. Create a simple_html_dom folder inside the vendor folder and extract simple_html_dom.php into it, so that the file ends up at:

…\vendor\simple_html_dom\simple_html_dom.php


Embed the PHP Simple HTML DOM Parser
To load the PHP Simple HTML DOM Parser, add the following line inside a PHP block at the top of the view template where you want to display the extracted data. The require_once statement includes the PHP code from another file and skips the file if it has already been included.

require_once(ROOT . DS . 'vendor' . DS . 'simple_html_dom' . DS . 'simple_html_dom.php');
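If the file is missing from the vendor folder, require_once stops the page with a fatal error. As a minimal sketch, the path can be checked first so the view can fall back gracefully. The $parserPath and $parserLoaded variables are illustrative names, and the two define() fallbacks are only there so the snippet also runs outside CakePHP, where ROOT and DS are already defined.

```php
<?php
// Fallbacks for running this sketch outside CakePHP; inside CakePHP both
// constants already exist and these two lines do nothing.
defined('DS') || define('DS', DIRECTORY_SEPARATOR);
defined('ROOT') || define('ROOT', __DIR__);

// Same path as in the require_once line above.
$parserPath = ROOT . DS . 'vendor' . DS . 'simple_html_dom' . DS . 'simple_html_dom.php';

// Only include the parser when the file is really there.
$parserLoaded = is_file($parserPath);
if ($parserLoaded) {
    require_once $parserPath;
}
```

The view can then test $parserLoaded before calling any parser function and show a placeholder message otherwise.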


Source Website
The sample data to extract is the Malaysia Covid-19 R-Naught value, published on the official Covid-19 website of the Malaysia Ministry of Health:

http://covid-19.moh.gov.my


The R-Naught value is displayed as follows:


Google Chrome Web Scraper Extension
We need to find the HTML element that contains the data we want from the source website. Several tools can help with this task, such as Web Scraper - Free Web Scraping, an extension that can be downloaded from the Chrome Web Store: https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn

Once you’ve installed the Web Scraper extension, open Google Chrome DevTools (Ctrl+Shift+I) and switch to the Web Scraper tab.


Open ‘Create new sitemap’, then choose ‘Create sitemap’. Enter the sitemap name rnaught (all lowercase) and the start URL http://covid-19.moh.gov.my, then click ‘Create sitemap’. With the R-Naught sitemap created, click ‘Add new selector’ to capture the R-Naught data and follow the steps below:
1. Enter the selector id: rnaught
2. Click ‘Select’ in the Selector field and hover your mouse over the targeted data or content
3. Click the data or content to select it
4. The selector field will display the selector for the HTML element that contains the selected data
5. Click ‘Done selecting’. You can click ‘Data preview’ to check the extracted data
6. Click ‘Save selector’


You have now identified the selector that matches the data you want. In this case, ‘strong span’ is the selector for the element that holds the R-Naught value.
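The selector ‘strong span’ means "any span element nested inside a strong element". To see what that matches without fetching the live page, here is a small sketch that uses PHP's built-in DOM extension instead of Simple HTML DOM Parser. The sample markup is made up for illustration and does not reflect the real page, and ‘//strong//span’ is the XPath counterpart of the CSS selector.

```php
<?php
// Made-up sample markup; the real page's HTML will differ.
$sampleHtml = '<div><p>Label</p>'
    . '<strong>R-Naught: <span style="color:red"> 0.95 </span></strong></div>';

// Collect libxml warnings internally instead of printing them,
// since real-world HTML is rarely perfectly valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($sampleHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Any <span> at any depth inside a <strong>, like the CSS selector 'strong span'.
$matches = $xpath->query('//strong//span');

// item(0) is the first match in document order, or null when nothing matches.
$first = $matches->item(0);
$value = $first !== null ? trim($first->textContent) : null;

var_dump($value); // string(4) "0.95"
```

The paragraph element is ignored because it is not inside a strong element, and the inline style on the span does not appear in the extracted text.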

Print the Data
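With the selector for the R-Naught data identified, you can print the data on your CakePHP website with the following code. The second argument of find() is the index of the match to return (0 is the first match), and the plaintext property returns the element's text with the HTML tags stripped: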

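<?php
// Fetch and parse the source page; file_get_html() returns false on failure.
$html = file_get_html('http://covid-19.moh.gov.my');
// First element matching 'strong span' (index 0), or null if nothing matches.
$element = $html ? $html->find('strong span', 0) : null;
// Print the text, escaped with CakePHP's h() helper, or a fallback message.
echo $element ? h($element->plaintext) : 'R-Naught value unavailable';
?>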
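The scraped text comes straight from someone else's page, so it may carry extra whitespace or stop being a number if the page layout changes. As a minimal sketch, the text can be validated before it is displayed or saved to a database. The function parseRNaught() is a hypothetical helper written for this tutorial; it is not part of Simple HTML DOM Parser or CakePHP.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: returns the first decimal number found in the
 * scraped text, or null when the text contains no number at all.
 */
function parseRNaught(string $text): ?float
{
    // Digits with an optional fractional part, e.g. "0.95" or "1".
    if (preg_match('/\d+(?:\.\d+)?/', $text, $m) !== 1) {
        return null;
    }
    return (float) $m[0];
}

var_dump(parseRNaught(" 0.95 \n"));         // float(0.95)
var_dump(parseRNaught('Data unavailable')); // NULL
```

In the view, the value returned by plaintext could be passed through this helper, and a null result treated the same way as a failed download.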


The following is an example of the R-Naught data output that has been integrated with the CakePHP 4 Covid-19 module. 


That’s all. Happy coding :)