Web scraping using PHP Simple HTML DOM Parser

Asyraf Wahi Anuar - June 04, 2021

Published in Misc

This tutorial explains how to use PHP Simple HTML DOM Parser to extract specific content from a website and publish it on another website. Web scraping is the process of extracting data or content from a website; the extracted content can then be published on other websites or saved into a database. Unlike screen scraping, which captures the content as an image, web scraping reads the underlying HTML code to retrieve the data or content.

Download the PHP Simple HTML DOM Parser
- Main website: https://simplehtmldom.sourceforge.io/
- Download: https://sourceforge.net/projects/simplehtmldom/files/
- Requires PHP 5+

Extract The Source File
The latest PHP Simple HTML DOM Parser version at the time of writing is 1.9.1. The .zip archive contains several files and folders, but only simple_html_dom.php is needed. Create a simple_html_dom folder inside the vendor folder and extract simple_html_dom.php into it:

…\vendor\simple_html_dom\simple_html_dom.php

Embed the PHP Simple HTML DOM Parser
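To load PHP Simple HTML DOM Parser, add the following code at the top of the view page where you want to display the extracted data or content. The require_once statement includes the PHP code from another file and skips the include if that file has already been loaded. ROOT and DS are constants defined by CakePHP for the application root directory and the directory separator.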

require_once(ROOT . DS . 'vendor' . DS . 'simple_html_dom' . DS . 'simple_html_dom.php');
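Outside CakePHP the ROOT and DS constants do not exist, so the same path has to be built by hand. The following is a minimal sketch under that assumption: the fallback definitions and the $parserPath variable are illustrative and not part of CakePHP or the parser, and the file check simply reports a missing file instead of letting require_once stop the script.

```php
<?php
// Assumption: when the script runs outside CakePHP, define stand-ins
// for the two constants that CakePHP normally provides.
if (!defined('DS')) {
    define('DS', DIRECTORY_SEPARATOR);
}
if (!defined('ROOT')) {
    define('ROOT', __DIR__); // hypothetical project root
}

// Same path as in the tutorial, built piece by piece.
$parserPath = ROOT . DS . 'vendor' . DS . 'simple_html_dom' . DS . 'simple_html_dom.php';

if (is_file($parserPath)) {
    require_once $parserPath;
} else {
    // Report the problem instead of failing with a fatal error.
    echo 'Parser not found at: ' . $parserPath . PHP_EOL;
}
```

If the message appears, check that simple_html_dom.php was extracted into the vendor/simple_html_dom folder.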

Source Website
The sample data to extract is Malaysia's Covid-19 R-Naught value, published on the official Covid-19 website of the Malaysia Ministry of Health, which is accessible from:

http://covid-19.moh.gov.my

The R-Naught value is displayed as follows:

[Image: the R-Naught value on the source website]

Google Chrome Web Scraper Extension
First, we need to find the HTML element that contains the data we need on the source website. Several tools can help with this task. One of them is Web Scraper - Free Web Scraping, an extension that can be downloaded from the Google Chrome Store:

https://chrome.google.com/webstore/detail/web-scraper-free-web-scra/jnhgnonknehpejjnehehllkliplmbmhn

Once you’ve installed the Web Scraper extension, open Google Chrome DevTools (Ctrl+Shift+I) and navigate to the Web Scraper tab.

Navigate to ‘Create new sitemap’, then click ‘Create sitemap’. Set the sitemap name to rnaught (all lowercase) and the start URL to http://covid-19.moh.gov.my, then click ‘Create sitemap’. With the rnaught sitemap created, click ‘Add new selector’ to capture the R-Naught data. Follow the steps below:
1. Id: enter rnaught
2. Click ‘Select’ in the selector field and hover your mouse over the targeted data or content
3. Click the data or content to select it
4. The selector field displays the HTML element that contains the selected data
5. Click ‘Done selecting’. You can click ‘Data preview’ to check the extracted data.
6. Click ‘Save selector’

You have now identified the selector that contains the data or content you want. In this case, ‘strong span’ is the selector that contains the R-Naught value: it matches a span element nested inside a strong element.
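To see what a selector such as ‘strong span’ matches, it can help to try it on a small piece of markup first. The sketch below does not use Simple HTML DOM Parser; it uses PHP's bundled DOM extension (DOMDocument and DOMXPath) so that it runs without any download. The sample markup, the value 0.95, and the firstText() helper are all invented for illustration; the real page will look different.

```php
<?php
// Sample markup is invented for illustration; the real page differs.
$sampleHtml = '<div><p>R-Naught: <strong><span style="color:red"> 0.95 </span></strong></p>'
    . '<p><strong><span>second</span></strong></p></div>';

/**
 * Return the trimmed text of the first node matching an XPath query,
 * or null when nothing matches.
 */
function firstText(string $html, string $xpath): ?string
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $nodes = (new DOMXPath($doc))->query($xpath);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

// '//strong//span' is the XPath counterpart of the CSS selector 'strong span'.
$value = firstText($sampleHtml, '//strong//span');
echo $value === null ? 'No match' : $value;
echo PHP_EOL;
```

Only the first matching element is returned here, which mirrors asking for the match at index 0.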

Print the Data
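With the selector identified, you can print the R-Naught data on your CakePHP website with the following code. file_get_html() downloads and parses the page, find('strong span', 0) returns the first element that matches the selector, and plaintext returns that element's text with the HTML tags stripped: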

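<?php
// Download and parse the page; file_get_html() returns false on failure.
$html = file_get_html('http://covid-19.moh.gov.my');
// With index 0, find() returns the first match, or null if there is none.
$span = $html ? $html->find('strong span', 0) : null;
echo $span ? $span->plaintext : 'N/A';
?>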
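Because the selector depends on the layout of the source page, the scraped text can be missing or stop being a number if that page changes. The following is a minimal sketch of checking the text before displaying it. parseRNaught() is a hypothetical helper, not part of Simple HTML DOM Parser, and the inputs are made-up stand-ins for what plaintext might return.

```php
<?php
/**
 * Hypothetical helper: turn scraped text into a float, or return null when
 * the text is missing or is not a plain non-negative decimal number.
 */
function parseRNaught(?string $text): ?float
{
    if ($text === null) {
        return null;
    }
    $clean = trim($text);
    // Accept digits with an optional decimal part, e.g. "1" or "0.95".
    if (preg_match('/^\d+(\.\d+)?$/', $clean) !== 1) {
        return null;
    }
    return (float) $clean;
}

// Made-up inputs standing in for scraped text.
foreach ([' 0.95 ', '1', 'N/A', null] as $scraped) {
    $value = parseRNaught($scraped);
    echo $value === null ? 'unavailable' : number_format($value, 2);
    echo PHP_EOL;
}
```

In a view, the result of the helper could decide whether to show the value or a fallback message.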

The following is an example of the R-Naught data output that has been integrated with the CakePHP 4 Covid-19 module.

That’s all. Happy coding :)
