Web Scraping with Puppeteer — News Crawler

German Reynaga
3 min read · Oct 27, 2019

For this article I’m not going to dwell on what Puppeteer is, but on the functions we will use. I’m also expecting that you already know how to create a basic application with Node.js and Express. So let’s start.

The structure of this project is going to be the following.

app.js              # App entry point
src
├── config
│   └── routes      # Application endpoints
├── models
└── controllers     # Logic

As you can see this structure is pretty ugly, but it will be enough for now.
We will assume that your application already has an endpoint where users are going to access and see the collected information, but you can always find the explanation of the code in the gist below.

In the Crawler model you will find the process of opening a browser with Puppeteer and interacting with the elements in the HTML DOM. When the browser is ready, we use the page.evaluate method to access the DOM the regular JavaScript way, using document.querySelector to get an element by its id or by classes in parent elements. With this we search for the required information and return JSON to the client.

To interact with a page, first we need to use the following code, which will open a blank page waiting for the next action.

const page = await browser.newPage();
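For context, the `browser` instance used above has to be created first with a launch step. A minimal sketch, assuming a default headless launch and that Puppeteer is installed (`npm install puppeteer`); the function name `openBlankPage` is mine, not from the repository:

```javascript
// Sketch: how the `browser` used above is obtained.
async function openBlankPage() {
  // Required inside the function so the sketch stays self-contained.
  const puppeteer = require('puppeteer');
  // Launch the headless Chromium instance bundled with Puppeteer.
  const browser = await puppeteer.launch({ headless: true });
  // Open a blank tab, ready for the next action.
  const page = await browser.newPage();
  return { browser, page };
}
```

Remember to call `browser.close()` when you are done, otherwise the Chromium process keeps running.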

After that we are going to fill the blank page with a website the same way we usually do, using a URL. To be sure that the page was correctly loaded and is currently working, we tell the application to wait for a specific element or selector, like we are doing with this line.

await page.waitForSelector("[class='content']");
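Putting the navigation and the wait together, a short sketch; the URL here is a placeholder, not the one used in the repository:

```javascript
// Sketch: navigate the blank page and wait until it is usable.
async function loadNewsPage(page) {
  // Placeholder URL; swap in the real news site you want to crawl.
  await page.goto('https://example.com/news', { waitUntil: 'networkidle2' });
  // Resolves once an element with class "content" exists in the DOM.
  await page.waitForSelector("[class='content']");
}
```

The `waitUntil: 'networkidle2'` option makes `goto` resolve only after network activity settles, which helps with pages that load content via JavaScript.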

With this we are ready to evaluate the page and get the wanted information in text format, so inside the evaluate callback we are going to have the next few lines.

let elements = Array.from(document.querySelectorAll("[class='lr-row-news']"));
// Iterate through all found items
let posts = elements.map(cont => {
  // Search elements through their parent classes
  let elTitle = cont.querySelector(".title-container > .title");
  let elContent = cont.querySelector(".title-container > .summary > span");
  let elDateTime = cont.querySelector("[class='hour']");
  let elPostLink = cont.querySelector(".title-container > .title > a");
  // Return an object with the text inside the found elements
  return {
    title: (elTitle && elTitle.innerText) ? elTitle.innerText.replace('\n', ' ') : 'Not found',
    content: (elContent && elContent.innerText) ? elContent.innerText.replace('\n', ' ') : 'No content found',
    created_at: (elDateTime && elDateTime.innerText) ? elDateTime.innerText.replace('\n', ' ') : 'No date found',
    link: (elPostLink && elPostLink.href) ? elPostLink.href : 'No link found',
    provider: 'MILENIO'
  };
});
// The evaluate callback must return the data so it reaches Node
return posts;
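The repeated ternary in the block above can be factored into a small helper. A sketch; the helper name `textOrFallback` is mine, not from the repository:

```javascript
// Hypothetical helper mirroring the ternary pattern above.
// `el` is a DOM element (or null); `fallback` is the default text.
function textOrFallback(el, fallback) {
  return (el && el.innerText) ? el.innerText.replace('\n', ' ') : fallback;
}
```

Inside the callback, each field then reads the same way, e.g. `title: textOrFallback(elTitle, 'Not found')`. Note that `replace('\n', ' ')` only replaces the first newline; use a regex like `replace(/\n/g, ' ')` if the text can contain several.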

And that’s all you need to get started; you will find that the code is well commented, with enough detail to help you continue your learning process.
I hope these few lines help you get started with this amazing technology.
Have fun!

GitHub repository: https://github.com/Arquetipo28/news-crawler
