Puppeteer Crawling Github Guide

Introduction#

Puppeteer is a Node library for headless Chrome maintained by the Chrome team. It exposes a set of APIs that drive Chrome without a UI, which makes it well suited to scenarios such as web scraping and automation.

Usage#

Installation#

npm install puppeteer-chromium-resolver --save
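
The snippets below assume a puppeteer object is available. With puppeteer-chromium-resolver, that object comes from the resolver rather than from require('puppeteer') directly. The following is a minimal sketch based on the package's documented usage; the stats.puppeteer / stats.executablePath shape is an assumption and may differ between versions:

const PCR = require('puppeteer-chromium-resolver');

(async () => {
    // Resolve (and, if necessary, download) a local Chromium build
    const stats = await PCR({});
    // Launch the resolved Chromium through the bundled puppeteer instance
    const browser = await stats.puppeteer.launch({
        executablePath: stats.executablePath,
        headless: true,
    });
    console.log(await browser.version());
    await browser.close();
})();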

Launch/Close Browser#

const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
    // When visiting https pages, ignore certificate errors
    ignoreHTTPSErrors: true,
    // true runs Chrome headless, i.e. without showing a browser window
    headless: true,
});

// Close the browser
await browser.close();

Create a New Tab and Navigate#

const page = await browser.newPage();
await page.goto('https://github.com/'+name+'?tab=repositories'); // Navigate to the repositories page of a specific user on GitHub
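
For scraping it usually pays to wait until the page has finished loading before querying it; page.goto() accepts a waitUntil option for this. A small variation of the navigation above (name is still the GitHub username being scraped):

await page.goto('https://github.com/' + name + '?tab=repositories', {
    // Resolve once the network has been (almost) idle for 500 ms,
    // so the repository list has rendered before we query it
    waitUntil: 'networkidle2',
    timeout: 30000,
});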

Execute Functions in the Console (evaluate())#

// Get all repository URLs on the current page, plus the URL of the next page if there is one
const rep = await page.evaluate(() => {
    const url = document.querySelectorAll('.wb-break-all > a');
    const next = document.querySelector('.BtnGroup > a');
    let urlList;
    let nextUrl;
    if (url != null) {
        urlList = Array.prototype.map.call(url, (item) => item.href);
    }
    if (next != null && next.outerText === 'Next') {
        nextUrl = next.href;
    }
    return {
        urlList: urlList,
        nextUrl: nextUrl,
    };
});
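
Because the evaluate() call also returns nextUrl, it can drive a simple pagination loop. The sketch below wraps the same logic in a hypothetical getAllRepoUrls() helper and keeps following the "Next" link until there are no more pages:

// Hypothetical helper: collect repository URLs across all pages of a user's
// repositories tab by following the "Next" button found above.
async function getAllRepoUrls(page, startUrl) {
    const all = [];
    let url = startUrl;
    while (url) {
        await page.goto(url);
        const { urlList, nextUrl } = await page.evaluate(() => {
            const links = document.querySelectorAll('.wb-break-all > a');
            const next = document.querySelector('.BtnGroup > a');
            return {
                urlList: Array.prototype.map.call(links, (item) => item.href),
                nextUrl: (next != null && next.outerText === 'Next') ? next.href : null,
            };
        });
        if (urlList) {
            all.push(...urlList);
        }
        url = nextUrl;
    }
    return all;
}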

Get Page Elements#

const el = await page.$(selector)

Click on an Element#

await el.click()

Enter Content#

await el.type(text)
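
Put together, $(), click() and type() are enough to fill in a form. A small sketch using GitHub's search box; the input[name="q"] selector is an assumption about GitHub's current markup and may need adjusting:

const searchBox = await page.$('input[name="q"]');
if (searchBox != null) {
    await searchBox.click();                          // focus the input
    await searchBox.type('puppeteer', { delay: 50 }); // type with a short delay per key
    await page.keyboard.press('Enter');               // submit the search
}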

Scraping Data from Github#

Using Express, I built a small service that scrapes a specified user's follower count, the dates and number of commits made on each day, and the URL, project name, commit count, and star count of each public project, then returns the result as a JSON object.

Project link: getGithub

Example usage:

http://localhost:4000/getAllContributions/Magren0321

Example response:
(screenshot: test.png)
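
Internally, the route above can be wired up roughly like this sketch; getAllContributions() is a hypothetical stand-in for the Puppeteer crawl described above, not the project's actual function name:

const express = require('express');
const app = express();

app.get('/getAllContributions/:name', async (req, res) => {
    try {
        // Hypothetical crawler that runs Puppeteer and resolves
        // with the scraped data for req.params.name
        const data = await getAllContributions(req.params.name);
        res.json(data); // return the result as JSON
    } catch (err) {
        res.status(500).json({ error: err.message });
    }
});

app.listen(4000, () => console.log('Listening on http://localhost:4000'));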

Challenges Encountered#

When scraping the number of contributions for each year, I ran into trouble reading the elements' attributes: data-date and data-count are custom attributes defined by GitHub, so they cannot be accessed as ordinary DOM properties. In this case, getAttribute() has to be used.

The related setAttribute() and removeAttribute() methods work the same way; they set and remove attributes that do not exist on the node's prototype.

// Get the dates and number of commits for each day
async function getDateList(yearData, page) {
    await page.goto(yearData);
    const dateList = await page.evaluate(() => {
        const date = document.querySelectorAll('.ContributionCalendar-day');
        const datelist = [];
        for (const item of date) {
            const count = item.getAttribute('data-count');
            if (count != null && count != 0) {
                datelist.push({
                    data_date: item.getAttribute('data-date'),
                    data_count: count,
                });
            }
        }
        return datelist;
    });
    return dateList;
}
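
For completeness, setAttribute() and removeAttribute() work the same way inside page.evaluate(); this is only an illustrative sketch, not part of the crawler:

await page.evaluate(() => {
    const cell = document.querySelector('.ContributionCalendar-day');
    if (cell != null) {
        cell.setAttribute('data-checked', 'true'); // add a custom attribute
        cell.removeAttribute('data-checked');      // and remove it again
    }
});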

There are also some odd behaviours of the GitHub API that I won't go into here...
For example, it returns what looks like an empty array, but logging its length gives 2, and making the request directly shows it actually contains two line breaks 😵
All I can do is test against more data 😔
