Python Web Scraping Exercises

Which Python library is commonly used to extract data from HTML and XML pages?

NumPy

BeautifulSoup

Pandas

Matplotlib

BeautifulSoup is a popular Python library for parsing HTML and XML documents. It allows you to navigate, search, and modify the parse tree easily. While NumPy and Pandas are used for numerical and tabular data processing, and Matplotlib is used for plotting, they are not designed for extracting data from web pages.
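As a rough illustration of "navigate, search, and modify", the sketch below parses a small made-up HTML string (the markup and variable names are assumptions for this example only) and touches the parse tree in each of those three ways.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Navigate: attribute access walks down to the first matching child tag.
heading = soup.body.h1.get_text()

# Search: look up a tag by name and class.
intro = soup.find("p", class_="intro")
intro_text = intro.get_text()

# Modify: replace the paragraph's text inside the parse tree.
intro.string = "Goodbye"
modified = str(intro)

print(heading)     # Title
print(intro_text)  # Hello
print(modified)    # <p class="intro">Goodbye</p>
```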

Which Python library is typically used to make HTTP requests to access web pages for scraping?

Requests

Matplotlib

Tkinter

Seaborn

The Requests library is the most commonly used Python library for sending HTTP requests to websites. It allows you to easily retrieve the HTML content of a web page, which can then be processed using libraries like BeautifulSoup or lxml. While Matplotlib and Seaborn are used for data visualization, and Tkinter is for GUI development, Requests specifically handles web communication and data fetching, making it essential for web scraping tasks.
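A minimal sketch of the fetch-then-parse flow described above. The helper name fetch_title, the timeout value, and the example URL are assumptions made for illustration; the call is left commented out because it needs network access.

```python
import requests
from bs4 import BeautifulSoup


def fetch_title(url, timeout=10):
    """Fetch a page with Requests and return its <title> text, if any."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    return soup.title.get_text() if soup.title else None


# Example call (requires network access):
# print(fetch_title("https://example.com"))
```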

Which method of a `BeautifulSoup` object is used to find the first occurrence of a specific HTML tag?

find_all()

get_text()

find()

select()

The find() method in BeautifulSoup is used to locate the first occurrence of a specific HTML tag in a parsed document. If you want to find all occurrences, you would use find_all(). The get_text() method extracts text content from tags, and select() uses CSS selectors to find elements. Using find() is efficient when you only need the first matching element, which is common in scraping tasks like extracting a page title or a single link.
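The difference also shows up when nothing matches, which is worth knowing before chaining calls. The snippet below is a small sketch on a made-up two-paragraph document.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a</p><p>b</p>", "html.parser")

first = soup.find("p")                # a single Tag
all_paragraphs = soup.find_all("p")   # a list-like collection of Tags

missing_one = soup.find("table")      # None when nothing matches
missing_all = soup.find_all("table")  # empty list when nothing matches

print(first.get_text())                          # a
print([p.get_text() for p in all_paragraphs])    # ['a', 'b']
print(missing_one, list(missing_all))            # None []
```

Because find() can return None, calling .get_text() directly on its result raises an AttributeError when the tag is absent, so a None check is a sensible precaution.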

What is the main purpose of the `lxml` library in Python web scraping?

Parsing and navigating HTML or XML documents

Creating graphical plots of scraped data

Managing databases for scraped information

The lxml library is a powerful Python library used for parsing HTML and XML documents. It allows web scrapers to efficiently navigate the structure of web pages, extract data, and handle large documents quickly. While libraries like Requests are used to fetch web pages, and plotting or database management libraries handle visualization or storage, lxml focuses on parsing and extracting structured data, often in combination with BeautifulSoup for easier manipulation.
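lxml can also be used on its own, without BeautifulSoup. The sketch below assumes a small made-up list and uses an XPath query to pull out the text of each item.

```python
from lxml import html

# Hypothetical markup used only for this illustration.
doc = html.fromstring("<html><body><ul><li>1</li><li>2</li></ul></body></html>")

# XPath: select the text node of every <li> element in the document.
items = doc.xpath("//li/text()")

print(items)  # ['1', '2']
```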

When scraping a website, which HTTP method is most commonly used to retrieve page content?

POST

DELETE

GET

PUT

The GET method is the standard HTTP request used to retrieve data from a web server, making it the most commonly used method in web scraping. When you send a GET request using libraries like Requests, the server responds with the page’s HTML content, which can then be parsed and processed. Other methods like POST are used to submit data, PUT to update resources, and DELETE to remove resources; they are less common in basic scraping scenarios. Because GET is defined as a read-only method, fetching a page with it should not change anything on the server.
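To see what a GET request looks like before it is sent, Requests can build one without touching the network. This is a sketch only; the URL and query parameters are placeholders chosen for the example.

```python
import requests

# Build (but do not send) a GET request; the URL and params are placeholders.
request = requests.Request(
    "GET",
    "https://example.com/search",
    params={"q": "python", "page": 2},
)
prepared = request.prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://example.com/search?q=python&page=2
print(prepared.body)    # None (GET parameters travel in the URL, not the body)
```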

What will the following Python web scraping code print?

from bs4 import BeautifulSoup

html_doc = """
<html>
  <body>
    <p class="title">SolutionBazz</p>
    <p class="title">Python Exercises</p>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')
result = soup.find('p', class_='title').get_text()
print(result)

['SolutionBazz', 'Python Exercises']

SolutionBazz

Python Exercises

None

The find() method in BeautifulSoup returns only the first matching element in the parsed HTML document. In this example, there are two <p> tags with the class "title". The first one contains "SolutionBazz", and the second contains "Python Exercises". Since find() only retrieves the first occurrence, get_text() extracts "SolutionBazz". To get both values, find_all() would need to be used instead.

Which of the following methods in `BeautifulSoup` uses CSS selectors to locate elements?

find()

find_all()

get_text()

select()

The select() method in BeautifulSoup allows you to find elements using CSS selectors, making it a powerful and flexible option for targeting specific elements based on tag names, classes, IDs, and more complex selectors. While find() and find_all() search by tag name or attributes directly, get_text() is used for extracting text from an element, not locating it. Using select() can simplify queries when working with pages that have complex nested structures.
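The sketch below tries a class selector, an ID-plus-child selector, and select_one() on a small made-up document; the markup is an assumption for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = """
<div id="main">
  <p class="title">First</p>
  <p>Second</p>
</div>
<p class="title">Outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Class selector: every <p> with class "title", wherever it appears.
by_class = [p.get_text() for p in soup.select("p.title")]

# ID plus child combinator: <p> tags directly inside the element with id "main".
in_main = [p.get_text() for p in soup.select("#main > p")]

# select_one() returns only the first match, similar to find().
first_title = soup.select_one("div p.title").get_text()

print(by_class)     # ['First', 'Outside']
print(in_main)      # ['First', 'Second']
print(first_title)  # First
```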

In Python’s `requests` library, which attribute of the response object returns the HTML content as a Unicode string?

status_code

json()

content

text

When you use Python’s requests.get() method to fetch a web page, the returned response object contains several attributes. The .text attribute provides the page content decoded as a Unicode string, which is suitable for feeding into a parser like BeautifulSoup. The .content attribute, on the other hand, returns the raw bytes, status_code gives the HTTP status, and json() parses the response as JSON if applicable. Using .text is the most common way to get clean, readable HTML for scraping purposes.
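The difference between .content and .text can be seen offline with a hand-built response object. Note the assumption: this sketch sets the private _content attribute purely to avoid a network call, which is not something a real scraper should do; in practice the object comes from requests.get().

```python
import requests

# Offline stand-in for the object requests.get() would return.
response = requests.Response()
response.status_code = 200
response.encoding = "utf-8"
# Private attribute, set here only so the example runs without a network call.
response._content = "<p>café</p>".encode("utf-8")

raw = response.content   # bytes, exactly as received
decoded = response.text  # str, decoded using response.encoding

print(type(raw).__name__, type(decoded).__name__)  # bytes str
print(decoded)                                     # <p>café</p>
```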

Which HTTP status code indicates that a web page request was successful and the server returned the requested content?

200

301

404

500

An HTTP status code of 200 OK means the request was successful and the server responded with the requested content, making it the most desired outcome for web scraping. A 301 indicates a permanent redirect, 404 means the page was not found, and 500 is a server error. When writing scrapers, it’s good practice to check response.status_code before attempting to parse the page, to avoid processing invalid or incomplete data.
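One way to apply that check is sketched below. The helper parse_if_ok is a hypothetical name, and the two SimpleNamespace objects are stand-ins for real Requests responses so the example runs without a network connection.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup


def parse_if_ok(response):
    """Parse the body only when the server answered 200 OK."""
    if response.status_code != 200:
        return None
    return BeautifulSoup(response.text, "html.parser")


# Stand-ins for requests responses (assumed shape: status_code and text).
ok = SimpleNamespace(status_code=200, text="<title>Home</title>")
missing = SimpleNamespace(status_code=404, text="<h1>Not Found</h1>")

print(parse_if_ok(ok).title.get_text())  # Home
print(parse_if_ok(missing))              # None
```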

When scraping websites that load data dynamically using JavaScript, which Python library is often used to automate a browser and retrieve the fully rendered page?

lxml

Selenium

BeautifulSoup

Requests

Selenium is a Python library that can automate real web browsers like Chrome or Firefox, allowing you to interact with web pages as if you were a human user. This makes it possible to scrape content that is loaded dynamically using JavaScript, which cannot be fetched by simply requesting the raw HTML with Requests. While lxml and BeautifulSoup are excellent HTML parsers, they can only process static content, and Requests retrieves HTML without executing JavaScript. Selenium bridges this gap by running the actual browser engine.

Which of the following functions in Python’s `time` module is often used to pause a scraper between requests to avoid overwhelming a server?

time_now()

delay()

sleep()

pause()

The sleep() function from Python’s built-in time module is commonly used in web scraping scripts to introduce delays between requests. This helps prevent sending too many requests in a short period, which could get your IP blocked or flagged by the website. For example, time.sleep(2) pauses execution for 2 seconds. Functions like time_now(), delay(), and pause() do not exist in the time module, making sleep() the correct choice.
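A minimal sketch of spacing out requests with time.sleep(). The function name crawl_politely and the fetch callable are assumptions for illustration; a real scraper would pass a function that performs the HTTP request, and the short delay here just keeps the demo fast.

```python
import time


def crawl_politely(urls, fetch, delay_seconds=2.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Stand-in fetch function so the example runs without network access.
pages = crawl_politely(
    ["/a", "/b", "/c"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.05,
)
print(pages)  # ['<html>/a</html>', '<html>/b</html>', '<html>/c</html>']
```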

When parsing HTML with `BeautifulSoup`, which parser is generally the fastest among the commonly used options?

html.parser

xml

lxml

html5lib

Among the popular parsers supported by BeautifulSoup (html.parser, lxml, and html5lib), the lxml parser is usually the fastest. It is implemented in C and optimized for speed, making it ideal for processing large HTML documents quickly. html.parser is built into Python but is slower and less lenient with malformed HTML, while html5lib is the most forgiving but significantly slower. The xml parser is specialized for XML documents rather than HTML. Choosing the right parser can significantly improve scraping performance.
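Speed is not the only difference: parsers can also build slightly different trees from the same input. The sketch below feeds the same made-up fragment to html.parser and lxml (assuming lxml is installed) and prints what each one produces.

```python
from bs4 import BeautifulSoup

fragment = "<p>Hi</p>"

builtin = BeautifulSoup(fragment, "html.parser")
fast = BeautifulSoup(fragment, "lxml")

# html.parser keeps the fragment as given.
print(builtin)  # <p>Hi</p>

# lxml wraps the fragment in <html> and <body> to form a full document.
print(fast)     # <html><body><p>Hi</p></body></html>
```

Queries such as find('p') behave the same on both trees here, but code that relies on soup.body or on exact output should be tested with the parser it will actually use.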

If you want to scrape data from multiple pages of a site efficiently, which Python library is best suited for running requests concurrently?

threading

asyncio

multiprocessing

concurrent.futures

The concurrent.futures module provides a high-level interface for running tasks concurrently, making it excellent for scraping multiple pages in parallel. By using ThreadPoolExecutor for I/O-bound tasks like web requests, you can significantly reduce scraping time compared to running requests sequentially. While threading and asyncio are also valid approaches, concurrent.futures is simpler to implement and more beginner-friendly. multiprocessing is better suited for CPU-bound tasks rather than network-bound operations like scraping.
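The sketch below shows the ThreadPoolExecutor pattern with a stand-in fetch function. fake_fetch, the sleep that simulates network latency, and the URLs are all assumptions for illustration; a real scraper would call requests.get() inside the worker.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for an HTTP request: wait briefly, then return fake HTML."""
    time.sleep(0.1)  # simulated network latency
    return f"<html>{url}</html>"


# Placeholder URLs for the example.
urls = [f"https://example.com/page/{n}" for n in range(1, 6)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as executor:
    # map() runs the calls concurrently and returns results in input order.
    pages = list(executor.map(fake_fetch, urls))
elapsed = time.perf_counter() - start

print(len(pages))  # 5
print(pages[0])    # <html>https://example.com/page/1</html>
print(f"{elapsed:.2f}s")  # well under the 0.5s a sequential loop would need
```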

What will the following code print?

import requests
from bs4 import BeautifulSoup

html = "<div><p class='msg'>Hello</p><p class='msg'>World</p></div>"
soup = BeautifulSoup(html, 'lxml')
print(soup.find_all('p', class_='msg')[1].text)

Hello

World

['Hello', 'World']

msg

In this code:

find_all('p', class_='msg') returns a list of all <p> tags with the class msg.
The [1] index accesses the second <p> tag, which contains "World".
.text extracts only the inner text of that element.

So the output is:

World

The other options are incorrect because Hello is the text of the first element (index 0), ['Hello', 'World'] would require collecting the text of every match instead of indexing a single one, and msg is the class name, not the text.

When scraping a page with `BeautifulSoup`, how can you extract the value of an HTML attribute, such as the `href` of a link?

element.href()

element['href']

element.get_href()

element.get('link')

In BeautifulSoup, once you have selected an element, you can access an attribute value using dictionary-style syntax. For example:

from bs4 import BeautifulSoup

html = '<a href="https://solutionbazz.com">Visit</a>'
soup = BeautifulSoup(html, 'lxml')
link = soup.find('a')
print(link['href'])  # Output: https://solutionbazz.com

Here, link['href'] retrieves the href attribute of the <a> tag. Other options are invalid because element.href() and element.get_href() are not BeautifulSoup methods, and element.get('link') would return None since there is no attribute named link.
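Dictionary-style access raises a KeyError when the attribute is missing, so .get() is often the safer choice on real pages. The sketch below uses made-up markup in which one link has no href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second <a> has no href attribute.
soup = BeautifulSoup('<a href="/home">Home</a><a>No link</a>', "html.parser")
links = soup.find_all("a")

present = links[0]["href"]     # '/home'
absent = links[1].get("href")  # None instead of an exception

try:
    links[1]["href"]
    raised = False
except KeyError:
    raised = True

# href=True keeps only the tags that actually carry the attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

print(present, absent, raised, hrefs)  # /home None True ['/home']
```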

Consider the following code snippet:

from bs4 import BeautifulSoup

html = "<ul><li>1</li><li>2</li><li>3</li></ul>"
soup = BeautifulSoup(html, 'lxml')
print(soup.find('li').find_next_sibling().text)

What will it print?

2

1

3

None

soup.find('li') finds the first <li> element, which contains "1".
.find_next_sibling() returns the next sibling element at the same level in the HTML, which is the second <li> containing "2".
.text extracts the inner text, giving "2".

This is tricky because beginners often assume find() returns all elements or that .find_next_sibling() would skip elements. Understanding sibling navigation is crucial in scraping structured HTML.
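A related detail: when the HTML contains line breaks between tags, the .next_sibling attribute returns the whitespace text node, while find_next_sibling() skips ahead to the next tag. A small sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Same list as above, but with newlines between the tags.
html = "<ul>\n<li>1</li>\n<li>2</li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li")

# .next_sibling is the newline text node that follows the first <li>.
print(repr(str(first.next_sibling)))              # '\n'

# find_next_sibling() ignores text nodes and returns the next matching tag.
print(first.find_next_sibling("li").get_text())   # 2
```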

Which of the following `BeautifulSoup` expressions correctly selects all `<p>` tags inside a `<div>`?

soup.find_all('div.p')

soup.get('div.p')

soup.find('div').text('p')

soup.select('div > p')

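Option 4 uses CSS selector syntax with .select(). div > p selects all <p> tags that are direct children of a <div> (use div p to match <p> tags nested at any depth).
Option 1 is valid Python but finds nothing: find_all() treats 'div.p' as a literal tag name, not a CSS selector, so it returns an empty list.
Option 3 fails because .text is a string property, not a method, so calling .text('p') raises a TypeError.
Option 2 uses .get() incorrectly; .get() only retrieves attribute values, not elements.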
```
from bs4 import BeautifulSoup

html = "<div><p>A</p><p>B</p></div>"
soup = BeautifulSoup(html, 'lxml')
print([p.text for p in soup.select('div > p')])
# Output: ['A', 'B']
```
This question is tricky because many assume find_all('div.p') works like CSS selectors, but find_all does not parse CSS syntax.

Which Python statement will correctly retrieve all `<a>` tag URLs from the following HTML?

<div>
  <a href="https://site1.com">Site1</a>
  <a href="https://site2.com">Site2</a>
</div>

[link.href for link in soup.find_all('a')]

[link['href'] for link in soup.find_all('a')]

[link.get('url') for link in soup.find('a')]

soup.select('a.text')

soup.find_all('a') returns a list of all <a> elements.

Using dictionary-style access link['href'] retrieves the href attribute from each <a> tag.

from bs4 import BeautifulSoup

html = """
<div>
  <a href="https://site1.com">Site1</a>
  <a href="https://site2.com">Site2</a>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
urls = [link['href'] for link in soup.find_all('a')]
print(urls)
# Output: ['https://site1.com', 'https://site2.com']

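Option 1 is incorrect because attribute-style access such as link.href looks for a child tag named <href>, not the href attribute, so it returns None.
Option 3 is incorrect because find('a') returns only the first <a> tag rather than a list of links, and the tags have no url attribute.
Option 4 selects <a> tags that have the class text; none exist, so it returns an empty list.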

Given the HTML below, which `BeautifulSoup` code snippet extracts the text `"Python Exercises"`?

<div class="content">
  <h1>Welcome</h1>
  <p class="lesson">Python Exercises</p>
</div>

soup.find('div').text('lesson')

soup.find('p', class_='lesson').get('text')

soup.select('div > h1')[0].text

soup.find('p', class_='lesson').text

soup.find('p', class_='lesson') selects the <p> tag with class "lesson".
.text extracts the inner text of that element, giving "Python Exercises".

Example:

from bs4 import BeautifulSoup

html = '<div class="content"><h1>Welcome</h1><p class="lesson">Python Exercises</p></div>'
soup = BeautifulSoup(html, 'lxml')
print(soup.find('p', class_='lesson').text)
# Output: Python Exercises

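Option 1 fails because .text is a string property, not a method, so calling .text('lesson') raises a TypeError.
Option 2 incorrectly uses .get('text'); .get() looks up an HTML attribute named text, which this tag does not have, so it returns None.
Option 3 selects the <h1> tag, not the <p> tag we need.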

What will the following code print?

from bs4 import BeautifulSoup

html = "<div><span>One</span><span>Two</span><span>Three</span></div>"
soup = BeautifulSoup(html, 'lxml')
spans = soup.find_all('span')
print(spans[-1].text)

Three

One

Two

['One', 'Two', 'Three']

soup.find_all('span') returns a list of all <span> tags: [<span>One</span>, <span>Two</span>, <span>Three</span>].
Using spans[-1] accesses the last element of the list, which contains "Three".
.text extracts the inner text of the element.

Example:

from bs4 import BeautifulSoup

html = "<div><span>One</span><span>Two</span><span>Three</span></div>"
soup = BeautifulSoup(html, 'lxml')
spans = soup.find_all('span')
print(spans[-1].text)  # Output: Three
