12/26/2024
Day Two Continued
Great! Let’s refine the script step by step as we walk through each method Scrapling offers. Along the way, we’ll clean up the output and make it more structured and advanced. Here's how we'll approach this:
---
# # # Step 1: Fetching the Web Page
The initial step is fetching the page. This step has been confirmed to work correctly, as evidenced by the log message indicating a successful HTTP 200 response. No changes are needed here.
---
# # # Step 2: CSS Selector (`page.css`)
# # # # Current Behavior
The `page.css('.quote .text::text')` method extracts all quotes as plain text. The output is readable but not structured.
# # # # Refinement
We’ll structure the output into a list of dictionaries, each containing the quote text, author, and tags. This improves readability and usability.
# # # # Updated Code
```python
# Extract quotes, authors, and tags using CSS selectors
quotes = []
for quote in page.css('.quote'):
text = quote.css_first('.text::text').text()
author = quote.css_first('.author::text').text()
tags = [tag.text() for tag in quote.css('.tag::text')]
quotes.append({"text": text, "author": author, "tags": tags})
# Print structured output
print("Extracted Quotes (Structured):")
for quote in quotes:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
```
---
# # # Step 3: XPath (`page.xpath`)
# # # # Current Behavior
The `page.xpath('//span[="text"]/text()')` method extracts only the quote text, similar to the CSS selector.
# # # # Refinement
We’ll use XPath to extract the same structured data (quote, author, tags) for comparison with the CSS method.
# # # # Updated Code
```python
# Extract quotes, authors, and tags using XPath
quotes_xpath = []
for quote in page.xpath('//div[="quote"]'):
text = quote.xpath('.//span[="text"]/text()')[0]
author = quote.xpath('.//small[="author"]/text()')[0]
tags = quote.xpath('.//div[="tags"]/a[="tag"]/text()')
quotes_xpath.append({"text": text, "author": author, "tags": tags})
# Print structured output for XPath
print("Extracted Quotes (XPath):")
for quote in quotes_xpath:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
```
---
# # # Step 4: Chained Selectors (`page.css().css()`)
# # # # Current Behavior
Chained selectors achieve the same result as `page.css('.quote .text::text')`, but they are slower and less efficient.
# # # # Refinement
Since chained selectors are redundant for this use case, we’ll skip restructuring this part but include a note in the code explaining the inefficiency.
---
# # # Step 5: `find_all`
# # # # Current Behavior
The `find_all` method returns raw HTML elements, which are less readable.
# # # # Refinement
We’ll parse the HTML elements into structured data (quote, author, tags) for consistency with the previous methods.
# # # # Updated Code
```python
# Extract quotes, authors, and tags using find_all
quotes_find_all = []
for quote in page.find_all('div', {'class': 'quote'}):
text = quote.find('span', class_='text').text()
author = quote.find('small', class_='author').text()
tags = [tag.text() for tag in quote.find_all('a', class_='tag')]
quotes_find_all.append({"text": text, "author": author, "tags": tags})
# Print structured output for find_all
print("Extracted Quotes (find_all):")
for quote in quotes_find_all:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
```
---
# # # Step 6: Raw HTML Output
# # # # Current Behavior
Raw HTML elements are printed directly. This is helpful for debugging but not for final output.
# # # # Refinement
We’ll use `.prettify()` and `.attrib` only for debugging purposes and include a toggle to enable/disable it.
# # # # Updated Code
```python
# Debug: Print first quote element's raw HTML
if debug_mode:
first_quote = page.css_first('.quote')
print("First Quote Element (Raw HTML):")
print(first_quote.prettify())
```
---
Consolidated and Refined Script
Below is the finalized version of the script, incorporating refinements for structured outputs and debugging options:
from scrapling import Fetcher
fetcher = Fetcher(auto_match=False)
# Fetch a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all strings in the full page
page.get_all_text(ignore_tags=('script', 'style'))
# Get all quotes, any of these methods will return a list of strings (TextHandlers)
quotes = page.css('.quote .text::text') # CSS selector
quotes = page.xpath('//span[="text"]/text()') # XPath
quotes = page.css('.quote').css('.text::text') # Chained selectors
quotes = [element.text for element in page.css('.quote .text')] # Slower than bulk query above
# Get the first quote element
quote = page.css_first('.quote') # / page.css('.quote').first / page.css('.quote')[0]
# Tired of selectors? Use find_all/find
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...
# Working with elements
quote.html_content # Inner HTML
quote.prettify() # Prettified version of Inner HTML
quote.attrib # Element attributes
quote.path # DOM path to element (List)
The script has been consolidated and refined into a more structured and user-friendly format. The updates include support for CSS, XPath, and `find_all` methods, as well as a debugging option for raw HTML inspection. The problem is that it doesnt work how its supposed to so we have got to keep going with this
The issue with the edits stopping the script from working could stem from subtle changes in how data is extracted or stored. Let's simplify the debugging and ensure consistent output.
I'll revise the problematic parts of the script and ensure all methods (CSS, XPath, and `find_all`) are debugged thoroughly.
Here's a revised script:
```python
from scrapling import Fetcher
fetcher = Fetcher(auto_match=False)
# Fetch a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Extract quotes using CSS selectors
quotes_css = []
try:
for quote in page.css('.quote'):
text = quote.css_first('.text::text').text() if quote.css_first('.text::text') else "N/A"
author = quote.css_first('.author::text').text() if quote.css_first('.author::text') else "N/A"
tags = [tag.text() for tag in quote.css('.tag::text')] if quote.css('.tag::text') else []
quotes_css.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with CSS extraction: {e}")
# Print CSS extraction
print("\nExtracted Quotes (CSS):")
for quote in quotes_css:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using XPath
quotes_xpath = []
try:
for quote in page.xpath('//div[="quote"]'):
text = quote.xpath('.//span[="text"]/text()')[0] if quote.xpath('.//span[="text"]/text()') else "N/A"
author = quote.xpath('.//small[="author"]/text()')[0] if quote.xpath('.//small[="author"]/text()') else "N/A"
tags = quote.xpath('.//div[="tags"]/a[="tag"]/text()') if quote.xpath('.//div[="tags"]/a[="tag"]/text()') else []
quotes_xpath.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with XPath extraction: {e}")
# Print XPath extraction
print("\nExtracted Quotes (XPath):")
for quote in quotes_xpath:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using find_all
quotes_find_all = []
try:
for quote in page.find_all('div', {'class': 'quote'}):
text = quote.find('span', class_='text').text() if quote.find('span', class_='text') else "N/A"
author = quote.find('small', class_='author').text() if quote.find('small', class_='author') else "N/A"
tags = [tag.text() for tag in quote.find_all('a', class_='tag')] if quote.find_all('a', class_='tag') else []
quotes_find_all.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with find_all extraction: {e}")
# Print find_all extraction
print("\nExtracted Quotes (find_all):")
for quote in quotes_find_all:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
```
# # # Key Fixes:
1. **Error Handling:**
- Wrap extraction loops in `try-except` blocks to catch and report errors without stopping ex*****on.
2. **Null Checks:**
- Ensure all `.text()` or `.xpath()` calls are safely checked to avoid `AttributeError` when elements are missing.
3. **Consistent Logging:**
- Print outputs for each method even if a failure occurs to help debug inconsistencies.
Run this and let me know the results!
└─$ /home/fahm/scrapling_reddit/venv/bin/python /home/fahm/scrapling_reddit/scraper.py
[2024-12-26 04:34:19] INFO: Fetched (200) (referer: https://www.google.com/search?q=toscrape)
Error with CSS extraction: 'TextHandler' object has no attribute 'text'
Extracted Quotes (CSS):
Extracted Quotes (XPath):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
Error with find_all extraction: 'TextHandler' object is not callable
Extracted Quotes (find_all):
(venv) ┌──(venv)(fahm㉿MININT-FCGQJBQ)-[~/scrapling_reddit]
└─$
The output indicates two issues:
1. **CSS Extraction Error**:
The error message `'TextHandler' object has no attribute 'text'` suggests that the `.text()` method is being incorrectly applied to a `TextHandler` object. Instead, the correct property is likely `.text` (without parentheses).
2. **`find_all` Extraction Error**:
The error `'TextHandler' object is not callable` implies that a `TextHandler` object was mistakenly treated as a callable, likely where `.text()` or `.text` was expected.
# # # Fixes for the Issues
Here’s the updated script with corrections:
```python
from scrapling import Fetcher
fetcher = Fetcher(auto_match=False)
# Fetch a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Extract quotes using CSS selectors
quotes_css = []
try:
for quote in page.css('.quote'):
text = quote.css_first('.text::text').text if quote.css_first('.text::text') else "N/A"
author = quote.css_first('.author::text').text if quote.css_first('.author::text') else "N/A"
tags = [tag.text for tag in quote.css('.tag::text')] if quote.css('.tag::text') else []
quotes_css.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with CSS extraction: {e}")
# Print CSS extraction
print("\nExtracted Quotes (CSS):")
for quote in quotes_css:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using XPath
quotes_xpath = []
try:
for quote in page.xpath('//div[="quote"]'):
text = quote.xpath('.//span[="text"]/text()')[0] if quote.xpath('.//span[="text"]/text()') else "N/A"
author = quote.xpath('.//small[="author"]/text()')[0] if quote.xpath('.//small[="author"]/text()') else "N/A"
tags = quote.xpath('.//div[="tags"]/a[="tag"]/text()') if quote.xpath('.//div[="tags"]/a[="tag"]/text()') else []
quotes_xpath.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with XPath extraction: {e}")
# Print XPath extraction
print("\nExtracted Quotes (XPath):")
for quote in quotes_xpath:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using find_all
quotes_find_all = []
try:
for quote in page.find_all('div', {'class': 'quote'}):
text = quote.find('span', class_='text').text if quote.find('span', class_='text') else "N/A"
author = quote.find('small', class_='author').text if quote.find('small', class_='author') else "N/A"
tags = [tag.text for tag in quote.find_all('a', class_='tag')] if quote.find_all('a', class_='tag') else []
quotes_find_all.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with find_all extraction: {e}")
# Print find_all extraction
print("\nExtracted Quotes (find_all):")
for quote in quotes_find_all:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
```
# # # Key Changes:
1. **CSS Selector `.text` Property**:
- Replaced `.text()` with `.text` in CSS extraction to access the property correctly.
2. **`find_all` Fixes**:
- Correctly used `.text` instead of treating `TextHandler` objects as callable.
# # # Next Steps:
Run the updated script and let me know if all methods (`CSS`, `XPath`, and `find_all`) now produce consistent outputs.
[2024-12-26 04:35:48] INFO: Fetched (200) (referer: https://www.google.com/search?q=toscrape)
Error with CSS extraction: 'TextHandler' object has no attribute 'text'
Extracted Quotes (CSS):
Extracted Quotes (XPath):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
Extracted Quotes (find_all):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
(venv) ┌──(venv)(fahm㉿MININT-FCGQJBQ)-[~/scrapling_reddit]
└─$
The updated results highlight the following:
1. **CSS Extraction Issue**:
The error `'TextHandler' object has no attribute 'text'` persists, indicating that the `.css_first` method is not returning the expected object type or the Scrapling library's behavior differs from the assumption. This needs a closer inspection of how Scrapling handles `.css_first` and what it returns.
2. **XPath and `find_all` Success**:
Both XPath and `find_all` methods are working correctly, producing consistent and accurate results.
---
# # # Debugging CSS Extraction:
Let’s investigate and address the issue with CSS extraction. Based on the Scrapling documentation and similar cases, the problem likely arises from the way `.css_first` interacts with the `::text` pseudo-element.
# # # # Suggested Fix:
Modify the CSS extraction to log the type and content of the object returned by `.css_first` for debugging. Here’s the updated section:
```python
# Extract quotes using CSS selectors
quotes_css = []
try:
for quote in page.css('.quote'):
# Debug: Print the type and content of each '.text::text' selection
text_handler = quote.css_first('.text::text')
print(f"Debug - .text::text type: {type(text_handler)}, content: {text_handler}")
text = text_handler.text if text_handler else "N/A"
author = quote.css_first('.author::text').text if quote.css_first('.author::text') else "N/A"
tags = [tag.text for tag in quote.css('.tag::text')] if quote.css('.tag::text') else []
quotes_css.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with CSS extraction: {e}")
```
---
# # # Next Steps:
1. **Run the updated code**: Check the debug output for the `type` and `content` of the `.css_first('.text::text')` object.
2. **Adjust the code**: Based on the debug findings, we’ll refine the extraction logic to correctly handle the returned object type.
Let me know the debug output, and we’ll proceed from there!
from scrapling import Fetcher
fetcher = Fetcher(auto_match=False)
# Fetch a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Extract quotes using CSS selectors
quotes_css = []
try:
for quote in page.css('.quote'):
# Debug: Print the type and content of each '.text::text' selection
text_handler = quote.css_first('.text::text')
print(f"Debug - .text::text type: {type(text_handler)}, content: {text_handler}")
text = text_handler.text if text_handler else "N/A"
author = quote.css_first('.author::text').text if quote.css_first('.author::text') else "N/A"
tags = [tag.text for tag in quote.css('.tag::text')] if quote.css('.tag::text') else []
quotes_css.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with CSS extraction: {e}")
# Print CSS extraction
print("\nExtracted Quotes (CSS):")
for quote in quotes_css:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using XPath
quotes_xpath = []
try:
for quote in page.xpath('//div[="quote"]'):
text = quote.xpath('.//span[="text"]/text()')[0] if quote.xpath('.//span[="text"]/text()') else "N/A"
author = quote.xpath('.//small[="author"]/text()')[0] if quote.xpath('.//small[="author"]/text()') else "N/A"
tags = quote.xpath('.//div[="tags"]/a[="tag"]/text()') if quote.xpath('.//div[="tags"]/a[="tag"]/text()') else []
quotes_xpath.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with XPath extraction: {e}")
# Print XPath extraction
print("\nExtracted Quotes (XPath):")
for quote in quotes_xpath:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using find_all
quotes_find_all = []
try:
for quote in page.find_all('div', {'class': 'quote'}):
text = quote.find('span', class_='text').text if quote.find('span', class_='text') else "N/A"
author = quote.find('small', class_='author').text if quote.find('small', class_='author') else "N/A"
tags = [tag.text for tag in quote.find_all('a', class_='tag')] if quote.find_all('a', class_='tag') else []
quotes_find_all.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with find_all extraction: {e}")
# Print find_all extraction
print("\nExtracted Quotes (find_all):")
for quote in quotes_find_all:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
(venv) ┌──(venv)(fahm㉿MININT-FCGQJBQ)-[~/scrapling_reddit]
└─$ /home/fahm/scrapling_reddit/venv/bin/python /home/fahm/scrapling_reddit/scraper.py
[2024-12-26 04:45:54] INFO: Fetched (200) (referer: https://www.google.com/search?q=toscrape)
Debug - .text::text type: , content: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Error with CSS extraction: 'TextHandler' object has no attribute 'text'
Extracted Quotes (CSS):
Extracted Quotes (XPath):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
Extracted Quotes (find_all):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
(venv) ┌──(venv)(fahm㉿MININT-FCGQJBQ)-[~/scrapling_reddit]
└─$
The debug output confirms that the `.css_first('.text::text')` method from Scrapling returns a `TextHandler` object. However, this object does not have a `.text` attribute, which is causing the error. Instead, we need to directly access the string value from the `TextHandler` object.
# # # Fix for CSS Extraction
The `TextHandler` object likely needs to be cast or handled differently to access the text content. Based on Scrapling's behavior, you can try using `str()` to convert the `TextHandler` object to a string.
Here’s the updated CSS extraction code:
```python
# Extract quotes using CSS selectors
quotes_css = []
try:
for quote in page.css('.quote'):
# Use str() to get the text value from TextHandler
text_handler = quote.css_first('.text::text')
text = str(text_handler) if text_handler else "N/A"
author_handler = quote.css_first('.author::text')
author = str(author_handler) if author_handler else "N/A"
tags = [str(tag) for tag in quote.css('.tag::text')] if quote.css('.tag::text') else []
quotes_css.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with CSS extraction: {e}")
# Print CSS extraction
print("\nExtracted Quotes (CSS):")
for quote in quotes_css:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
```
# # # Explanation
1. **Using `str()`**: The `TextHandler` object is converted into a string to retrieve the text content. This approach avoids calling a non-existent `.text` attribute.
2. **Improved Robustness**: The updated logic ensures that even if a specific CSS selector fails to match or returns `None`, the code gracefully handles it without raising an error.
---
# # # Next Steps
1. Replace the CSS extraction block in your script with the updated code above.
2. Run the script again and observe the output for the "Extracted Quotes (CSS)" section.
If the issue persists or if there’s another unexpected behavior, share the updated debug output!
from scrapling import Fetcher
fetcher = Fetcher(auto_match=False)
# Fetch a web page and create an Adaptor instance
page = fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Extract quotes using CSS selectors
quotes_css = []
try:
for quote in page.css('.quote'):
# Use str() to get the text value from TextHandler
text_handler = quote.css_first('.text::text')
text = str(text_handler) if text_handler else "N/A"
author_handler = quote.css_first('.author::text')
author = str(author_handler) if author_handler else "N/A"
tags = [str(tag) for tag in quote.css('.tag::text')] if quote.css('.tag::text') else []
quotes_css.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with CSS extraction: {e}")
# Print CSS extraction
print("\nExtracted Quotes (CSS):")
for quote in quotes_css:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Print CSS extraction
print("\nExtracted Quotes (CSS):")
for quote in quotes_css:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using XPath
quotes_xpath = []
try:
for quote in page.xpath('//div[="quote"]'):
text = quote.xpath('.//span[="text"]/text()')[0] if quote.xpath('.//span[="text"]/text()') else "N/A"
author = quote.xpath('.//small[="author"]/text()')[0] if quote.xpath('.//small[="author"]/text()') else "N/A"
tags = quote.xpath('.//div[="tags"]/a[="tag"]/text()') if quote.xpath('.//div[="tags"]/a[="tag"]/text()') else []
quotes_xpath.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with XPath extraction: {e}")
# Print XPath extraction
print("\nExtracted Quotes (XPath):")
for quote in quotes_xpath:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
# Extract quotes using find_all
quotes_find_all = []
try:
for quote in page.find_all('div', {'class': 'quote'}):
text = quote.find('span', class_='text').text if quote.find('span', class_='text') else "N/A"
author = quote.find('small', class_='author').text if quote.find('small', class_='author') else "N/A"
tags = [tag.text for tag in quote.find_all('a', class_='tag')] if quote.find_all('a', class_='tag') else []
quotes_find_all.append({"text": text, "author": author, "tags": tags})
except Exception as e:
print(f"Error with find_all extraction: {e}")
# Print find_all extraction
print("\nExtracted Quotes (find_all):")
for quote in quotes_find_all:
print(f"- Quote: {quote['text']}")
print(f" Author: {quote['author']}")
print(f" Tags: {', '.join(quote['tags'])}")
(venv) ┌──(venv)(fahm㉿MININT-FCGQJBQ)-[~/scrapling_reddit]
└─$ /home/fahm/scrapling_reddit/venv/bin/python /home/fahm/scrapling_reddit/scraper.py
[2024-12-26 04:49:03] INFO: Fetched (200) (referer: https://www.google.com/search?q=toscrape)
Extracted Quotes (CSS):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
Extracted Quotes (CSS):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
Extracted Quotes (XPath):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
Extracted Quotes (find_all):
- Quote: “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
Author: Albert Einstein
Tags: change, deep-thoughts, thinking, world
- Quote: “It is our choices, Harry, that show what we truly are, far more than our abilities.”
Author: J.K. Rowling
Tags: abilities, choices
- Quote: “There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”
Author: Albert Einstein
Tags: inspirational, life, live, miracle, miracles
- Quote: “The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”
Author: Jane Austen
Tags: aliteracy, books, classic, humor
- Quote: “Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”
Author: Marilyn Monroe
Tags: be-yourself, inspirational
- Quote: “Try not to become a man of success. Rather become a man of value.”
Author: Albert Einstein
Tags: adulthood, success, value
- Quote: “It is better to be hated for what you are than to be loved for what you are not.”
Author: André Gide
Tags: life, love
- Quote: “I have not failed. I've just found 10,000 ways that won't work.”
Author: Thomas A. Edison
Tags: edison, failure, inspirational, paraphrased
- Quote: “A woman is like a tea bag; you never know how strong it is until it's in hot water.”
Author: Eleanor Roosevelt
Tags: misattributed-eleanor-roosevelt
- Quote: “A day without sunshine is like, you know, night.”
Author: Steve Martin
Tags: humor, obvious, simile
(venv) ┌──(venv)(fahm㉿MININT-FCGQJBQ)-[~/scrapling_reddit]
└─$
Call now to connect with business.