This code is not doing what is needed. I want the first step to ‘clean the sheet’ of anything not in the Description column by taking the first words, up to the comma, and placing them in another file. From there we will then read the new file and research the products one at a time with a slight delay between requests.

```python
import requests
import csv
from bs4 import BeautifulSoup

def process_item_file(input_filename, processed_filename):
    """
    Process the input CSV file and extract the description up to the first comma.
    Write the cleaned descriptions into a new CSV file.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        descriptions = [row[0] for row in reader if row]  # Read descriptions

    with open(processed_filename, mode='w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        for description in descriptions:
            cleaned_description = description.split(',')[0]  # Take only the part before the first comma
            writer.writerow([cleaned_description])  # Write the cleaned description

def get_item_name_and_products(description):
    """
    Search eBay for the cleaned description and extract the first 3 products' details.
    """
    search_query = description.split(',')[0]  # Use everything before the first comma as the search term

    # Construct eBay search URL
    url = f"https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2334524.m570.l1313&_nkw={search_query.replace(' ', '+')}&_sacat=0"

    try:
        # Send GET request to eBay URL with a timeout (e.g., 10 seconds)
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an HTTPError if the response was not successful

        # Parse the HTML content of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find the product listings
        product_items = soup.find_all('li', class_='s-item')[:3]  # Get first 3 products

        products = []
        for product in product_items:
            title = product.find('h3', class_='s-item__title')
            price = product.find('span', class_='s-item__price')
            link = product.find('a', class_='s-item__link')
            if title and price and link:
                product_info = {
                    'Title': title.get_text(),
                    'Price': price.get_text(),
                    'Link': link['href']
                }
                products.append(product_info)
        return products

    except requests.exceptions.Timeout:
        print(f"Timeout occurred while researching: {description}")
        return []
    except requests.exceptions.RequestException as e:
        print(f"Error occurred while researching {description}: {e}")
        return []

def process_csv(input_filename, output_filename):
    """
    Process the new cleaned CSV file (processed_items.csv) to scrape eBay products
    and save the results to the output file.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        descriptions = [row[0] for row in reader if row]  # Read the descriptions

    # Open output CSV to append the results
    with open(output_filename, mode='a', newline='', encoding='utf-8') as outfile:
        fieldnames = ['Title', 'Price', 'Link']  # Results fields
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)

        # Write the header only if the file is empty
        if outfile.tell() == 0:
            writer.writeheader()

        # Loop through each cleaned description, research it, and write results to CSV
        for description in descriptions:
            print(f"Researching description: {description}")
            products = get_item_name_and_products(description)

            # Write the results for each product
            for product in products:
                writer.writerow(product)
            print(f"Results for {description.split(',')[0]} saved.")  # Print only the part before the comma

# Specify input and output CSV file names
input_csv = 'items.csv'                # This file should contain the list of descriptions in the first column
processed_csv = 'processed_items.csv'  # Intermediate file with descriptions up to the first comma
output_csv = 'items_with_results.csv'  # The results will be appended here

# Step 1: Process the input file to create the new CSV with cleaned descriptions
process_item_file(input_csv, processed_csv)

# Step 2: Process the cleaned file and scrape eBay results
process_csv(processed_csv, output_csv)
```
The code structure is good, but a few changes are needed for it to do what you describe. Below is a refined version of your code with those changes and additions:
1. **Refinement of Step 1: “Clean the sheet”**
Ensure the function extracts only the necessary descriptions from the “Description” column and handles edge cases.
2. **Processing the cleaned file:**
Add a delay between requests to respect eBay’s rate limits.
3. **General Code Improvements:**
Use proper error handling, logging, and delays for web scraping. Also, ensure the description extraction process doesn’t inadvertently skip or include unnecessary entries.
Here’s the revised code:
```python
# File paths for clarity
input_csv = 'items.csv'                # Original CSV file
processed_csv = 'processed_items.csv'  # Intermediate file
output_csv = 'items_with_results.csv'  # Final results file

import csv
import time
import requests
from bs4 import BeautifulSoup

def process_item_file(input_filename, processed_filename):
    """
    Extract the first part of descriptions (up to the first comma) from the input file
    and save them in the processed file.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        cleaned_rows = []
        for row in reader:
            if row and row[0]:  # Check if the row exists and is non-empty
                cleaned_description = row[0].split(',')[0].strip()  # Extract up to the first comma
                if cleaned_description:  # Skip empty entries
                    cleaned_rows.append([cleaned_description])

    with open(processed_filename, mode='w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(cleaned_rows)

def get_item_name_and_products(description):
    """
    Search eBay for the cleaned description and extract the first 3 products' details.
    """
    query = description.replace(' ', '+')  # Prepare query for eBay search URL
    url = f"https://www.ebay.com/sch/i.html?_nkw={query}"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        products = []
        product_items = soup.find_all('li', class_='s-item', limit=3)
        for product in product_items:
            title_elem = product.find('h3', class_='s-item__title')
            price_elem = product.find('span', class_='s-item__price')
            link_elem = product.find('a', class_='s-item__link')
            if title_elem and price_elem and link_elem:
                products.append({
                    'Title': title_elem.text.strip(),
                    'Price': price_elem.text.strip(),
                    'Link': link_elem['href'].strip()
                })
        return products
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for {description}: {e}")
        return []

def process_csv(input_filename, output_filename):
    """
    Read the processed descriptions file, search eBay, and save the results.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        descriptions = [row[0] for row in reader if row]  # Extract descriptions

    with open(output_filename, mode='w', newline='', encoding='utf-8') as outfile:
        fieldnames = ['Title', 'Price', 'Link']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for description in descriptions:
            print(f"Searching for: {description}")
            products = get_item_name_and_products(description)
            for product in products:
                writer.writerow(product)
            time.sleep(2)  # Add a delay to avoid overwhelming the eBay server

# Workflow execution
process_item_file(input_csv, processed_csv)   # Step 1: Clean the descriptions
process_csv(processed_csv, output_csv)        # Step 2: Scrape eBay products
```
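Two optional hardening tweaks to the fetch step, sketched here as assumptions rather than required changes: `urllib.parse.quote_plus` keeps descriptions containing characters such as `&` or `#` from breaking the search URL, and sending a browser-like `User-Agent` header often avoids blocked or stripped-down responses.

```python
from urllib.parse import quote_plus

import requests

# Assumed, browser-like header; any realistic User-Agent string will do.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def build_search_url(description):
    # quote_plus() URL-encodes the query, turning spaces into '+' and
    # escaping characters that str.replace(' ', '+') would leave in place.
    return f"https://www.ebay.com/sch/i.html?_nkw={quote_plus(description)}"

# Hypothetical usage inside get_item_name_and_products:
#     response = requests.get(build_search_url(description), headers=HEADERS, timeout=10)
```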
### Key Improvements:
1. **Edge Case Handling:**
– Checked for empty rows and skipped them while processing.
– Avoided appending empty descriptions to the intermediate file.
2. **Delay Between Requests:**
– Introduced a `time.sleep(2)` call in the `process_csv` function to reduce request frequency.
3. **Refined Parsing Logic:**
– Limited product extraction to the first 3 valid items (`limit=3` in `find_all`).
4. **Logging and Error Handling:**
– Network errors (timeouts, HTTP errors) are caught and reported so a single failed search does not stop the whole run.
### Suggestions:
**a.** Add unit tests to validate the functions, especially `process_item_file`.
**b.** Enhance logging by saving errors to a log file for better debugging.
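For suggestion **b**, a minimal sketch using the standard `logging` module. The file name `scrape_errors.log` and the log format are arbitrary choices, not something the script above requires.

```python
import logging

# Configure a simple error log once, near the top of the script.
logging.basicConfig(
    filename='scrape_errors.log',
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)

def log_fetch_error(description, exc):
    """Record a failed eBay lookup so it can be reviewed after the run."""
    logging.error("Error fetching data for %s: %s", description, exc)

# In get_item_name_and_products, the except block would then call:
#     log_fetch_error(description, e)
```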
This needs to point to column 4 for the description.

```python
# File paths for clarity
input_csv = 'items.csv'                # Original CSV file
processed_csv = 'processed_items.csv'  # Intermediate file
output_csv = 'items_with_results.csv'  # Final results file

import csv
import time
import requests
from bs4 import BeautifulSoup

def process_item_file(input_filename, processed_filename):
    """
    Extract the first part of descriptions (up to the first comma) from the input file
    and save them in the processed file.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        cleaned_rows = []
        for row in reader:
            if row and row[0]:  # Check if the row exists and is non-empty
                cleaned_description = row[0].split(',')[0].strip()  # Extract up to the first comma
                if cleaned_description:  # Skip empty entries
                    cleaned_rows.append([cleaned_description])

    with open(processed_filename, mode='w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(cleaned_rows)

def get_item_name_and_products(description):
    """
    Search eBay for the cleaned description and extract the first 3 products' details.
    """
    query = description.replace(' ', '+')  # Prepare query for eBay search URL
    url = f"https://www.ebay.com/sch/i.html?_nkw={query}"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        products = []
        product_items = soup.find_all('li', class_='s-item', limit=3)
        for product in product_items:
            title_elem = product.find('h3', class_='s-item__title')
            price_elem = product.find('span', class_='s-item__price')
            link_elem = product.find('a', class_='s-item__link')
            if title_elem and price_elem and link_elem:
                products.append({
                    'Title': title_elem.text.strip(),
                    'Price': price_elem.text.strip(),
                    'Link': link_elem['href'].strip()
                })
        return products
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for {description}: {e}")
        return []

def process_csv(input_filename, output_filename):
    """
    Read the processed descriptions file, search eBay, and save the results.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        descriptions = [row[0] for row in reader if row]  # Extract descriptions

    with open(output_filename, mode='w', newline='', encoding='utf-8') as outfile:
        fieldnames = ['Title', 'Price', 'Link']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for description in descriptions:
            print(f"Searching for: {description}")
            products = get_item_name_and_products(description)
            for product in products:
                writer.writerow(product)
            time.sleep(2)  # Add a delay to avoid overwhelming the eBay server

# Workflow execution
process_item_file(input_csv, processed_csv)   # Step 1: Clean the descriptions
process_csv(processed_csv, output_csv)        # Step 2: Scrape eBay products
```
To update your script to extract the description from column 4 (index 3 in zero-based indexing), you’ll need to adjust the row index access in the `process_item_file` function. Here’s the updated code:
```python
# File paths for clarity
input_csv = 'items.csv'                # Original CSV file
processed_csv = 'processed_items.csv'  # Intermediate file
output_csv = 'items_with_results.csv'  # Final results file

import csv
import time
import requests
from bs4 import BeautifulSoup

def process_item_file(input_filename, processed_filename):
    """
    Extract the first part of descriptions (up to the first comma) from column 4 of the input file
    and save them in the processed file.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        cleaned_rows = []
        for row in reader:
            if row and len(row) > 3:  # Ensure the row has at least 4 columns
                cleaned_description = row[3].split(',')[0].strip()  # Extract up to the first comma from column 4
                if cleaned_description:  # Skip empty entries
                    cleaned_rows.append([cleaned_description])

    with open(processed_filename, mode='w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(cleaned_rows)

def get_item_name_and_products(description):
    """
    Search eBay for the cleaned description and extract the first 3 products' details.
    """
    query = description.replace(' ', '+')  # Prepare query for eBay search URL
    url = f"https://www.ebay.com/sch/i.html?_nkw={query}"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, 'html.parser')

        products = []
        product_items = soup.find_all('li', class_='s-item', limit=3)
        for product in product_items:
            title_elem = product.find('h3', class_='s-item__title')
            price_elem = product.find('span', class_='s-item__price')
            link_elem = product.find('a', class_='s-item__link')
            if title_elem and price_elem and link_elem:
                products.append({
                    'Title': title_elem.text.strip(),
                    'Price': price_elem.text.strip(),
                    'Link': link_elem['href'].strip()
                })
        return products
    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for {description}: {e}")
        return []

def process_csv(input_filename, output_filename):
    """
    Read the processed descriptions file, search eBay, and save the results.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        descriptions = [row[0] for row in reader if row]  # Extract descriptions

    with open(output_filename, mode='w', newline='', encoding='utf-8') as outfile:
        fieldnames = ['Title', 'Price', 'Link']
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()
        for description in descriptions:
            print(f"Searching for: {description}")
            products = get_item_name_and_products(description)
            for product in products:
                writer.writerow(product)
            time.sleep(2)  # Add a delay to avoid overwhelming the eBay server

# Workflow execution
process_item_file(input_csv, processed_csv)   # Step 1: Clean the descriptions
process_csv(processed_csv, output_csv)        # Step 2: Scrape eBay products
```
### Changes Made:
1. **Column Index Updated:**
In the `process_item_file` function, changed `row[0]` to `row[3]` to extract from the fourth column (index 3); an alternative that selects the column by its header name instead is sketched after this list.
2. **Validation for Column Existence:**
Added a check (`len(row) > 3`) to ensure the row has enough columns before accessing column 4.
3. **Documentation Update:**
Updated the docstring in `process_item_file` to reflect the use of column 4.
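If `items.csv` has a header row, a possible alternative (a sketch, assuming the column is literally titled `Description`) is to select the column by name with `csv.DictReader`. This keeps working if the column moves, and it also skips the header row automatically, which the index-based version would otherwise pass to the eBay search as if it were a product description.

```python
import csv

def process_item_file_by_header(input_filename, processed_filename, column='Description'):
    """Variant of process_item_file that picks the column by header name.
    Assumes the first row of the CSV is a header containing 'Description'."""
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.DictReader(infile)  # Rows become dicts keyed by the header row
        cleaned_rows = []
        for row in reader:
            value = (row.get(column) or '').split(',')[0].strip()
            if value:  # Skip rows where the Description cell is missing or empty
                cleaned_rows.append([value])

    with open(processed_filename, mode='w', newline='', encoding='utf-8') as outfile:
        csv.writer(outfile).writerows(cleaned_rows)
```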
### Suggestions:
**a.** Run the script with a test CSV file to verify that the description extraction and further steps are working as expected.
**b.** Add a fallback mechanism to log or handle rows without enough columns gracefully.
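For suggestion **b**, a sketch of one such fallback: a drop-in variant of `process_item_file` that records which input lines were skipped and reports them at the end, instead of dropping them silently. The handling of valid rows is unchanged; the reporting format is an arbitrary choice.

```python
import csv

def process_item_file(input_filename, processed_filename):
    """Same column-4 extraction as above, but report skipped rows."""
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        cleaned_rows = []
        skipped = []
        for line_number, row in enumerate(reader, start=1):
            value = row[3].split(',')[0].strip() if len(row) > 3 else ''
            if value:
                cleaned_rows.append([value])
            else:
                skipped.append(line_number)  # Too few columns, or column 4 is empty

    with open(processed_filename, mode='w', newline='', encoding='utf-8') as outfile:
        csv.writer(outfile).writerows(cleaned_rows)

    if skipped:
        print(f"Skipped {len(skipped)} row(s) without a usable column 4: lines {skipped}")
```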
Cannot read properties of undefined (reading 'length')
The error “Cannot read properties of undefined (reading ‘length’)” suggests that the code is attempting to access a property of an undefined or null object, likely because a row or object in the data does not exist as expected.
That message is a JavaScript error, so it cannot be raised by the provided Python script: if a row did not have enough elements, Python would raise an `IndexError` instead. If you are seeing the error in another tool, or if the data being processed has rows that are unexpectedly empty, the problem is likely invalid CSV data or incorrect assumptions about the structure of the rows.
### Debugging and Fix
1. **Check Row Existence and Length Safely**
Before accessing elements in a row, ensure the row exists and contains the expected number of columns.
2. **Improve Validation in `process_item_file`**
Update the function to handle cases where rows might be empty or malformed.
Here’s the updated code snippet for better error handling:
```python
def process_item_file(input_filename, processed_filename):
    """
    Extract the first part of descriptions (up to the first comma) from column 4 of the input file
    and save them in the processed file.
    """
    with open(input_filename, mode='r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile)
        cleaned_rows = []
        for row in reader:
            if row and len(row) > 3 and row[3]:  # Row exists, has at least 4 columns, and column 4 is not empty
                cleaned_description = row[3].split(',')[0].strip()  # Extract up to the first comma from column 4
                if cleaned_description:  # Skip rows where column 4 is empty
                    cleaned_rows.append([cleaned_description])

    with open(processed_filename, mode='w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerows(cleaned_rows)
```
### Key Updates:
1. **`row and len(row) > 3 and row[3]`**
– Checks that:
– `row` exists (is not `None` or empty).
– The row has at least 4 columns (`len(row) > 3`).
– The fourth column (`row[3]`) is not empty.
2. **Graceful Handling:**
If the row is malformed, the loop skips it without trying to process it, avoiding potential errors.
### Steps to Debug Further:
– **Print Debug Statements:** Add debug prints to log the rows being processed:
```python
for row in reader:
    print(f"Processing row: {row}")
    # Add your checks here
```
– **Check the Input CSV File:** Ensure the CSV file is well-formed and does not contain rows with fewer columns than expected.
### Suggestions:
**a.** Add robust logging to track rows skipped due to missing or invalid data.
**b.** Validate the CSV file structure manually or programmatically before running the script.
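Along the lines of suggestion **b**, a small pre-flight check that can run before the main workflow. The function name and the four-column requirement are illustrative assumptions based on the column-4 layout discussed above.

```python
import csv

def validate_items_csv(filename, required_columns=4):
    """Confirm the file opens and every row has at least `required_columns`
    columns; report the line numbers of rows that do not."""
    bad_lines = []
    with open(filename, mode='r', newline='', encoding='utf-8') as infile:
        for line_number, row in enumerate(csv.reader(infile), start=1):
            if len(row) < required_columns:
                bad_lines.append(line_number)
    if bad_lines:
        print(f"{filename}: {len(bad_lines)} row(s) have fewer than {required_columns} columns: {bad_lines}")
        return False
    print(f"{filename}: all rows have at least {required_columns} columns.")
    return True

# Hypothetical usage before the workflow execution steps:
# if validate_items_csv(input_csv):
#     process_item_file(input_csv, processed_csv)
#     process_csv(processed_csv, output_csv)
```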