Channel: Active questions tagged excel - Stack Overflow

How can I optimize my web-scraper for 40k+ links

I have a list of 40k links from which I have to scrape emails/company names, and it takes forever the way I made it.

  1. All the websites look the same, but they lack ids or any proper order/classification, so I have to search for the email with a regex (or so I think). The rest is easy, since those fields are always in the same order/place. If I were using it on a properly structured website, how could it be faster?

  2. Afterwards, I add everything to a dictionary, because that way a repeated email isn't stored twice. But then I don't know how to write everything nicely into an Excel file: right now it's split into 2 columns, and it would be nice if each field had its own column. This part also feels sluggish and slow.

  3. Would it be faster to write everything out after each find, in case the code breaks? Or would that just increase the total time?

  4. The main issue is that one email maps to multiple companies and 10x more domains. Is there a way to produce rows like this?
    email - company names - domains
    email - company names - domains
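Since 40k fetches are network-bound, the single biggest speedup is usually running them concurrently rather than one at a time. A minimal sketch with a thread pool; `scrape_all` and `fetch` are hypothetical names, where `fetch` stands in for whatever per-link scraping is done:

```python
from concurrent.futures import ThreadPoolExecutor

def scrape_all(links, fetch, max_workers=20):
    """Run fetch(link) for many links concurrently.

    Network latency, not CPU, dominates this workload, so a thread pool
    gives a near-linear speedup up to a few dozen workers. `fetch` should
    return (email, details) or None when a page has no usable email.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for record in pool.map(fetch, links):
            if record is not None:
                email, details = record
                results[email] = details  # keying by email deduplicates
    return results
```

Reusing a single `requests.Session` inside `fetch` (instead of calling `requests.get` each time) would also keep TCP connections alive and shave more time off every request.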
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
from datetime import datetime


def details_pag_antillephone(url, restrans, detalii):
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find('div', class_=restrans)
    detalii = results.find_all('div', class_=detalii)
    return detalii, soup


dataDic = {}

df = pd.read_excel(r'C:\Users\Adrian\Desktop\40k_links.xlsx')
links = df['Unnamed: 7'].tolist()

start_time = datetime.now()
for i in links[10000:15000]:  # run 5000-link batches, takes too long otherwise
    if isinstance(i, str):  # skip floats/empty cells that appear in the Excel file
        url = 'http://' + i  # prepend the scheme to make the link usable
        try:  # some pages have no email, which would otherwise break the loop
            # 'col-md-9' is the box containing the text,
            # 'flex-row' holds the company, name and address rows
            validator = details_pag_antillephone(url, 'col-md-9', 'flex-row')
            company = validator[0][0].text.strip().split('\n')[1]  # always the 1st after split
            address = validator[0][2].text.strip().split('\n')[1]
            email = re.findall(r'[\w.-]+@[\w.-]+', validator[1].text)[0]
            websites = re.findall(r'www\.[\w.-]+', validator[1].text)
            dataDic[email] = [company, address, websites]
        except (AttributeError, IndexError, requests.RequestException):
            continue
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))


df = pd.DataFrame(list(dataDic.items()), columns=['Email', 'Details'])
print(df)
with pd.ExcelWriter(r'C:\Users\Adrian\Desktop\All_test_1.xlsx') as writer:
    df.to_excel(writer)
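For question 4, one approach is to accumulate per-email sets while scraping and only build the DataFrame at the end, so each field gets its own column and repeated emails merge instead of overwriting each other. A sketch under that assumption; `add_record` and `to_frame` are hypothetical helper names:

```python
import pandas as pd

def add_record(rows, email, company, domains):
    # rows maps email -> {'companies': set, 'domains': set};
    # sets make repeated companies/domains collapse automatically.
    entry = rows.setdefault(email, {'companies': set(), 'domains': set()})
    entry['companies'].add(company)
    entry['domains'].update(domains)

def to_frame(rows):
    # One row per email, with companies and domains joined into
    # readable cells -> three separate columns in the Excel output.
    return pd.DataFrame(
        [{'Email': email,
          'Company names': ', '.join(sorted(v['companies'])),
          'Domains': ', '.join(sorted(v['domains']))}
         for email, v in rows.items()]
    )
```

The resulting frame can be written with `to_frame(rows).to_excel(path, index=False)` exactly like the current two-column version.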

Example of what I'm currently working with:

<div class="col-md-9">
            <div class="box">
              <h3>Antillephone License Validation</h3>
              <div class="details subheader">This page was generated on Mar 5, 2020</div>
              <div class="separator"></div>
                <div class="flex-row">
                  <div>Company name</div>
                  <div>Hero Island N.V. </div>
                </div>
                  <div class="separator"></div>
                  <div class="flex-row">
                    <div>Trade name</div>
                    <div>Hero Island N.V. </div>
                  </div>
                <div class="separator"></div>
                <div class="flex-row">
                  <div>Address</div>
                  <div>Curacao, CW</div>
                </div>
                <div class="separator"></div>
                <div class="flex-row">
                  <div>E-mail</div>
                  <div><script>var _0x2a84=['support@casitabi.com'];(function(_0x2ed5df,_0x4a695a){var _0x2f05f3=function(_0x44dc82){while(--_0x44dc82){_0x2ed5df['push'](_0x2ed5df['shift']());}};_0x2f05f3(++_0x4a695a);}(_0x2a84,0x133));var _0x42a8=function(_0x4179b2,_0x957766){_0x4179b2=_0x4179b2-0x0;var _0x2f7fe3=_0x2a84[_0x4179b2];return _0x2f7fe3;};document['write'](_0x42a8('0x0'));</script>support@casitabi.com</div>
                </div>
                <div class="separator"></div>
                <div class="flex-row">
                  <div>Registered Website(s)</div>
                  <div>
                      <a href="//www.simplecasinojp.com"><span>www.simplecasinojp.com</span><br></a>
                      <a href="//www.casitabi.com"><span>www.casitabi.com</span><br></a>
                      <a href="//www.purecasino.com"><span>www.purecasino.com</span><br></a>
                  </div>
                </div>
                <div class="separator" style="margin-bottom: 18px;"></div>
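Given that markup, the `flex-row` blocks can be looked up by their label text instead of by position, which is clearer and keeps working if a page omits or reorders a row. A sketch based on the sample above; `parse_validator` is a hypothetical helper name:

```python
from bs4 import BeautifulSoup
import re

def parse_validator(html):
    # Each flex-row holds a label div and a value div; build a
    # label -> value map instead of indexing rows positionally.
    soup = BeautifulSoup(html, 'html.parser')
    fields = {}
    for row in soup.select('div.flex-row'):
        cells = row.find_all('div', recursive=False)
        if len(cells) >= 2:
            fields[cells[0].get_text(strip=True)] = cells[1].get_text(' ', strip=True)
    # The email cell may contain an obfuscation <script>, but the regex
    # still picks the address out of the combined text.
    email_match = re.search(r'[\w.-]+@[\w.-]+', fields.get('E-mail', ''))
    return {
        'company': fields.get('Company name', '').strip(),
        'address': fields.get('Address', '').strip(),
        'email': email_match.group(0) if email_match else None,
        'websites': re.findall(r'www\.[\w.-]+', fields.get('Registered Website(s)', '')),
    }
```

This also removes the fragile `split('\n')[1]` indexing, which is one of the things the bare `except` is currently papering over.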
