I have a list of 40k links from which I have to scrape emails/company names - it takes forever the way I made it.
All the websites look the same, but they lack IDs or any proper order/classification. So I have to search for the email with a regex (or so I think). The rest is easy, since those fields are always in the same order/place. If I were using this on a properly structured website, how could it be faster?
Afterwards, I add everything to a dictionary - that way, if an email repeats, it isn't added again. But then I don't know how to write everything nicely into an Excel file. Right now it's split into 2 columns; it would be nice to have each field in its own column. This part also feels sluggish and slow.
Would it be faster to write everything out after each find, in case the code breaks? Or would that just extend the run time?
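One common answer to the "in case the code breaks" worry is to append each record to a CSV as soon as it is scraped, so a crash partway through the 40k links loses nothing already written. This is a minimal sketch; the file path, field order, and the sample values are my assumptions, not from the original code.

```python
import csv

def append_row(path, email, company, address, websites):
    # Hypothetical helper: open in append mode and flush one row per find.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # Join the website list into a single cell so the row stays flat.
        writer.writerow([email, company, address, ';'.join(websites)])

# Example call with made-up data:
append_row('progress.csv', 'a@b.com', 'Acme N.V.', 'Curacao, CW',
           ['www.one.com', 'www.two.com'])
```

The per-row open/write/close costs a little time, but it is negligible next to the network requests, and you can re-run from where the CSV left off.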
- The main issue is that one email can belong to multiple companies and 10x more domains. Is there a way to produce rows like this?
email - company names - domains
email - company names - domains
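The one-row-per-email shape above can be built by aggregating into sets keyed by email and only then creating the DataFrame, with each collection flattened into its own column. This is a sketch under my own assumptions - the column names and the `scraped` sample data are illustrative, not from the original code:

```python
from collections import defaultdict
import pandas as pd

# Collect every company and domain seen for each email.
records = defaultdict(lambda: {'companies': set(), 'domains': set()})

scraped = [  # stand-in for the per-page scrape results
    ('support@casitabi.com', 'Hero Island N.V.', ['www.casitabi.com', 'www.purecasino.com']),
    ('support@casitabi.com', 'Other Corp B.V.', ['www.simplecasinojp.com']),
]
for email, company, domains in scraped:
    records[email]['companies'].add(company)
    records[email]['domains'].update(domains)

# One row per email, each field in its own column, lists joined into strings.
df = pd.DataFrame(
    [{'email': e,
      'company names': ', '.join(sorted(v['companies'])),
      'domains': ', '.join(sorted(v['domains']))}
     for e, v in records.items()]
)
# df.to_excel('out.xlsx', index=False)
```

Sets also give you the deduplication the dictionary was providing, without overwriting earlier finds for the same email.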
from bs4 import BeautifulSoup
import pandas as pd
import requests
import re
from datetime import datetime
def details_pag_antillephone(URL, restrans, detalii):
    # restrans: class of the box containing the text; detalii: class of the rows
    page = requests.get(URL, timeout=10)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find('div', class_=restrans)
    detalii = results.find_all('div', class_=detalii)
    return detalii, soup
dataDic = {}
df = pd.read_excel(r'C:\Users\Adrian\Desktop\40k_links.xlsx')
links = df['Unnamed: 7'].tolist()
start_time = datetime.now()
for i in links[10000:15000]:  # run in 5000-link batches, takes too long otherwise
    if isinstance(i, str):  # skip NaN/float values coming from empty Excel cells
        URL = 'http://' + i  # prepend the scheme to get a usable link
        try:  # some pages have no email, and the index lookup below breaks otherwise. Workaround?
            validator = details_pag_antillephone(URL, 'col-md-9', 'flex-row')  # col-md-9 is the box containing the text, flex-row holds company, name, address
            company = validator[0][0].text.strip().split('\n')[1]  # always the 1st after split
            address = validator[0][2].text.strip().split('\n')[1]
            email = re.findall(r'[\w.-]+@[\w.-]+', validator[1].text)[0]
            websites = re.findall(r'www\.[\w.-]+', validator[1].text)
            dataDic[email] = [company, address, websites]
        except (requests.RequestException, AttributeError, IndexError):
            continue
end_time = datetime.now()
print('Duration: {}'.format(end_time - start_time))
df = pd.DataFrame(list(dataDic.items()), columns=['Email', 'Detalii'])  # key is the email, value is the [company, address, websites] list
print(df)
with pd.ExcelWriter(r'C:\Users\Adrian\Desktop\All_test_1.xlsx') as writer:
    df.to_excel(writer)
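On the speed question: the bottleneck is almost certainly the sequential network requests, not the parsing. The usual fix is to reuse one `requests.Session` (so TCP connections are pooled) and fetch pages concurrently. A hedged sketch, where `fetch`, the worker count, and the sample URLs are my assumptions - the downloaded pages would then go through the same parsing as above:

```python
import concurrent.futures
import requests

session = requests.Session()  # reuses connections across requests

def fetch(url):
    # Return (url, page bytes), or (url, None) on any request failure.
    try:
        return url, session.get(url, timeout=10).content
    except requests.RequestException:
        return url, None

urls = ['http://example.com', 'http://example.org']  # stand-ins for the 40k links
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as pool:
    pages = dict(pool.map(fetch, urls))  # url -> raw HTML (or None)
```

With I/O-bound work like this, 20 threads can cut wall-clock time by an order of magnitude; tune `max_workers` to what the target server tolerates.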
Example of what I'm currently working with:
<div class="col-md-9">
<div class="box">
<h3>Antillephone License Validation</h3>
<div class="details subheader">This page was generated on Mar 5, 2020</div>
<div class="separator"></div>
<div class="flex-row">
<div>Company name</div>
<div>Hero Island N.V. </div>
</div>
<div class="separator"></div>
<div class="flex-row">
<div>Trade name</div>
<div>Hero Island N.V. </div>
</div>
<div class="separator"></div>
<div class="flex-row">
<div>Address</div>
<div>Curacao, CW</div>
</div>
<div class="separator"></div>
<div class="flex-row">
<div>E-mail</div>
<div><script>var _0x2a84=['support@casitabi.com'];(function(_0x2ed5df,_0x4a695a){var _0x2f05f3=function(_0x44dc82){while(--_0x44dc82){_0x2ed5df['push'](_0x2ed5df['shift']());}};_0x2f05f3(++_0x4a695a);}(_0x2a84,0x133));var _0x42a8=function(_0x4179b2,_0x957766){_0x4179b2=_0x4179b2-0x0;var _0x2f7fe3=_0x2a84[_0x4179b2];return _0x2f7fe3;};document['write'](_0x42a8('0x0'));</script>support@casitabi.com</div>
</div>
<div class="separator"></div>
<div class="flex-row">
<div>Registered Website(s)</div>
<div>
<a href="//www.simplecasinojp.com"><span>www.simplecasinojp.com</span><br></a>
<a href="//www.casitabi.com"><span>www.casitabi.com</span><br></a>
<a href="//www.purecasino.com"><span>www.purecasino.com</span><br></a>
</div>
</div>
<div class="separator" style="margin-bottom: 18px;"></div>