Step 1. Set up a virtual environment
- Create a new directory under the C drive and set up a virtual environment in it.
$ mkdir crawling && cd crawling
- Install the required package.
$ pip install beautifulsoup4
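To confirm that the install worked, a quick check along these lines can be run from the same environment. It is only a sketch: the one-line HTML string is made up for the check, and it is parsed with Python's built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Parse a tiny, made-up HTML snippet to confirm that bs4 imports and works.
soup = BeautifulSoup("<p>hello</p>", "html.parser")
text = soup.p.get_text()
print(text)
```

If the import fails, the package was most likely installed into a different Python environment than the one running the script.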
Step 2. Crawling Practice 1
Create an HTML file, index.html.
<html lang="en">
<head>
<meta charset="UTF-8">
<title>test</title>
</head>
<body>
<h1>aaaaaaaa</h1>
<h2>dddd</h2>
<div class="chapter01">
<p>Don't Crawl here </p>
</div>
<div class="chapter02">
<p>Just Crawling here</p>
</div>
<div id="main">
<p> Crawling .................. </p>
</div>
</body>
</html>
Create a Python file, main.py, that crawls text from index.html.
from bs4 import BeautifulSoup

def main():
    # Convert index.html to a BeautifulSoup object
    with open("index.html", encoding="UTF-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    print(type(soup))
    # First <p> element in the document
    print(soup.find("p"))
    print("----------------")
    # Every <p> element, returned as a list
    print(soup.find_all("p"))
    print("----------------")
    # First <div> whose class is "chapter02"
    print(soup.find("div", class_="chapter02"))
    print("----------------")
    # Text of the first <p> inside the <div> whose id is "main"
    print(soup.find("div", id="main").find("p").get_text())

if __name__ == "__main__":
    main()
Run main.py and check the printed result.
$ python main.py
<class 'bs4.BeautifulSoup'>
<p>Don't Crawl here </p>
----------------
[<p>Don't Crawl here </p>, <p>Just Crawling here</p>, <p> Crawling .................. </p>]
----------------
<div class="chapter02">
<p>Just Crawling here</p>
</div>
----------------
 Crawling .................. 
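One thing to keep in mind with a chained call such as soup.find("div", id="main").find("p").get_text(): find() returns None when nothing matches, so the chain raises AttributeError on a page that lacks the element. A minimal sketch of a guarded lookup follows. The helper name first_paragraph_text and the inline HTML string are hypothetical and used only for illustration.

```python
from bs4 import BeautifulSoup

# Inline stand-in for part of index.html (an assumption, so the sketch runs on its own).
HTML = """
<div class="chapter02"><p>Just Crawling here</p></div>
<div id="main"><p> Crawling .................. </p></div>
"""

def first_paragraph_text(soup, **attrs):
    """Return the stripped text of the first <p> in the first matching <div>, or None."""
    div = soup.find("div", **attrs)
    if div is None:
        return None
    p = div.find("p")
    if p is None:
        return None
    return p.get_text(strip=True)

soup = BeautifulSoup(HTML, "html.parser")
print(first_paragraph_text(soup, id="main"))
print(first_paragraph_text(soup, class_="chapter02"))
print(first_paragraph_text(soup, id="missing"))  # no such <div>, so None
```

Returning None instead of raising keeps the calling code simple when crawling pages whose layout is not guaranteed.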
Step 3. BeautifulSoup4 Quick Start
URL: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start
index2.html
<!DOCTYPE html>
<html>
<head>
<title>The Dormouse's story</title>
</head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>
temp1.py
from bs4 import BeautifulSoup

# Parse index2.html and print it re-indented, one tag or string per line
with open("index2.html") as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup.prettify())
$ python temp1.py
<!DOCTYPE html>
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
temp2.py
from bs4 import BeautifulSoup

with open("index2.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

# The <title> tag, its tag name, its text, and its parent's tag name
print(soup.title)
print("----------------")
print(soup.title.name)
print("----------------")
print(soup.title.string)
print("----------------")
print(soup.title.parent.name)
print("----------------")
# The first <p> tag and the value of its class attribute
print(soup.p)
print("----------------")
print(soup.p['class'])
print("----------------")
# The first <a> tag, every <a> tag, and the tag whose id is "link3"
print(soup.a)
print("----------------")
print(soup.find_all('a'))
print("----------------")
print(soup.find(id="link3"))
$ python temp2.py
<title>The Dormouse's story</title>
----------------
title
----------------
The Dormouse's story
----------------
head
----------------
<p class="title"><b>The Dormouse's story</b></p>
----------------
['title']
----------------
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
----------------
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
----------------
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
temp3.py
from bs4 import BeautifulSoup

with open("index2.html") as f:
    soup = BeautifulSoup(f, 'html.parser')

# Print the href attribute of every <a> tag
for link in soup.find_all('a'):
    print(link.get('href'))

# Print all the text in the document with the tags removed
print(soup.get_text())
$ python temp3.py
http://example.com/elsie
http://example.com/lacie
http://example.com/tillie
The Dormouse's story
The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacieand
Tillie;
and they lived at the bottom of a well.
...
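The raw get_text() output keeps the line breaks of the original markup, which is why the names land on separate lines. As a sketch of one way to tidy this up, the text can be collapsed onto a single line with str.split() and str.join(), and each link's text can be paired with its href. The HTML is inlined here as a shortened copy of index2.html so that the snippet runs on its own, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of index2.html (an assumption, to keep the sketch self-contained).
HTML = """
<p class="story">
Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.
</p>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Map each link's visible text to its href attribute.
links = {a.get_text(): a.get("href") for a in soup.find_all("a")}
print(links)

# Collapse every run of whitespace (including newlines) into a single space.
text = " ".join(soup.get_text().split())
print(text)
```

Note that "Lacieand" stays fused in the result, because the markup itself has no space between the closing tag and "and".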