Step 1. Set up a virtual environment
- Create a new directory on the C drive and set up a virtual environment in it.
 $ mkdir crawling && cd crawling
- Install the required package.
 $ pip install beautifulsoup4
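A quick way to confirm the install worked is to import the package and parse a tiny document. This is only a minimal sanity check; the `<p>ok</p>` snippet is made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Parse a one-line document with Python's built-in HTML parser.
soup = BeautifulSoup("<p>ok</p>", "html.parser")

# Show which version of the package was installed.
print(bs4.__version__)

# If parsing works, the text of the first <p> tag is printed.
print(soup.p.get_text())  # → ok
```

If the import fails with ModuleNotFoundError, the package was most likely installed outside the active virtual environment.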
Step 2. Crawling Practice 1
- Create an HTML file, index.html.
 <html lang="en">
 <head>
 <meta charset="UTF-8">
 <title>test</title>
 </head>
 <body>
 <h1>aaaaaaaa</h1>
 <h2>dddd</h2>
 <div class="chapter01">
 <p>Don't crawl here!</p>
 </div>
 <div class="chapter02">
 <p>Just Crawl here!</p>
 </div>
 <div id="main">
 <p> Crawling .................. </p>
 </div>
 </body>
 </html>
- Create a Python file, main.py, that extracts text from index.html.
 from bs4 import BeautifulSoup


 def main():
     # Convert index.html to a BeautifulSoup object
     with open("index.html", encoding="UTF-8") as fp:
         soup = BeautifulSoup(fp, "html.parser")

     print(type(soup))
     # First <p> tag in the document
     print(soup.find("p"))
     print("----------------")
     # Every <p> tag, returned as a list
     print(soup.find_all("p"))
     print("----------------")
     # First <div> whose class is "chapter02"
     print(soup.find("div", class_="chapter02"))
     print("----------------")
     # Text of the <p> tag inside the <div> whose id is "main"
     print(soup.find("div", id="main").find("p").get_text())


 if __name__ == "__main__":
     main()
- Run main.py and check the printed result.
 $ python main.py
 <class 'bs4.BeautifulSoup'>
 <p>Don't crawl here!</p>
 ----------------
 [<p>Don't crawl here!</p>, <p>Just Crawl here!</p>, <p> Crawling .................. </p>]
 ----------------
 <div class="chapter02">
 <p>Just Crawl here!</p>
 </div>
 ----------------
  Crawling .................. 
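One thing to watch in main.py is that find() returns None when nothing matches, so a chained call such as find("div", id="main").find("p") raises AttributeError if the first lookup fails. The sketch below shows one way to guard against that. The helper name text_of and the inline HTML string (a trimmed-down document modeled on index.html) are assumptions made for illustration, not part of the original files.

```python
from bs4 import BeautifulSoup

# Inline document modeled on index.html, so the sketch runs on its own.
HTML = """
<div class="chapter01"><p>Don't crawl here!</p></div>
<div class="chapter02"><p>Just Crawl here!</p></div>
<div id="main"><p> Crawling .................. </p></div>
"""


def text_of(soup, name, **attrs):
    """Return the stripped text of the first <p> inside the matched tag, or None."""
    container = soup.find(name, **attrs)
    if container is None:  # find() returns None when nothing matches
        return None
    paragraph = container.find("p")
    return paragraph.get_text(strip=True) if paragraph else None


soup = BeautifulSoup(HTML, "html.parser")
print(text_of(soup, "div", class_="chapter02"))  # → Just Crawl here!
print(text_of(soup, "div", id="main"))           # → Crawling ..................
print(text_of(soup, "div", id="missing"))        # → None
print(len(soup.find_all("p")))                   # → 3
```

Unlike find(), find_all() never returns None; it returns an empty list when there is no match, so it is safe to loop over its result directly.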
Step 3. Quick Start BeautifulSoup4
- URL : https://www.crummy.com/software/BeautifulSoup/bs4/doc/#quick-start 
- index2.html
 <!DOCTYPE html>
 <html>
 <head>
 <title>The Dormouse's story</title>
 </head>
 <body>
 <p class="title"><b>The Dormouse's story</b></p>
 <p class="story">
 Once upon a time there were three little sisters; and their names were
 <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
 <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
 <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
 and they lived at the bottom of a well.
 </p>
 <p class="story">...</p>
 </body>
 </html>
- temp1.py
 from bs4 import BeautifulSoup

 # Parse index2.html and print it with one tag or string per line
 with open("index2.html") as fp:
     soup = BeautifulSoup(fp, 'html.parser')

 print(soup.prettify())
- Run temp1.py and check the output.
 $ python temp1.py
 <!DOCTYPE html>
 <html>
 <head>
 <title>
 The Dormouse's story
 </title>
 </head>
 <body>
 <p class="title">
 <b>
 The Dormouse's story
 </b>
 </p>
 <p class="story">
 Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">
 Elsie
 </a>
 ,
 <a class="sister" href="http://example.com/lacie" id="link2">
 Lacie
 </a>
 and
 <a class="sister" href="http://example.com/tillie" id="link3">
 Tillie
 </a>
 ;
 and they lived at the bottom of a well.
 </p>
 <p class="story">
 ...
 </p>
 </body>
 </html>
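prettify() returns an ordinary string in which every tag and every piece of text gets its own line, indented by one space per nesting level. The small sketch below compares it with str(soup), which keeps the markup compact; the one-line HTML snippet is invented for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>Hi</b></p>", "html.parser")

# str() keeps the markup as compact as it was parsed.
print(str(soup))  # → <p><b>Hi</b></p>

# prettify() puts each tag and string on its own line,
# indenting one space per nesting level.
pretty = soup.prettify()
print(pretty)
```

Because prettify() adds whitespace around text, it is best kept for inspecting a document rather than for producing HTML that will be parsed again.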
- temp2.py
 from bs4 import BeautifulSoup

 with open("index2.html") as fp:
     soup = BeautifulSoup(fp, 'html.parser')

 # The <title> tag, its tag name, its text, and its parent's tag name
 print(soup.title)
 print("----------------")
 print(soup.title.name)
 print("----------------")
 print(soup.title.string)
 print("----------------")
 print(soup.title.parent.name)
 print("----------------")
 # The first <p> tag and the value of its class attribute
 print(soup.p)
 print("----------------")
 print(soup.p['class'])
 print("----------------")
 # The first <a> tag, every <a> tag, and the tag whose id is "link3"
 print(soup.a)
 print("----------------")
 print(soup.find_all('a'))
 print("----------------")
 print(soup.find(id="link3"))
- Run temp2.py and check the output.
 $ python temp2.py
 <title>The Dormouse's story</title>
 ----------------
 title
 ----------------
 The Dormouse's story
 ----------------
 head
 ----------------
 <p class="title"><b>The Dormouse's story</b></p>
 ----------------
 ['title']
 ----------------
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
 ----------------
 [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
 ----------------
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
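The output above shows soup.p['class'] as a list, ['title'], rather than a plain string. That is because class is a multi-valued attribute in HTML. The following sketch contrasts it with a single-valued attribute and with a missing one; the HTML string is a made-up variation on index2.html.

```python
from bs4 import BeautifulSoup

HTML = '<p class="story intro"><a href="http://example.com/elsie" id="link1">Elsie</a></p>'
soup = BeautifulSoup(HTML, "html.parser")

# class can hold several values, so it comes back as a list.
print(soup.p["class"])      # → ['story', 'intro']

# Single-valued attributes such as id come back as plain strings.
print(soup.a["id"])         # → link1

# .get() returns None for a missing attribute, while tag["class"]
# would raise KeyError here because the <a> tag has no class.
print(soup.a.get("class"))  # → None

# .attrs exposes all attributes of a tag as a dictionary.
print(soup.a.attrs)         # → {'href': 'http://example.com/elsie', 'id': 'link1'}
```

Using .get() is the safer choice when an attribute might be missing from some of the tags.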
- temp3.py
 from bs4 import BeautifulSoup

 with open("index2.html") as fp:
     soup = BeautifulSoup(fp, 'html.parser')

 # Print the URL of every link in the document
 for link in soup.find_all('a'):
     print(link.get('href'))

 # Print all the text in the document with the tags removed
 print(soup.get_text())
- Run temp3.py and check the output.
 $ python temp3.py
 http://example.com/elsie
 http://example.com/lacie
 http://example.com/tillie
 The Dormouse's story
 The Dormouse's story
 Once upon a time there were three little sisters; and their names were
 Elsie,
 Lacieand
 Tillie;
 and they lived at the bottom of a well.
 ...
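In the output above, "Lacieand" runs together because get_text() joins text nodes exactly as they appear in the file. A possible refinement, sketched below, passes separator and strip to get_text() and resolves relative links with urljoin from the standard library. The HTML string and BASE_URL are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

HTML = """
<p class="story">names were
<a href="http://example.com/elsie" id="link1">Elsie</a>,
<a href="/lacie" id="link2">Lacie</a>and
<a id="link3">Tillie</a>;
</p>
"""
BASE_URL = "http://example.com/"  # hypothetical address of the page

soup = BeautifulSoup(HTML, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute,
# and urljoin turns relative links into absolute ones.
links = [urljoin(BASE_URL, a["href"]) for a in soup.find_all("a", href=True)]
print(links)  # → ['http://example.com/elsie', 'http://example.com/lacie']

# strip=True trims each text node and drops empty ones;
# separator=" " puts one space between the remaining nodes.
text = soup.get_text(separator=" ", strip=True)
print(text)  # → names were Elsie , Lacie and Tillie ;
```

The separator also ends up before punctuation ("Elsie ,"), so this trades one whitespace problem for a milder one; which form is preferable depends on how the text is used afterwards.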