My dear reader, how are you? Peace be upon you.
Very little is needed to make a happy life; it is all within yourself, in your way of thinking – Marcus Aurelius
This is the fourth part of the StackDuplica web application tutorial series. We will learn to scrape the web using Python, feed the extracted data into our Django backend database, and build a Django REST API so the data can be viewed on the frontend template.
A few useful links to follow the project hands-on:
- StackDuplica GitHub repository — DirectMe
- All other tutorials on StackDuplica — DirectMe
- Set your local repo to the StackDuplica Part 3 state using the following command:
git fetch origin 7a5b8217d6f7fd8238cfa4ecd1638e53e2552ca9
We scrape the first 3 pages of StackOverflow questions tagged with both python and pandas. For each question, we extract the title, the associated tags, the view count and the vote count. Once the data is scraped, we save it into our Django models and finally serve it through an API following REST principles. We will use the Django REST framework to develop the API.
Scraping StackOverflow using BeautifulSoup
Let us start by writing a small program that scrapes the first three pages of StackOverflow questions. We use BeautifulSoup for the scraping. The following piece of code does it.
```python
import requests
from bs4 import BeautifulSoup
import json

end_page_num = 3
i = 1
while i <= end_page_num:
    res = requests.get(
        "https://stackoverflow.com/questions/tagged/python%2bpandas"
        "?tab=newest&page={}&pagesize=50".format(i)
    )
    soup = BeautifulSoup(res.text, "html.parser")
    questions_data = {"questions": []}
    questions = soup.select(".question-summary")
    for que in questions:
        q = que.select_one('.question-hyperlink').getText()
        vote_count = que.select_one('.vote-count-post').getText()
        views = que.select_one('.views').attrs['title']
        tags = [tag.getText() for tag in que.select('.post-tag')]
        questions_data['questions'].append({
            "question": q,
            "views": views,
            "vote_count": vote_count,
            "tags": tags,
        })
    json_data = json.dumps(questions_data)
    print(json_data)
    i += 1
```
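Since the request URL is the only part that changes between iterations, the pagination logic can be sketched on its own. The snippet below makes no network calls; it only builds the three page URLs from the template used above:

```python
# URL template taken from the scraper above; only the page number varies.
BASE = ("https://stackoverflow.com/questions/tagged/python%2bpandas"
        "?tab=newest&page={}&pagesize=50")

# Build the URLs for pages 1 through 3.
urls = [BASE.format(page) for page in range(1, 4)]
```

Keeping the template separate like this makes it easy to change the tags or the page range later.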
Let us now use this scraping program inside Django to achieve our objective.
Creating a Model
```python
# open StackApp/qanda/models.py and add the following program
from django.db import models

class Scrapedquestion(models.Model):
    question = models.CharField(max_length=300)
    vote_count = models.IntegerField(default=0)
    views = models.CharField(max_length=50)
    tags = models.CharField(max_length=250)

    def __str__(self):
        return self.question
```
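Note that tags is a plain CharField even though the scraper collects a list of tags per question. One simple convention (an assumption on my part, not something the model enforces) is to join the list with a separator before saving and split it again when reading:

```python
# Hypothetical round-trip between the scraped tag list and the CharField.
tags = ["python", "pandas", "dataframe"]
stored = ", ".join(tags)       # value placed into Scrapedquestion.tags
restored = stored.split(", ")  # recover the list on the way back out
```

This breaks if a tag itself contains ", ", but StackOverflow tags do not, so it is good enough here.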
Register model on admin site
```python
# open StackApp/qanda/admin.py and add the following program
from django.contrib import admin

from .models import Scrapedquestion

admin.site.register(Scrapedquestion)
```
Adding a serializer for API
```python
# create StackApp/qanda/serializer.py and add the following program
from rest_framework import serializers

from .models import Scrapedquestion

class QuestionSerializer(serializers.ModelSerializer):
    class Meta:
        model = Scrapedquestion
        fields = '__all__'
```
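With fields set to '__all__', the ModelSerializer exposes every model field. For one Scrapedquestion row, the serialized payload would look roughly like the dict below (the values are invented for illustration; only the shape matters):

```python
import json

# Illustrative shape of one serialized Scrapedquestion; values are made up.
record = {
    "id": 1,
    "question": "How do I merge two DataFrames?",
    "vote_count": 4,
    "views": "120 views",
    "tags": "python, pandas",
}
payload = json.dumps(record)
```

This is the JSON that the API will eventually hand to the frontend for each question.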
Defining Views for the API and the Scraper
```python
# open StackApp/qanda/views.py and add the following program
from django.http import HttpResponse
from rest_framework import viewsets
from bs4 import BeautifulSoup
import requests

from .models import Scrapedquestion
from .serializer import QuestionSerializer

def index(request):
    return HttpResponse("Success")

class QuestionAPI(viewsets.ModelViewSet):
    queryset = Scrapedquestion.objects.all()
    serializer_class = QuestionSerializer

def latest(request):
    try:
        end_page_num = 3
        i = 1
        while i <= end_page_num:
            res = requests.get(
                "https://stackoverflow.com/questions/tagged/python%2bpandas"
                "?tab=newest&page={}&pagesize=50".format(i)
            )
            soup = BeautifulSoup(res.text, "html.parser")
            questions = soup.select(".question-summary")
            for que in questions:
                question = Scrapedquestion()
                question.question = que.select_one('.question-hyperlink').getText()
                question.vote_count = que.select_one('.vote-count-post').getText()
                question.views = que.select_one('.views').attrs['title']
                # the tags come back as a list; join them to fit the CharField
                question.tags = ", ".join(
                    tag.getText() for tag in que.select('.post-tag')
                )
                question.save()
            i += 1
        return HttpResponse("Latest Data Fetched from Stack Overflow")
    except Exception as e:
        return HttpResponse(f"Failed {e}")
```
Add the routes
```python
# open StackApp/qanda/urls.py and add the following program
from django.urls import path, include
from rest_framework import routers

from . import views
from .views import index, QuestionAPI, latest

router = routers.DefaultRouter()
router.register("questions", QuestionAPI)

app_name = 'qanda'

urlpatterns = [
    path('ask', views.AskQuestionView.as_view(), name='ask'),
    path('question/<int:pk>', views.QuestionDetailView.as_view(), name='question_detail'),
    path('question/<int:pk>/answer', views.CreateAnswerView.as_view(), name='answer_question'),
    path('question/<int:pk>/accept', views.UpdateAnswerAcceptanceView.as_view(), name='update_answer_acceptance'),
    path('daily/<int:year>/<int:month>/<int:day>/', views.DailyQuestionList.as_view(), name='daily_questions'),
    path('', views.TodaysQuestionList.as_view(), name='index'),
    path('success', index, name="index"),
    path('', include(router.urls)),
    path('scrap', latest, name="latest"),
]
```
If you run the development server at this point, you should be able to see the added functionality.
Trigger the scraper first by visiting the following URL: http://127.0.0.1:8000/scrap. If you see the message “Latest Data Fetched from Stack Overflow”, the questions have been successfully loaded into our models.
Finally, to see the browsable REST API in action, visit: http://127.0.0.1:8000/questions
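The list endpoint returns JSON, so it can also be consumed programmatically. The sketch below parses a hypothetical response body (the values are invented; a real run would fetch http://127.0.0.1:8000/questions first and use its actual body):

```python
import json

# Hypothetical body of GET http://127.0.0.1:8000/questions; values made up.
body = '''[
  {"id": 1, "question": "How do I merge two DataFrames?",
   "vote_count": 4, "views": "120 views", "tags": "python, pandas"}
]'''

questions = json.loads(body)
first_title = questions[0]["question"]
```

From here a frontend template, a notebook, or another service can work with the scraped questions like any ordinary list of dicts.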
I hope you find this tutorial useful. If you find any errors or feel any need for improvement, let me know in your comments below.
Signing off for today. Stay tuned and I will see you next week! Happy learning.