My dear reader, how are you? Peace be upon you.

Very little is needed to make a happy life; it is all within yourself, in your way of thinking – Marcus Aurelius

This is the fourth part of the StackDuplica web application tutorial series. We will learn to scrape the web using Python, feed the extracted data into our Django backend database, and build a Django REST API to allow viewing it on the frontend template.


A few useful links to follow the project practically:

  1. StackDuplica GitHub repository — DirectMe
  2. All other tutorials on StackDuplica — DirectMe
  3. Set your local repository to StackDuplica Part 3 using the following command:
git fetch origin 7a5b8217d6f7fd8238cfa4ecd1638e53e2552ca9

We scrape the first 3 pages of StackOverflow questions tagged with both python and pandas. For each question, we will extract the title, the associated tags, the number of views, and the vote count. Once scraped, we will save the data into our Django models and finally serve it through an API following REST principles. We will use Django REST framework to develop the API.

SCRAPE StackOverflow using BeautifulSoup

Let us start by first writing a small program that scrapes the first three pages of StackOverflow questions. We use BeautifulSoup for scraping. The following piece of code does it.

import requests
from bs4 import BeautifulSoup
import json

end_page_num = 3
i = 1
while i <= end_page_num:

    res = requests.get("https://stackoverflow.com/questions/tagged/python%2bpandas?tab=newest&page={}&pagesize=50".format(i))
    soup = BeautifulSoup(res.text, "html.parser")

    questions_data = {
        "questions": []
    }

    questions = soup.select(".question-summary")

    for que in questions:
        q = que.select_one('.question-hyperlink').getText()
        vote_count = que.select_one('.vote-count-post').getText()
        views = que.select_one('.views').attrs['title']
        tags = [tag.getText() for tag in que.select('.post-tag')]
        questions_data['questions'].append({
            "question": q,
            "views": views,
            "vote_count": vote_count,
            "tags": tags
        })

    json_data = json.dumps(questions_data)

    print(json_data)

    i += 1
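For reference, each page's output is a JSON object holding a `questions` list. Here is a minimal sketch of that shape with made-up sample values (the real titles, counts, and tags come from StackOverflow):

```python
import json

# Hypothetical sample mirroring the structure the scraper prints per page
questions_data = {
    "questions": [
        {
            "question": "How do I merge two DataFrames in pandas?",
            "views": "1,234 views",
            "vote_count": "5",
            "tags": ["python", "pandas"],
        }
    ]
}

json_data = json.dumps(questions_data)
print(json_data)
```

Note that `views` and `vote_count` arrive as strings, since they are scraped straight from the page text.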

Let us now use this scraping program inside Django to achieve our objective.

Creating a Model

# open StackApp/qanda/models.py and add the following program

from django.db import models

class Scrapedquestion(models.Model):
    question = models.CharField(max_length=300)
    vote_count = models.IntegerField(default=0)
    views = models.CharField(max_length=50)
    tags = models.CharField(max_length=250)

    def __str__(self):
        return self.question
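After adding the model, you would typically generate and apply the migration before using it (run from the StackApp project root; this step is assumed, not shown in the original):

```shell
python manage.py makemigrations qanda
python manage.py migrate
```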

Register model on admin site

# create StackApp/qanda/admin.py and add the following program

from django.contrib import admin

from .models import Scrapedquestion
admin.site.register(Scrapedquestion)

Adding a serializer for API

# create StackApp/qanda/serializer.py and add the following program

from rest_framework import serializers
from .models import Scrapedquestion

class ScrapedquestionSerializer(serializers.ModelSerializer):
    class Meta:
        model = Scrapedquestion
        fields = '__all__'
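With `'__all__'`, the serializer exposes every model field plus the auto-generated `id`. A rough sketch of what one serialized question would look like (hypothetical values; the real output is produced by Django REST framework):

```python
# Hypothetical serialized representation of one Scrapedquestion record
serialized = {
    "id": 1,
    "question": "How do I merge two DataFrames in pandas?",
    "vote_count": 5,
    "views": "1,234 views",
    "tags": "['python', 'pandas']",  # tags is a CharField, so the list is stored as one string
}
```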

Defining Views for The API and SCRAPER

# open StackApp/qanda/views.py and add the following program

from django.http import HttpResponse
from rest_framework import viewsets
from .models import Scrapedquestion
from .serializer import ScrapedquestionSerializer
from bs4 import BeautifulSoup

import requests

def index(request):
    return HttpResponse("Success")

class QuestionAPI(viewsets.ModelViewSet):
    queryset = Scrapedquestion.objects.all()
    serializer_class = ScrapedquestionSerializer

def latest(request):
    try:
        end_page_num = 3
        i = 1
        while i <= end_page_num:
            res = requests.get("https://stackoverflow.com/questions/tagged/python%2bpandas?tab=newest&page={}&pagesize=50".format(i))
            soup = BeautifulSoup(res.text, "html.parser")
            questions = soup.select(".question-summary")
            for que in questions:
                q = que.select_one('.question-hyperlink').getText()
                vote_count = que.select_one('.vote-count-post').getText()
                views = que.select_one('.views').attrs['title']
                tags = [tag.getText() for tag in que.select('.post-tag')]
                question = Scrapedquestion()
                question.question = q
                question.vote_count = vote_count
                question.views = views
                question.tags = tags
                question.save()
            i += 1
        return HttpResponse("Latest Data Fetched from Stack Overflow")
    except Exception as e:
        return HttpResponse(f"Failed {e}")
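One caveat in the view above: `vote_count` is scraped as a string and `views` as text like "1,234 views", while the model declares `vote_count` as an `IntegerField`. A small helper (not in the original code, added here as a suggestion) that normalizes such values before saving:

```python
def clean_count(text):
    """Extract the integer from strings like '5' or '1,234 views'.

    Hypothetical helper; the original tutorial saves the raw scraped strings.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else 0

print(clean_count("5"))            # 5
print(clean_count("1,234 views"))  # 1234
```

You could then write `question.vote_count = clean_count(vote_count)` so the value stored always matches the field type.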

Add the routes

# open StackApp/qanda/urls.py and add the following program

from django.urls import path, include
from . import views
from .views import index, QuestionAPI, latest
from rest_framework import routers

router = routers.DefaultRouter()
router.register("questions", QuestionAPI)

app_name = 'qanda'

urlpatterns = [
    path('ask', views.AskQuestionView.as_view(), name='ask'),
    path('question/<int:pk>', views.QuestionDetailView.as_view(), name='question_detail'),
    path('question/<int:pk>/answer', views.CreateAnswerView.as_view(), name='answer_question'),
    path('question/<int:pk>/accept', views.UpdateAnswerAcceptanceView.as_view(), name='update_answer_acceptance'),
    path('daily/<int:year>/<int:month>/<int:day>/', views.DailyQuestionList.as_view(), name='daily_questions'),
    path('', views.TodaysQuestionList.as_view(), name='index'),
    path('success', index, name="index"),
    path('', include(router.urls)),
    path('scrap', latest, name="latest"),
]

If you run the development server at this point, you should be able to see the added functionality.

Scrape first by typing the following URL: http://127.0.0.1:8000/scrap. If you see the message “Latest Data Fetched from Stack Overflow”, it means that the questions were successfully loaded into our models.

Finally, if you want to see the browsable REST API in action, visit: http://127.0.0.1:8000/questions


I hope you find this tutorial useful. If you find any errors or feel any need for improvement, let me know in your comments below.

Signing off for today. Stay tuned and I will see you next week! Happy learning.
