Web Scraping Using PHP – Parse IMDB.com Movies HTML



Upgrade your Clever Techie learning experience:

UPDATE! (9/13/19) New features and improvements for Clever Techie Patreons:

1. Download full source code with detailed comments – code that is easy to learn and understand
2. Weekly source code file updates by Clever Techie – every time I learn something new about a topic, I will add it to the source file and let you know about the update, so you can keep up with the latest coding technologies
3. Library of custom Clever Techie functions with descriptive, easy-to-understand comments – code more efficiently with a library of custom reusable functions
4. Syntax code summary – memorize and review previously learned code faster
5. Organized file structure – access all Clever Techie lessons, source code, graphics, diagrams and cheat sheets from a single workspace, with no more searching around for previously covered material and source code
6. Outline of topics the source file covers – fast review of all previously learned coding lessons
7. Access to all full HD 1080p videos with no ads
8. Console input examples – interactive examples that make it easier to understand and learn coding
9. Access to the updated PHP Programming Book by Clever Techie
10. Early access to Clever Techie videos



Using PHP and regular expressions, we’re going to parse the movie content of IMDB.com and save all the data in a single array. Web scraping with regex can be very powerful, and this video demonstrates it. We account for empty elements by first matching groups of HTML blocks, then looping through the matched blocks and matching single elements within each block, if they are present. This technique can be used on just about any website to parse out its content.
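
The block-then-element approach described above might look roughly like this. It is a minimal sketch, not the video’s code: the function names are hypothetical, the sample markup is invented (only the `lister-item mode-advanced` class comes from a reader comment), and real pages will need their own patterns.

```php
<?php
// Sketch only: the markup and patterns are illustrative assumptions,
// not IMDb's actual HTML and not the tutorial's original source.

// Hypothetical helper: return the first capture group, or null if the
// element is missing from this block.
function first_match(string $pattern, string $block): ?string
{
    return preg_match($pattern, $block, $m) ? trim($m[1]) : null;
}

// Hypothetical helper: split the page into per-movie blocks, then match
// single elements inside each block so a missing field stays with its movie.
function parse_movie_blocks(string $html): array
{
    $movies = [];

    // Step 1: capture each movie block as a whole. The lookahead ends a block
    // at the next block's opening tag or at the end of the input, so the last
    // block on the page is captured as well.
    preg_match_all(
        '!<div class="lister-item mode-advanced">(.*?)(?=<div class="lister-item mode-advanced">|\z)!s',
        $html,
        $blocks
    );

    // Step 2: loop through the blocks and match single elements in each one.
    foreach ($blocks[1] as $block) {
        $movies[] = [
            'title' => first_match('!<a href="/title/[^"]*">(.*?)</a>!s', $block),
            'votes' => first_match('!<span name="nv">(.*?)</span>!s', $block),
        ];
    }

    return $movies;
}

// Made-up sample: the second movie has no votes element.
$sample = '<div class="lister-item mode-advanced">'
        . '<a href="/title/tt0000001/">First Movie</a><span name="nv">1,234</span></div>'
        . '<div class="lister-item mode-advanced">'
        . '<a href="/title/tt0000002/">Second Movie</a></div>';

print_r(parse_movie_blocks($sample));
```

Because each field is matched inside its own block, a movie without votes gets `null` for that field instead of shifting every later value up by one, which is what tends to happen when each field is collected with a single page-wide `preg_match_all`.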

Hey guys, I’m now using Patreon to share improved and updated video lesson material. For a small fee you can access all the downloadable files from this lesson (source code, icons & graphics, cheat sheets) and everything else included in the video from the Patreon page. Additionally, you will get access to ALL Clever Techie videos in HD format with no ads. Thank you so much for supporting Clever Techie 🙂


This download (Patreon unlock) includes PHP regex function source code, PHP regex screenshots, and a PHP regex cheat sheet.
You also get access to ALL source code and any downloadable content of ALL Clever Techie videos, as well as access to ALL videos in HD 1080p quality format with all video ads removed!

In this web scraping tutorial we’re going to be using regular expressions to parse HTML. This is a more advanced tutorial, so you may want to check out my video on regular expressions before going through it. We’re going to be parsing the IMDb website, the Internet Movie Database, and I’m going to be using www.regex101.com to test regular expressions against strings to make sure we’re matching them correctly. Because this is an advanced tutorial, I’ll be posting each portion of code and explaining how it works as we walk through it. Directly below is the full source code, but skip down further and I’ll walk through each portion of the code.
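
As a rough illustration of that workflow (not the tutorial’s original source), the sketch below checks a pattern against a small sample string locally, much like pasting both into regex101, before pointing it at a live page. `fetch_html()` and `capture_all()` are hypothetical helper names, the sample markup is made up, and the pattern is modeled on the title pattern quoted in the comments, with backslashes added on the assumption that they were lost in formatting.

```php
<?php
// Sketch only: helper names, sample markup and pattern are assumptions.

// Hypothetical helper: download a page as a string using cURL.
function fetch_html(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects (e.g. http to https)
        CURLOPT_ENCODING       => '',   // accept compressed responses and decode them
        CURLOPT_TIMEOUT        => 30,   // give up after 30 seconds
    ]);

    $html = curl_exec($ch);
    if ($html === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    return $html;
}

// Hypothetical helper: return every value captured by the first group.
function capture_all(string $pattern, string $subject): array
{
    $count = preg_match_all($pattern, $subject, $m);
    if ($count === false) {
        throw new InvalidArgumentException('Pattern failed, error code ' . preg_last_error());
    }

    return $m[1] ?? [];
}

// "!" is used as the delimiter so the slashes in the HTML need no escaping.
// \? matches a literal question mark and \n a literal newline.
$pattern = '!<a href="/title/.*?/\?ref_=adv_li_tt"\n>(.*?)</a>!';

// Made-up sample in the shape the pattern expects.
$sample = "<a href=\"/title/tt0000001/?ref_=adv_li_tt\"\n>First Movie</a>";

print_r(capture_all($pattern, $sample));

// Once the pattern behaves on the sample, run it on a real page, where $url
// is whichever listing page you want to scrape:
// print_r(capture_all($pattern, fetch_html($url)));
```

Keeping the download and the matching in separate functions makes it easy to debug a pattern against saved HTML without requesting the site again on every run.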



45 thoughts on “Web Scraping Using PHP – Parse IMDB.com Movies HTML”

  1. Hi man, how do I parse data from different pages? For example, I want to parse content from the main page (date of concert, heading and image) and from the inner page (the page we land on after clicking the heading) with the preamble.

  2. Sir, I need help. Does your code work for scraping websites that block scrapers? Please contact me ASAP. Thanks and regards, Bilal Khan

  3. Hi, I liked your previous video, though I didn't quite understand how you get the movies from IMDb. I am working on a website for movies and TV series, so is there any way you can show me how to also add TV series files from IMDb and how it works? For example, do I need to register with IMDb?

  4. Found a couple of issues. First, the URL needs to be https instead of http. Second, the pagination has changed: it now gives 50 results per page, and the second page starts at 51. I made a variable pageblock and incremented it in the for loop by 50 (so 1, 51, 101). Not quite perfect, but the strange thing is that the array is reversed: page 3 is 0-49, page 2 is 50-100 and page 1 is 100 to 150. I expected it to fill in the order page 1, 2, 3, but it is reversed.

  5. Thank you very much, but I am afraid you did it the wrong way. I think we have to use an HTML parsing library instead of regex, because regex can't parse element by element. For example, if a movie has no Gross or Votes and it happens to be the last one on the current page (i.e. there is no next <div class="lister-item mode-advanced">), how could we get it using regex? With an HTML parsing library we could walk movie by movie and parse all its info. Anyway, I enjoyed the video, thank you again 🙂

  6. Great video! Well, ScraperCrawler is simply doing both i.e. Crawl and Scrape thousands of websites in one click- http://www.scrapercrawler.com/

  7. I need an answer to my question.
    Question: which is better for making a movie website, HTML or WordPress?
    I have learned HTML, CSS and JavaScript.

  8. Could you explain what is the difference between starting with ! and starting with / when I use the regular expression functions of PHP?
    I ask because I used preg_match_all('/regular expression/',$result,$match) and it worked, except for the regular expression '/Directors?:n<a href="/name/.*?/?ref_=adv_li_dr_0"n>(.*?)</a>n/s', which I used for obtaining the names of directors.
    Thank you.

  9. I am not getting the required output. Can you tell me what is required to run this PHP file and get the desired output?

  10. You can't parse [X]HTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML……….

  11. Dear Sir,

    Do you ever scrape data from block JavaScript?
    Do you have a solution & simple code?
    Can you help me? I will wait for help from you.

  12. What are the benefits of using a linked stylesheet instead of inline styles?
    Code editors and IDEs have special highlighting for external styles that doesn't work for inline styles.
    External styles allow the same CSS to be applied to multiple pages. You can avoid repeating code across every page.
    Linked styles render more quickly than inline styles.
    Linked styles can use a greater range of CSS elements and selectors than inline styles.

    what is the answer please

  13. preg_match_all('!<a href="/title/.?/?ref_=adv_li_tt"n>(.?)</a>!',$result,$match);
    $movies['name'] = $match[1];

    It's not getting a value.
    Please help me.

  14. I have a problem with some pages.
    When I use:
    $ch = curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,True);
    $res = curl_exec($ch);
    echo $res;
    it shows a page like �w�Z�Ѹt����  d
    what should I do to solve this problem?

  15. I've never seen an attempt to parse XML using a regex that won't break on some content. The real trouble is nested tags. Nested tags are very difficult to handle with regular expressions. If you want to find a very specific pattern in a (ht|x)ml file, go on, regex is perfect for that. But if you are searching for something in every Foo tag, which could have attributes in different orders, could be nested, and could be malformed (and still valid), then use a parser, because that's not pattern matching anymore.

  16. Hey Clever Techie – where did you get the regular expression to match with regex101? How could I get that information from any website?

  17. Hey @Clever Techie, how are you making GET calls to the IMDb website, and how are you dealing with the CORS problem? Please reply. TIA

  18. When I view the page source I get a blank page, and I don't know why that happens. I get a blank page when I run parse_imdb.php, and whenever I call the function scrape_imdb(2000,2000,76,76); it loads endlessly.

  19. Great vid… What about websites with user authentication that load their data from React JS? I have the login portion, but I can't load the element data because it is delivered by React.

  20. Great tutorial,
    In the regular expression used to collect the titles, at 2:37, you have
    " '!<a href="/title/.*?/?ref_=adv_li_tt"n>(.*?)</a>!' " what is the reason for the first '?'
    That is,
     why do you start with '!<a href="/title/.*? …'
    and not just with '!<a href="/title/.* …'

  21. http://www.bengaluruairport.com/flightInformation/arrivals.jspx?_afrLoop=174885394513452&_afrWindowMode=0&_adf.ctrl-state=1540d4115o_194

    Can you help me fetch the values of this table into a database?

