
Jsoup Parsing Page Knowing Url

I'm facing what is, for me, a very big problem. I'm parsing this page, http://multiplayer.it/articoli/, which contains a list of articles. As you can see, there is some information I can parse

Solution 1:

jsoup only parses the HTML it is given; it does not execute JavaScript, so it cannot handle dynamic actions on a web page. You need a library that can run those scripts, such as HtmlUnit.

Let's say you have all the links stored in a Java Collection such as an ArrayList. Here is how I would process the first URL in a single method, which you can then call in a loop to fetch the contents of every URL on your page at runtime:

Using HtmlUnit

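import java.io.IOException;
import java.net.URL;
import java.util.List;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// The class name is arbitrary; the original answer showed only the method.
public class ArticleScraper {

    public static void main(String... args) throws FailingHttpStatusCodeException, IOException {
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);

        WebRequest request = new WebRequest(new URL("http://multiplayer.it/articoli/"));

        // Enable JavaScript, but do not abort on script errors or slow scripts.
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.setJavaScriptTimeout(10000);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setTimeout(10000);

        HtmlPage page = webClient.getPage(request);
        webClient.waitForBackgroundJavaScript(10000);

        System.out.println("Current page: Articoli videogiochi - Multiplayer.it");

        // Current page:
        // Title=Articoli videogiochi - Multiplayer.it
        // URL=http://multiplayer.it/articoli/

        // Find the first anchor whose text contains the article title.
        List<HtmlAnchor> anchors1 = page.getAnchors();
        HtmlAnchor link2 = null;
        for (HtmlAnchor anchor : anchors1) {
            if (anchor.asText().indexOf("Dead Rising 3: Operation Broken Eagle") > -1) {
                link2 = anchor;
                break;
            }
        }
        // Guard against a NullPointerException when no anchor matches.
        if (link2 == null) {
            throw new IllegalStateException("No anchor found for the requested article");
        }
        page = link2.click();

        System.out.println("Current page: Dead Rising 3: Operation Broken Eagle - Recensione - Xbox On...");

        // Current page:
        // Title=Dead Rising 3: Operation Broken Eagle - Recensione - Xbox On...
        // URL=http://multiplayer.it/recensioni/127745-dead-rising-3-operation-broken-eagle-una-delle-storie-di-los-perdidos.html

        webClient.waitForBackgroundJavaScript(10000);

        // Print the text of every <p> element on the article page.
        DomNodeList<DomElement> paras = page.getElementsByTagName("p");
        for (DomElement el : paras) {
            System.out.println(el.asText());
        }
    }
}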

The code above prints the text of every <p> element on the page it lands on after clicking the link.

In the code block above, you can loop over all the anchor tags on the web page; I pick one specific anchor and follow it to get the resulting content:

List<HtmlAnchor> anchors1 = page.getAnchors();
HtmlAnchor link2 = null;
for (HtmlAnchor anchor : anchors1) {
    if (anchor.asText().indexOf("Dead Rising 3: Operation Broken Eagle") > -1) {
        link2 = anchor;
        break;
    }
}
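
Matching one hard-coded title only gets you one article. As a rough sketch of a more general selection step, the helper below takes the raw href strings of the anchors (with HtmlUnit you could collect them via HtmlAnchor.getHrefAttribute()), resolves them against the page URL, and keeps only those whose path starts with a given prefix. The class name ArticleLinkFilter, the method selectLinks, and the idea of filtering on a path prefix such as /recensioni/ are assumptions made for illustration; it uses only the JDK so it can be tried without HtmlUnit.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of HtmlUnit or jsoup.
class ArticleLinkFilter {

    // Resolves each raw href against pageUrl and keeps the links whose path
    // starts with one of pathPrefixes. Order is preserved, duplicates dropped.
    static List<String> selectLinks(String pageUrl, List<String> hrefs, List<String> pathPrefixes) {
        URI base = URI.create(pageUrl);
        Set<String> selected = new LinkedHashSet<>();
        for (String href : hrefs) {
            if (href == null) {
                continue;
            }
            String trimmed = href.trim();
            // Skip empty links, in-page fragments and script pseudo-links.
            if (trimmed.isEmpty() || trimmed.startsWith("#") || trimmed.startsWith("javascript:")) {
                continue;
            }
            URI resolved;
            try {
                resolved = base.resolve(trimmed);
            } catch (IllegalArgumentException e) {
                continue; // malformed href, ignore it
            }
            String path = resolved.getPath();
            if (path == null) {
                continue; // opaque URI such as mailto:
            }
            for (String prefix : pathPrefixes) {
                if (path.startsWith(prefix)) {
                    selected.add(resolved.toString());
                    break;
                }
            }
        }
        return new ArrayList<>(selected);
    }
}
```

The returned list can then be stored in the ArrayList mentioned earlier and visited one URL at a time. Which prefixes are worth keeping depends on the site, so check them against the actual markup first.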

You will probably want to write your own logic to visit all the dynamic links on your page and display their contents.
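
One possible shape for that logic is a plain loop that applies the same "read one page" step to every URL and keeps going when a single page fails. In the sketch below, ArticleCrawler, readAll, and the pageReader parameter are hypothetical names: pageReader stands in for the HtmlUnit steps shown earlier (load the page, wait for background JavaScript, collect the text of the <p> elements), and any checked exception such as IOException would have to be wrapped in a RuntimeException inside it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of HtmlUnit or jsoup.
class ArticleCrawler {

    // Applies pageReader to every URL in order and collects the results.
    // A failure on one URL is recorded and does not stop the loop.
    static Map<String, String> readAll(List<String> urls, Function<String, String> pageReader) {
        Map<String, String> contents = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                contents.put(url, pageReader.apply(url));
            } catch (RuntimeException e) {
                contents.put(url, "ERROR: " + e.getMessage());
            }
        }
        return contents;
    }
}
```

Keeping the per-page step behind a Function means the loop itself can be tried with a stub before the real HtmlUnit reader is plugged in.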

EDIT:

You can also try generating these scripts with the htmlunitscripter Firefox plugin and then customize the result to your needs.
