Jsoup Parsing Page Knowing Url
Solution 1:
jsoup cannot handle dynamic actions on a web page, because it only parses the HTML it receives and does not execute JavaScript. You would need a library that can run these dynamic scripts, such as HtmlUnit.
Let's say you have all the links stored in a Java Collection instance such as an ArrayList. The example below handles the first URL in a single method; the same logic can be run in a loop to fetch the contents of every URL on your page at runtime:
Using HtmlUnit
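import java.io.IOException;
import java.net.URL;
import java.util.List;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.NicelyResynchronizingAjaxController;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.DomNodeList;
import com.gargoylesoftware.htmlunit.html.HtmlAnchor;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// The class name is chosen for illustration only
public class HtmlUnitExample {
    public static void main(String... args) throws FailingHttpStatusCodeException, IOException {
        final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_17);
        WebRequest request = new WebRequest(new URL("http://multiplayer.it/articoli/"));
        // Keep going on script errors and let JavaScript and AJAX calls finish
        webClient.getOptions().setThrowExceptionOnScriptError(false);
        webClient.setJavaScriptTimeout(10000);
        webClient.getOptions().setJavaScriptEnabled(true);
        webClient.setAjaxController(new NicelyResynchronizingAjaxController());
        webClient.getOptions().setTimeout(10000);
        HtmlPage page = webClient.getPage(request);
        webClient.waitForBackgroundJavaScript(10000);
        System.out.println("Current page: Articoli videogiochi - Multiplayer.it");
        // Current page:
        // Title=Articoli videogiochi - Multiplayer.it
        // URL=http://multiplayer.it/articoli/
        List<HtmlAnchor> anchors1 = page.getAnchors();
        HtmlAnchor link2 = null;
        for (HtmlAnchor anchor : anchors1) {
            if (anchor.asText().indexOf("Dead Rising 3: Operation Broken Eagle") > -1) {
                link2 = anchor;
                break;
            }
        }
        // Without this guard, link2.click() throws a NullPointerException when no anchor matches
        if (link2 == null) {
            System.out.println("Link not found");
            return;
        }
        page = link2.click();
        System.out.println("Current page: Dead Rising 3: Operation Broken Eagle - Recensione - Xbox On...");
        // Current page:
        // Title=Dead Rising 3: Operation Broken Eagle - Recensione - Xbox On...
        // URL=http://multiplayer.it/recensioni/127745-dead-rising-3-operation-broken-eagle-una-delle-storie-di-los-perdidos.html
        webClient.waitForBackgroundJavaScript(10000);
        // Print the text of every <p> element on the page that was opened
        DomNodeList<DomElement> paras = page.getElementsByTagName("p");
        for (DomElement el : paras) {
            System.out.println(el.asText());
        }
    }
}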
The code above prints the text of every <p> element available on the landing page. The original answer showed a screenshot of the output at this point.
The code block above loops over all the anchor tags on the web page and picks one specific link, whose target page is then loaded:
List<HtmlAnchor> anchors1 = page.getAnchors();
HtmlAnchor link2 = null;
for (HtmlAnchor anchor : anchors1) {
    if (anchor.asText().indexOf("Dead Rising 3: Operation Broken Eagle") > -1) {
        link2 = anchor;
        break;
    }
}
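The search loop above can also be factored into a small reusable helper. This is only a sketch: `AnchorSearch` and `findFirstContaining` are hypothetical names and not part of HtmlUnit. The helper works on any list, so plain strings stand in for anchors here and it can be tried without a browser. Returning an `Optional` instead of a possibly null reference makes the "no matching link" case explicit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper class, not part of HtmlUnit
class AnchorSearch {

    // Returns the first item whose extracted text contains the needle, if any.
    static <T> Optional<T> findFirstContaining(
            List<T> items, Function<? super T, String> textOf, String needle) {
        for (T item : items) {
            String text = textOf.apply(item);
            if (text != null && text.contains(needle)) {
                return Optional.of(item);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up link texts standing in for the values of anchor.asText()
        List<String> anchorTexts = Arrays.asList(
                "Home",
                "Dead Rising 3: Operation Broken Eagle - Recensione",
                "Contatti");
        Optional<String> hit = findFirstContaining(
                anchorTexts, s -> s, "Dead Rising 3: Operation Broken Eagle");
        System.out.println(hit.orElse("no matching link"));
    }
}
```

With HtmlUnit, the call would look roughly like `findFirstContaining(page.getAnchors(), HtmlAnchor::asText, "Dead Rising 3: Operation Broken Eagle")`, and the result should be checked with `isPresent()` before calling `click()` on it.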
You might want to write appropriate logic to visit all the dynamic links on your page and display their contents.
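One possible shape for that logic is sketched below, assuming the URLs are already available as a list. `LinkVisitor`, `collectParagraphs`, and the example URLs are hypothetical names made up for illustration. The page fetching itself is passed in as a function so the loop can be tried without a browser; the stub fetcher in `main` only simulates pages.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper class, not part of HtmlUnit
class LinkVisitor {

    // Visits each URL in order and stores the paragraphs returned by the fetcher.
    // A URL whose fetch throws is recorded with an empty list, so one bad page
    // does not stop the whole run.
    static Map<String, List<String>> collectParagraphs(
            List<String> urls, Function<String, List<String>> fetchParagraphs) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                result.put(url, new ArrayList<>(fetchParagraphs.apply(url)));
            } catch (RuntimeException e) {
                result.put(url, new ArrayList<>());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder URLs and a stub fetcher that simulates one failing page
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        Map<String, List<String>> pages = collectParagraphs(urls, url -> {
            if (url.endsWith("/b")) {
                throw new IllegalStateException("simulated failure");
            }
            return Arrays.asList("first paragraph", "second paragraph");
        });
        for (Map.Entry<String, List<String>> entry : pages.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A real fetcher would wrap `webClient.getPage(url)`, wait with `waitForBackgroundJavaScript`, and read `getElementsByTagName("p")` as in the code above. Because `getPage` throws the checked `IOException`, the fetcher would have to rethrow it as an unchecked exception for this loop to catch it.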
EDIT:
You can also try generating these scripts with the htmlunitscripter Firefox plugin and then customize them to your needs.