How to parse HTML in Java?

Question

I need to parse a web page and extract the path to an image from it. I can't figure out how to use the HTML Parser library's Parser class. I need an example, and I can't find one online.

9

English java html

Author: titov_andrei, 2011-03-21

6 answers

Try jsoup. I really liked it.

3

Author: glook, 2011-08-30 16:13:18

For simple cases, you can use the standard Java API. For example, you can fetch the page via HttpURLConnection and find the image path in it using regular expressions.
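
A minimal sketch of that idea, using only the standard library. The class name ImgSrcRegex, the pattern, and the UTF-8 decoding are assumptions made for illustration; the pattern handles only quoted src values and is not a general HTML parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgSrcRegex {
    // Matches <img ... src="..."> or src='...'.
    // Unquoted values and attributes such as data-src are not handled specially.
    private static final Pattern IMG_SRC = Pattern.compile(
            "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the src value of every matching <img> tag, in document order.
    static List<String> extractImageSources(String html) {
        List<String> sources = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            sources.add(m.group(2));
        }
        return sources;
    }

    // Downloads the page body; assumes the page is UTF-8 encoded (Java 9+ for readAllBytes).
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // With no argument, fall back to an inline sample instead of a network call.
        String html = args.length > 0
                ? fetch(args[0])
                : "<p><img alt='x' src=\"/images/bag.jpg\"></p>";
        extractImageSources(html).forEach(System.out::println);
    }
}
```

Pass a URL as the first argument to run it against a real page. For a page in another encoding (windows-1251, say), the charset in fetch would need to change accordingly.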

2

Author: yozh, 2011-03-21 08:17:45

jsoup, a Java HTML parser:

// Fetch the page and parse it into a DOM
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
// Select elements with a CSS selector
Elements newsHeadlines = doc.select("#mp-itn b a");

2

Author: rododendron, 2015-04-02 19:55:39

You can use the standard Java tools. Why use an additional lib to extract the path to the image?

If you need to do it once, you can use DOM and XPath. If you need to process a lot of large documents, it is better to use SAX. Once you have spent the time to learn these approaches, you will never again have trouble parsing not only HTML but any XML document.

Using regular expressions to parse HTML, on the other hand, is not good practice, simply because HTML syntax is flexible and much of it is optional.
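
A rough sketch of the DOM and XPath route mentioned above, using only the JDK. It assumes the input is well-formed XHTML without a namespace declaration; ordinary real-world HTML is often not well-formed, and the parser will reject it with an exception. The class name ImgSrcXPath and the sample markup are illustrative.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ImgSrcXPath {
    // Parses well-formed markup into a DOM tree and collects every img/@src value.
    static List<String> extractImageSources(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        // "//img/@src" selects the src attribute of every img element at any depth.
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList attrs = (NodeList) xpath.evaluate("//img/@src", doc, XPathConstants.NODESET);

        List<String> sources = new ArrayList<>();
        for (int i = 0; i < attrs.getLength(); i++) {
            sources.add(attrs.item(i).getNodeValue());
        }
        return sources;
    }

    public static void main(String[] args) throws Exception {
        // Inline sample document; a real page would be read from a stream instead.
        String page = "<html><body><img src=\"/images/bag.jpg\" alt=\"bag\"/></body></html>";
        extractImageSources(page).forEach(System.out::println);
    }
}
```

The XPath expression is the only part specific to images; changing it to, for example, "//a/@href" would collect link targets the same way.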

0

Author: Александр Князев, 2011-03-21 09:22:39

Look at this. It works on a fairly simple principle and supports invalid pages. It gives you a collection of objects that are mapped to tags, which is very convenient. If you have any questions, please ask.

0

Author: keebraa, 2011-04-01 07:41:37

score 11 · Accepted Answer

And what's so difficult?

The library has perfectly ordinary JavaDoc documentation, and even there you can find almost everything you need. For example:

Typical usage of the parser is:

Parser parser = new Parser ("http://whatever");
NodeList list = parser.parse (null);
// do something with your list of nodes.

Then look a little further:

NodeList parse(NodeFilter filter)

NodeFilter -> here

In my opinion, it could hardly be simpler.

Not to mention:

bin/parser http://website_url [tag_name]

where tag_name is an optional tag name to be used as a filter, i.e.
A - Show only the link tags extracted from the document
IMG - Show only the image tags extracted from the document
TITLE - Extract the title from the document

NOTE: this is also the default program for the htmlparser.jar, so the above could be:

java -jar lib/htmlparser.jar http://website_url [tag_name]

UPD:

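import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

// The class name is arbitrary; the original snippet showed only the main method.
public class ImageTagDump {
    public static void main(String[] args) {
        try {
            // Point the parser at the page and tell it the page's encoding
            Parser parser = new Parser("http://www.alliance-bags.ru/catalog.php?tov=576");
            parser.setEncoding("windows-1251");

            // Keep only the IMG tags
            NodeFilter imgFilter = new TagNameFilter("IMG");
            NodeList nodeList = parser.parse(imgFilter);

            // Print each matching tag as HTML
            for (int i = 0; i < nodeList.size(); i++) {
                Node node = nodeList.elementAt(i);
                System.out.println(node.toHtml());
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
    }
}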