Class DefaultEntityResolver

java.lang.Object
io.sf.carte.doc.xml.dtd.DefaultEntityResolver
All Implemented Interfaces:
EntityResolver, EntityResolver2

public class DefaultEntityResolver extends Object implements EntityResolver2
Implements EntityResolver2.

Bundles common W3C DTDs and entity sets, and loads others from the supplied SYSTEM URL, provided that all of the following conditions are met:

  • The URL protocol is http or https.
  • Either the MIME type is valid for a DTD or entity, or the filename ends with .dtd, .ent or .mod.
  • The whitelist is either disabled (no host added to it) or contains the host from the URL.

If the whitelist is enabled (for example, by the default constructor), any attempt to download data from a remote host that is not in the whitelist produces an exception. You can use this to detect documents that reference a DTD resource which is not bundled with this resolver.

If the constructor with a false argument was used, the whitelist can still be enabled by adding a hostname via addHostToWhiteList(String).
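
The conditions and whitelist behavior described above can be sketched with plain JDK classes. This is a minimal illustration, not the resolver's actual code: the class RemoteDtdPolicy and its methods are hypothetical names, and the case-insensitive matching of schemes, hosts and extensions is an assumption.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical stand-in for the documented checks; not the library's code. */
final class RemoteDtdPolicy {
    private final Set<String> whitelist = new HashSet<>();
    private boolean whitelistEnabled;

    RemoteDtdPolicy(boolean enableWhitelist) {
        this.whitelistEnabled = enableWhitelist;
    }

    /** Adding a host also enables the whitelist, mirroring the description above. */
    void addHost(String fqdn) {
        whitelist.add(fqdn.toLowerCase(Locale.ROOT));
        whitelistEnabled = true;
    }

    /**
     * @param validMediaType whether the response declared a DTD/entity media type
     */
    boolean isAllowed(URI uri, boolean validMediaType) {
        String scheme = uri.getScheme();
        String host = uri.getHost();
        String path = uri.getPath();
        if (scheme == null || host == null || path == null) {
            return false; // opaque or host-less URIs such as jar: or file:
        }
        // Condition 1: only http and https are accepted.
        scheme = scheme.toLowerCase(Locale.ROOT);
        if (!scheme.equals("http") && !scheme.equals("https")) {
            return false;
        }
        // Condition 2: valid media type, or a .dtd/.ent/.mod filename.
        String p = path.toLowerCase(Locale.ROOT); // assumption: case-insensitive
        boolean validExtension = p.endsWith(".dtd") || p.endsWith(".ent") || p.endsWith(".mod");
        if (!validMediaType && !validExtension) {
            return false;
        }
        // Condition 3: whitelist disabled, or host present in it.
        return !whitelistEnabled || whitelist.contains(host.toLowerCase(Locale.ROOT));
    }
}
```

In this sketch a policy created with false accepts any host until addHost is called, after which only the listed hosts pass.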

Although this resolver should protect you from most information leaks (see SSRF attacks) and from jar: decompression bombs, DoS attacks based on entity expansion or recursion, such as the 'billion laughs attack', may still be possible and should be prevented at the XML parser level. Be sure to use a properly configured, recent version of your parser.
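
As one way to follow the parser advice above, the standard JAXP secure-processing feature can be enabled on the factory that creates the parser. This is a sketch under the assumption that a JAXP SAX parser is in use; HardenedParserSketch is a hypothetical helper name, and the concrete expansion limits applied depend on the parser implementation and version.

```java
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;

/** Hypothetical helper; shows only the standard JAXP secure-processing switch. */
final class HardenedParserSketch {

    /** Create a SAX parser factory with secure processing turned on. */
    static SAXParserFactory newFactory() {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        try {
            // Asks the implementation to enforce its processing limits,
            // which typically include entity expansion limits.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Secure processing is not supported", e);
        }
        return factory;
    }

    /** Report whether the factory has secure processing enabled. */
    static boolean isSecureProcessingEnabled(SAXParserFactory factory) {
        try {
            return factory.getFeature(XMLConstants.FEATURE_SECURE_PROCESSING);
        } catch (ParserConfigurationException | SAXException e) {
            return false;
        }
    }
}
```

A reader obtained with newFactory().newSAXParser().getXMLReader() can then be given the resolver through setEntityResolver.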

  • Constructor Details

    • DefaultEntityResolver

      public DefaultEntityResolver()
      Construct a resolver with the whitelist enabled.
    • DefaultEntityResolver

      public DefaultEntityResolver(boolean enableWhitelist)
      Construct a resolver with the whitelist enabled or disabled according to enableWhitelist.
      Parameters:
      enableWhitelist - false to allow connections to any host when retrieving DTDs or entities; true to enable the (initially empty) whitelist, so that no network connections are allowed until a host is added to it.
  • Method Details

    • addHostToWhiteList

      public void addHostToWhiteList(String fqdn)
      Add the given host to a whitelist for remote DTD fetching.

      If the whitelist is enabled, only http or https URLs will be allowed.

      Parameters:
      fqdn - the fully qualified domain name to add to the whitelist.
    • getExternalSubset

      public InputSource getExternalSubset(String name, String baseURI) throws SAXException, IOException
      Allows applications to provide an external subset for documents that don't explicitly define one.

      Documents with DOCTYPE declarations that omit an external subset can thus augment the declarations available for validation, entity processing, and attribute processing (normalization, defaulting, and reporting types including ID). This augmentation is reported through the startDTD() method as if the document text had originally included the external subset; this callback is made before any internal subset data or errors are reported.

      This method can also be used with documents that have no DOCTYPE declaration. When the root element is encountered but no DOCTYPE declaration has been seen, this method is invoked. If it returns a value for the external subset, that root element is declared to be the root element, giving the effect of splicing a DOCTYPE declaration at the end of the prolog of a document that could not otherwise be valid. The sequence of parser callbacks in that case logically resembles this:

       ... comments and PIs from the prolog (as usual)
       startDTD ("rootName", source.getPublicId (), source.getSystemId ());
       startEntity ("[dtd]");
       ... declarations, comments, and PIs from the external subset
       endEntity ("[dtd]");
       endDTD ();
       ... then the rest of the document (as usual)
       startElement (..., "rootName", ...);
       

      Note that the InputSource gets no further resolution. Also, this method will never be used by a (non-validating) processor that is not including external parameter entities.

      Uses for this method include facilitating data validation when interoperating with XML processors that would always require undesirable network accesses for external entities, or which for other reasons adopt a "no DTDs" policy.

      Warning: returning an external subset modifies the input document. By providing definitions for general entities, it can make a malformed document appear to be well formed.
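
      The idea of supplying a fallback external subset can be sketched with the standard EntityResolver2 interface. This is an illustration only: FallbackSubsetSketch is a hypothetical class, and its two entity declarations are a stand-in for a real entity set such as the HTML5 entities.

```java
import java.io.StringReader;
import org.xml.sax.InputSource;
import org.xml.sax.ext.EntityResolver2;

/** Hypothetical resolver that always offers a small fallback external subset. */
final class FallbackSubsetSketch implements EntityResolver2 {

    // Stand-in declarations; a real resolver would load a complete entity set.
    static final String FALLBACK_ENTITIES =
            "<!ENTITY nbsp \"&#160;\">\n"
          + "<!ENTITY copy \"&#169;\">\n";

    @Override
    public InputSource getExternalSubset(String name, String baseURI) {
        // Returned for any root element name in this sketch.
        return new InputSource(new StringReader(FALLBACK_ENTITIES));
    }

    @Override
    public InputSource resolveEntity(String name, String publicId,
            String baseURI, String systemId) {
        return null; // let the parser use its default behavior
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        return null; // let the parser use its default behavior
    }
}
```

      Such a resolver would be installed with XMLReader.setEntityResolver; whether getExternalSubset is actually consulted depends on the parser supporting EntityResolver2.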

      Specified by:
      getExternalSubset in interface EntityResolver2
      Parameters:
      name - Identifies the document root element. This name comes from a DOCTYPE declaration (where available) or from the actual root element.
      baseURI - The document's base URI, serving as an additional hint for selecting the external subset. This is always an absolute URI, unless it is null because the XMLReader was given an InputSource without one.
      Returns:
      an InputSource object describing the new external subset to be used by the parser. If no specific subset could be determined, an input source describing the HTML5 entities is returned.
      Throws:
      SAXException - if either the provided arguments or the input source were invalid or not allowed.
      IOException - if an I/O problem was found while loading the input source.
    • registerSystemIdFilename

      protected boolean registerSystemIdFilename(String systemId, String filename)
      Register an internal classpath filename from which the DTD for the given SystemId is to be retrieved.
      Parameters:
      systemId - the SystemId.
      filename - the internal filename. Must point to a resource with UTF-8 encoding.
      Returns:
      true if the new SystemId was successfully registered, false if it was already registered.
      Throws:
      IllegalArgumentException - if the filename is considered invalid by isInvalidPath(String).
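
      The contract described above (true for a new registration, false for an existing one, an exception for an invalid filename) can be sketched as follows. SystemIdRegistrySketch and filenameFor are hypothetical names, and the case-insensitive extension check is an assumption rather than the library's documented behavior.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical registry mirroring the documented return and exception contract. */
final class SystemIdRegistrySketch {
    private final Map<String, String> systemIdToFilename = new HashMap<>();

    /** A path is invalid unless it ends with .dtd, .ent or .mod. */
    boolean isInvalidPath(String path) {
        String p = path.toLowerCase(Locale.ROOT); // assumption: case-insensitive
        return !(p.endsWith(".dtd") || p.endsWith(".ent") || p.endsWith(".mod"));
    }

    /** Returns true if newly registered, false if the SystemId was already present. */
    boolean register(String systemId, String filename) {
        if (isInvalidPath(filename)) {
            throw new IllegalArgumentException("Invalid DTD filename: " + filename);
        }
        // putIfAbsent keeps the first mapping and returns the previous value, if any.
        return systemIdToFilename.putIfAbsent(systemId, filename) == null;
    }

    /** Look up the filename registered for a SystemId, or null if none. */
    String filenameFor(String systemId) {
        return systemIdToFilename.get(systemId);
    }
}
```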
    • resolveEntity

      public InputSource resolveEntity(String name, String publicId, String baseURI, String systemId) throws SAXException, IOException
      Allows applications to map references to external entities into input sources.

      This method is only called for external entities which have been properly declared. It provides more flexibility than the EntityResolver interface, supporting implementations of more complex catalogue schemes such as the one defined by the OASIS XML Catalogs specification.

      Parsers configured to use this resolver method will call it to determine the input source to use for any external entity being included because of a reference in the XML text. That excludes the document entity, and any external entity returned by getExternalSubset(). When a (non-validating) processor is configured not to include a class of entities (parameter or general) through use of feature flags, this method is not invoked for such entities.

      If no valid input source could be determined, this method will throw a SAXException instead of returning null as other implementations would do. If you have to retrieve a DTD which is not directly provided by this resolver, you need to whitelist the host using addHostToWhiteList(String) first. Make sure that either the systemId URL ends with a valid extension, or that the retrieved URL was served with a valid DTD media type.

      Note that the entity naming scheme used here is the same one used in the LexicalHandler, or in the ContentHandler.skippedEntity() method.
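
      The fail-closed behavior described above, throwing instead of returning null, can be sketched with a small EntityResolver2 implementation. StrictResolverSketch and its single bundled entry are hypothetical; the sketch does not resolve relative system identifiers against baseURI and performs no network access.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.ext.EntityResolver2;

/** Hypothetical resolver that serves bundled content and rejects everything else. */
final class StrictResolverSketch implements EntityResolver2 {

    // Hypothetical bundled DTD text keyed by system identifier.
    private final Map<String, String> bundled = Map.of(
            "https://www.example.com/dtd/sample.dtd", "<!ELEMENT doc (#PCDATA)>");

    @Override
    public InputSource getExternalSubset(String name, String baseURI) {
        return null; // no default subset in this sketch
    }

    @Override
    public InputSource resolveEntity(String name, String publicId,
            String baseURI, String systemId) throws SAXException, IOException {
        String dtd = systemId == null ? null : bundled.get(systemId);
        if (dtd == null) {
            // Fail closed: never hand control back to the parser's own loader.
            throw new SAXException("Not allowed to resolve: " + systemId);
        }
        InputSource source = new InputSource(new StringReader(dtd));
        source.setPublicId(publicId);
        source.setSystemId(systemId);
        return source;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        return resolveEntity(null, publicId, null, systemId);
    }
}
```

      Returning null would let many parsers fetch the entity themselves, which is why the sketch throws for anything it does not recognize.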

      Specified by:
      resolveEntity in interface EntityResolver2
      Parameters:
      name - Identifies the external entity being resolved. Either "[dtd]" for the external subset, or a name starting with "%" to indicate a parameter entity, or else the name of a general entity. This is never null when invoked by a SAX2 parser.
      publicId - The public identifier of the external entity being referenced (normalized as required by the XML specification), or null if none was supplied.
      baseURI - The URI with respect to which relative systemIDs are interpreted. This is always an absolute URI, unless it is null (likely because the XMLReader was given an InputSource without one). This URI is defined by the XML specification to be the one associated with the "<" starting the relevant declaration.
      systemId - The system identifier of the external entity being referenced; either a relative or absolute URI.
      Returns:
      an InputSource object describing the new input source to be used by the parser. This implementation never returns null if systemId is non-null.
      Throws:
      SAXException - if either the provided arguments or the input source were invalid or not allowed.
      IOException - if an I/O problem was found while forming the URL to the input source, or when connecting to it.
    • isInvalidPath

      protected boolean isInvalidPath(String path)
      Determine if the given path is considered invalid for a DTD.

      To be valid, the path must end with .dtd, .ent or .mod.

      Parameters:
      path - the path to check.
      Returns:
      true if the path is invalid for a DTD, false otherwise.
    • isWhitelistEnabled

      protected boolean isWhitelistEnabled()
      Is the whitelist enabled?
      Returns:
      true if the whitelist is enabled.
    • isInvalidProtocol

      protected boolean isInvalidProtocol(String protocol)
      Is the given protocol unsupported by this resolver?

      Only http and https are valid.

      Parameters:
      protocol - the protocol.
      Returns:
      true if this resolver considers the given protocol invalid.
    • isWhitelistedHost

      protected boolean isWhitelistedHost(String host)
      Is the given host whitelisted?
      Parameters:
      host - the host to test.
      Returns:
      true if the given host is whitelisted.
    • openConnection

      protected URLConnection openConnection(URL url) throws IOException
      Open a connection to the given URL.
      Parameters:
      url - the URL to connect to.
      Returns:
      the connection.
      Throws:
      IOException - if an I/O error happened opening the connection.
    • connect

      protected void connect(URLConnection con) throws IOException
      Connect the given URLConnection.
      Parameters:
      con - the URLConnection.
      Throws:
      IOException - if a problem happened connecting.
    • isValidContentType

      protected boolean isValidContentType(String conType)
      Is the given string a valid DTD/entity content-type?
      Parameters:
      conType - the content-type.
      Returns:
      true if it is a valid DTD/entity content-type.
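
      A content-type check of this kind could look like the sketch below. The accepted list is an assumption (the media types registered for DTDs and external parsed entities in RFC 7303), not necessarily the set this resolver accepts, and DtdContentTypeSketch is a hypothetical class name.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical content-type check; the accepted list is an assumption. */
final class DtdContentTypeSketch {

    // Media types registered for DTDs and external parsed entities (RFC 7303).
    private static final Set<String> ACCEPTED = Set.of(
            "application/xml-dtd",
            "application/xml-external-parsed-entity",
            "text/xml-external-parsed-entity");

    static boolean isValidContentType(String conType) {
        if (conType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semi = conType.indexOf(';');
        String mediaType = (semi < 0 ? conType : conType.substring(0, semi))
                .trim().toLowerCase(Locale.ROOT);
        return ACCEPTED.contains(mediaType);
    }
}
```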
    • resolveEntity

      public InputSource resolveEntity(String publicId, String systemId) throws SAXException, IOException
      Allow the application to resolve external entities.

      The parser will call this method before opening any external entity except the top-level document entity. Such entities include the external DTD subset and external parameter entities referenced within the DTD (in either case, only if the parser reads external parameter entities), and external general entities referenced within the document element (if the parser reads external general entities). The application may request that the parser locate the entity itself, that it use an alternative URI, or that it use data provided by the application (as a character or byte input stream).

      If no valid input source could be determined, this method will throw a SAXException instead of returning null as other implementations would do. If you have to retrieve a DTD which is not directly provided by this resolver, you need to whitelist the host using addHostToWhiteList(String) first. Make sure that either the systemId URL ends with a valid extension, or that the retrieved URL was served with a valid DTD media type.

      Specified by:
      resolveEntity in interface EntityResolver
      Parameters:
      publicId - The public identifier of the external entity being referenced, or null if none was supplied.
      systemId - The system identifier of the external entity being referenced.
      Returns:
      an InputSource object describing the new input source. This implementation never returns null if systemId is non-null.
      Throws:
      SAXException - if either the provided arguments or the input source were invalid or not allowed.
      IOException - if an I/O problem was found while forming the URL to the input source, or when connecting to it.
    • resolveEntity

      public InputSource resolveEntity(DocumentType dtDecl) throws SAXException, IOException
      Resolve external entities according to the given DocumentType.

      If no valid input source could be determined, this method will throw a SAXException instead of returning null as other implementations would do. If you have to retrieve a DTD which is not directly provided by this resolver, you need to whitelist the host using addHostToWhiteList(String) first. Make sure that either the systemId URL ends with a valid extension, or that the retrieved URL was served with a valid DTD media type.

      Parameters:
      dtDecl - the DocumentType.
      Returns:
      an InputSource object describing the new input source.
      Throws:
      SAXException - if either the provided arguments or the input source were invalid or not allowed.
      IOException - if an I/O problem was found while forming the URL to the input source, or when connecting to it.
    • setClassLoader

      public void setClassLoader(ClassLoader loader)
      Set the class loader to be used to read the built-in DTDs.
      Parameters:
      loader - the class loader.