Package org.apache.nutch.parse
Class OutlinkExtractor
- java.lang.Object
-
- org.apache.nutch.parse.OutlinkExtractor
-
public class OutlinkExtractor extends Object
Extractor to extract Outlinks / URLs from plain text using regular expressions.
- Since:
- 0.7
- Version:
- 1.0
- Author:
- Stephan Strittmatter - http://www.sybit.de
- See Also:
- Comparison of different regexp-Implementations, Overview about Java Regexp APIs
-
-
Constructor Summary
Constructors
Constructor          Description
OutlinkExtractor()
-
Method Summary
Modifier and Type    Method    Description
static Outlink[]    getOutlinks(String plainText, String anchor, Configuration conf)    Extracts Outlinks from given plain text and adds anchor to the extracted Outlinks.
static Outlink[]    getOutlinks(String plainText, Configuration conf)    Extracts Outlinks from given plain text.
-
-
-
Method Detail
-
getOutlinks
public static Outlink[] getOutlinks(String plainText, Configuration conf)
Extracts Outlinks from the given plain text. Applying this method to input that is not plain text can result in extremely long runtimes in pathological cases (PostScript is a known example).
- Parameters:
- plainText - the plain text from which URLs should be extracted.
- conf - a populated Configuration
- Returns:
- Array of Outlinks found in plainText
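The idea of regex-based URL extraction from plain text can be illustrated with a minimal stand-alone sketch. This is not the Nutch implementation: the class name PlainTextLinkSketch, the method extractUrls, and the deliberately simple URL pattern are all assumptions made for illustration, and the sketch uses only java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-alone sketch; NOT the Nutch OutlinkExtractor implementation. */
final class PlainTextLinkSketch {

    // Assumed, simplified pattern: a known scheme followed by characters
    // that are not whitespace, angle brackets, or quotes.
    private static final Pattern URL_PATTERN =
            Pattern.compile("\\b(?:https?|ftp)://[^\\s<>\"']+");

    /** Returns every URL-like substring of plainText, in order of appearance. */
    static List<String> extractUrls(String plainText) {
        List<String> urls = new ArrayList<>();
        if (plainText == null) {
            return urls; // nothing to scan
        }
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String text = "Docs at https://nutch.apache.org/ and mirror ftp://example.org/pub today";
        for (String url : extractUrls(text)) {
            System.out.println(url);
        }
    }
}
```

A pattern this simple scans the input in a single pass. More elaborate URL patterns can backtrack heavily on long, unstructured input, which is one plausible reason to restrict such extraction to genuine plain text.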
-
getOutlinks
public static Outlink[] getOutlinks(String plainText, String anchor, Configuration conf)
Extracts Outlinks from the given plain text and adds anchor to the extracted Outlinks.
- Parameters:
- plainText - the plain text from which URLs should be extracted.
- anchor - the anchor of the URL
- conf - a populated Configuration
- Returns:
- Array of Outlinks found in plainText
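How a single anchor text might be attached to every extracted link can be sketched as follows. The names AnchoredLinkSketch, Link, and extractLinks are hypothetical, the URL pattern is an assumed simplification, and treating a null anchor as an empty string is a choice made for this sketch only. None of this is taken from the Nutch source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of pairing one anchor with each extracted URL. */
final class AnchoredLinkSketch {

    /** Hypothetical stand-in for an outlink: a target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public String toString() {
            return "toUrl: " + toUrl + " anchor: " + anchor;
        }
    }

    // Assumed, simplified pattern for http and https URLs.
    private static final Pattern URL_PATTERN =
            Pattern.compile("\\bhttps?://[^\\s<>\"']+");

    /** Extracts URL-like substrings and gives each one the same anchor. */
    static List<Link> extractLinks(String plainText, String anchor) {
        List<Link> links = new ArrayList<>();
        if (plainText == null) {
            return links;
        }
        // Sketch-only choice: a missing anchor becomes the empty string.
        String safeAnchor = (anchor == null) ? "" : anchor;
        Matcher matcher = URL_PATTERN.matcher(plainText);
        while (matcher.find()) {
            links.add(new Link(matcher.group(), safeAnchor));
        }
        return links;
    }

    public static void main(String[] args) {
        String text = "Project page http://example.org/a and wiki https://example.org/b here";
        for (Link link : extractLinks(text, "project links")) {
            System.out.println(link);
        }
    }
}
```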
-
-