Package org.apache.nutch.scoring.opic
Class OPICScoringFilter
java.lang.Object
  org.apache.nutch.scoring.opic.OPICScoringFilter
All Implemented Interfaces:
Configurable, Pluggable, ScoringFilter
public class OPICScoringFilter extends Object implements ScoringFilter
This plugin implements a variant of an Online Page Importance Computation (OPIC) score, described in this paper: Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive On-Line Page Importance Computation.
Author:
Andrzej Bialecki
-
Field Summary
-
Fields inherited from interface org.apache.nutch.scoring.ScoringFilter
X_POINT_ID
-
Constructor Summary
Constructor          Description
OPICScoringFilter()
-
Method Summary
Modifier and Type   Method
    Description

CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount)
    Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
float generatorSortValue(Text url, CrawlDatum datum, float initSort)
Configuration getConf()
float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore)
    Dampen the boost value by scorePower.
void initialScore(Text url, CrawlDatum datum)
    Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level.
void injectedScore(Text url, CrawlDatum datum)
    Set an initial score for newly injected pages.
void passScoreAfterParsing(Text url, Content content, Parse parse)
    Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
    Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
void setConf(Configuration conf)
void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked)
    Increase the score by a sum of inlinked scores.
-
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
-
Methods inherited from interface org.apache.nutch.scoring.ScoringFilter
orphanedScore
-
Method Detail
-
getConf
public Configuration getConf()
Specified by:
getConf in interface Configurable
-
setConf
public void setConf(Configuration conf)
Specified by:
setConf in interface Configurable
-
injectedScore
public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException
Description copied from interface: ScoringFilter
Set an initial score for newly injected pages. Note: newly injected pages may have no inlinks, so filter implementations may wish to set this score to a non-zero value, to give newly injected pages some initial credit.
Specified by:
injectedScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException - if there is a fatal error setting an initial score for newly injected pages
-
initialScore
public void initialScore(Text url, CrawlDatum datum) throws ScoringFilterException
Set to 0.0f (unknown value) - inlink contributions will bring it to a correct level. Newly discovered pages have at least one inlink.
Specified by:
initialScore in interface ScoringFilter
Parameters:
url - url of the page
datum - new datum. Filters will modify it in-place.
Throws:
ScoringFilterException - if there is a fatal error setting an initial score for newly discovered pages
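The difference between injectedScore and initialScore can be illustrated with a minimal sketch. It does not use the Nutch classes: scores are plain floats, the class InitialScoreSketch is hypothetical, and the injected value of 1.0f is an assumption chosen only to show a non-zero starting credit.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a float per URL replaces the score held in a CrawlDatum.
final class InitialScoreSketch {
    // Assumed non-zero credit for injected seeds (illustrative value only).
    static final float ASSUMED_INJECTED_SCORE = 1.0f;

    // Injected pages may have no inlinks, so they start with some credit.
    static float injectedScore() {
        return ASSUMED_INJECTED_SCORE;
    }

    // Discovered pages start at 0.0f; inlink contributions raise the score later.
    static float initialScore() {
        return 0.0f;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new LinkedHashMap<>();
        scores.put("http://seed.example/", injectedScore());
        scores.put("http://seed.example/discovered", initialScore());
        System.out.println(scores);
    }
}
```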
-
generatorSortValue
public float generatorSortValue(Text url, CrawlDatum datum, float initSort) throws ScoringFilterException
Specified by:
generatorSortValue in interface ScoringFilter
Parameters:
url - url of the page
datum - page's datum, should not be modified
initSort - initial sort value, or a value from previous filters in chain
Returns:
a sort value for use in sorting and selecting the top N scoring pages during fetchlist generation
Throws:
ScoringFilterException - if there is a fatal error preparing the sort value
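How a sort value feeds top-N selection can be sketched as follows. This page does not state how OPICScoringFilter computes the value, so multiplying the page score by initSort is an assumption made for illustration; GeneratorSortSketch and its methods are hypothetical and independent of Nutch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class GeneratorSortSketch {
    // Assumption: the sort value is the page score scaled by the incoming value.
    static float sortValue(float score, float initSort) {
        return score * initSort;
    }

    // Return up to n URLs with the highest sort value, best first.
    static List<String> topN(Map<String, Float> scores, float initSort, int n) {
        List<String> urls = new ArrayList<>(scores.keySet());
        urls.sort(Comparator.comparingDouble(
                (String u) -> sortValue(scores.get(u), initSort)).reversed());
        return new ArrayList<>(urls.subList(0, Math.min(n, urls.size())));
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new HashMap<>();
        scores.put("http://a.example/", 0.5f);
        scores.put("http://b.example/", 2.0f);
        scores.put("http://c.example/", 1.0f);
        System.out.println(topN(scores, 1.0f, 2));
    }
}
```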
-
updateDbScore
public void updateDbScore(Text url, CrawlDatum old, CrawlDatum datum, List<CrawlDatum> inlinked) throws ScoringFilterException
Increase the score by a sum of inlinked scores.
Specified by:
updateDbScore in interface ScoringFilter
Parameters:
url - url of the page
old - original datum, with original score. May be null if this is a newly discovered page. If not null, filters should use score values from this parameter as the starting values - the datum parameter may contain values that are no longer valid, if other updates occurred between generation and this update.
datum - the new datum, with the original score saved at the time when fetchlist was generated. Filters should update this in-place, and it will be saved in the crawldb.
inlinked - (partial) list of CrawlDatum-s (with their scores) from links pointing to this page, found in the current update batch.
Throws:
ScoringFilterException - if there is a fatal error calculating a new score of CrawlDatum during CrawlDb update
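The update rule described above (start from the old score when it exists, then add the sum of the inlinked scores) can be sketched with plain floats. UpdateDbScoreSketch is a hypothetical helper, not the Nutch implementation; a null Float stands in for a null old datum.

```java
import java.util.Arrays;
import java.util.List;

final class UpdateDbScoreSketch {
    // oldScore is null for a newly discovered page; datumScore is then the base.
    static float updatedScore(Float oldScore, float datumScore, List<Float> inlinked) {
        float adjust = 0.0f;
        for (float contribution : inlinked) {
            adjust += contribution; // sum the scores carried by inlinks in this batch
        }
        float base = (oldScore != null) ? oldScore : datumScore;
        return base + adjust;
    }

    public static void main(String[] args) {
        List<Float> inlinked = Arrays.asList(0.25f, 0.5f);
        System.out.println(updatedScore(1.0f, 5.0f, inlinked)); // uses the old score
        System.out.println(updatedScore(null, 0.0f, inlinked)); // new page
    }
}
```

Preferring the old score over the one in datum follows the parameter note: the value in datum may be stale if other updates happened after the fetchlist was generated.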
-
passScoreBeforeParsing
public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
Specified by:
passScoreBeforeParsing in interface ScoringFilter
Parameters:
url - url of the page
datum - source datum. NOTE: modifications to this value are not persisted.
content - instance of content. Implementations may modify this in-place, primarily by setting some metadata properties.
-
passScoreAfterParsing
public void passScoreAfterParsing(Text url, Content content, Parse parse)
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
Specified by:
passScoreAfterParsing in interface ScoringFilter
Parameters:
url - page url
content - original content. NOTE: modifications to this value are not persisted.
parse - target instance to copy the score information to. Implementations may modify this in-place, primarily by setting some metadata properties.
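The two pass-through steps, passScoreBeforeParsing and passScoreAfterParsing, hand the score from the datum to the content metadata and then to the parse metadata. A minimal sketch, assuming string-to-string maps in place of the Nutch metadata classes and a made-up key name in place of Fetcher.SCORE_KEY:

```java
import java.util.HashMap;
import java.util.Map;

final class PassScoreSketch {
    // Hypothetical key name; the real constant is Fetcher.SCORE_KEY.
    static final String ASSUMED_SCORE_KEY = "sketch.score.key";

    // Before parsing: store the datum's score as text in the content metadata.
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(ASSUMED_SCORE_KEY, Float.toString(datumScore));
    }

    // After parsing: copy the stored value from content metadata to parse metadata.
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        String value = contentMeta.get(ASSUMED_SCORE_KEY);
        if (value != null) {
            parseMeta.put(ASSUMED_SCORE_KEY, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(0.75f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta);
    }
}
```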
-
distributeScoreToOutlinks
public CrawlDatum distributeScoreToOutlinks(Text fromUrl, ParseData parseData, Collection<Map.Entry<Text,CrawlDatum>> targets, CrawlDatum adjust, int allCount) throws ScoringFilterException
Get a float value from Fetcher.SCORE_KEY, divide it by the number of outlinks and apply.
Specified by:
distributeScoreToOutlinks in interface ScoringFilter
Parameters:
fromUrl - url of the source page
parseData - ParseData instance, which stores relevant score value(s) in its metadata. NOTE: filters may modify this in-place, all changes will be persisted.
targets - <url, CrawlDatum> pairs. NOTE: filters can modify this in-place, all changes will be persisted.
adjust - a CrawlDatum instance, initially null, which implementations may use to pass adjustment values to the original CrawlDatum. When creating this instance, set its status to CrawlDatum.STATUS_LINKED.
allCount - number of all collected outlinks from the source page
Returns:
if needed, implementations may return an instance of CrawlDatum, with status CrawlDatum.STATUS_LINKED, which contains adjustments to be applied to the original CrawlDatum score(s) and metadata. This can be null if not needed.
Throws:
ScoringFilterException - if there is a fatal error distributing score data from the current page to all of its outlinks
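The "divide by the number of outlinks and apply" step can be sketched with plain collections. Using allCount as the divisor is an assumption here, since the description does not say whether all collected outlinks or only the retained targets are counted; DistributeScoreSketch is a hypothetical helper and gives every target the same share.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class DistributeScoreSketch {
    // Split pageScore evenly; each target URL receives pageScore / allCount.
    static Map<String, Float> distribute(float pageScore, List<String> targets, int allCount) {
        Map<String, Float> shares = new LinkedHashMap<>();
        if (allCount <= 0) {
            return shares; // no outlinks collected, nothing to distribute
        }
        float share = pageScore / allCount;
        for (String url : targets) {
            shares.put(url, share);
        }
        return shares;
    }

    public static void main(String[] args) {
        List<String> targets = Arrays.asList("http://a.example/", "http://b.example/");
        // The page had 4 outlinks in total, but only 2 targets were retained.
        System.out.println(distribute(1.0f, targets, 4));
    }
}
```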
-
indexerScore
public float indexerScore(Text url, NutchDocument doc, CrawlDatum dbDatum, CrawlDatum fetchDatum, Parse parse, Inlinks inlinks, float initScore) throws ScoringFilterException
Dampen the boost value by scorePower.
Specified by:
indexerScore in interface ScoringFilter
Parameters:
url - url of the page
doc - indexed document. NOTE: this already contains all information collected by indexing filters. Implementations may modify this instance, in order to store/remove some information.
dbDatum - current page from CrawlDb. NOTE:
- changes made to this instance are not persisted
- may be null if indexing is done without CrawlDb or if the segment is generated not from the CrawlDb (via FreeGenerator).
fetchDatum - datum from FetcherOutput (containing among others the fetching status)
parse - parsing result. NOTE: changes made to this instance are not persisted.
inlinks - current inlinks from LinkDb. NOTE: changes made to this instance are not persisted.
initScore - initial boost value for the indexed document.
Returns:
boost value for the indexed document. This value is passed as an argument to the next scoring filter in chain. NOTE: implementations may also express other scoring strategies by modifying the indexed document directly.
Throws:
ScoringFilterException - if there is a fatal error whilst calculating the indexed document score/boost
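One plausible reading of "dampen the boost value by scorePower" is to raise the CrawlDb score to the power scorePower and multiply by the incoming boost. The sketch below assumes that formula and an exponent below 1; neither is confirmed by this page. IndexerScoreSketch is hypothetical, and a null Float models the documented case of a null dbDatum.

```java
final class IndexerScoreSketch {
    // Assumed formula: boost = dbScore ^ scorePower * initScore.
    // A null dbScore (no CrawlDb datum) leaves the incoming boost unchanged.
    static float indexerScore(Float dbScore, float initScore, float scorePower) {
        if (dbScore == null) {
            return initScore;
        }
        return (float) Math.pow(dbScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        // With an exponent of 0.5, a 16x difference in score becomes a 4x boost.
        System.out.println(indexerScore(16.0f, 1.0f, 0.5f));
        System.out.println(indexerScore(null, 1.0f, 0.5f));
    }
}
```

An exponent between 0 and 1 compresses large score differences, so a few very high-scoring pages do not dominate ranking.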
-