Source code of TYPO3 4.1.3

/typo3/sysext/indexed_search/ -> class.indexer.php (summary)

This class is a search indexer for TYPO3

Author: Kasper Skårhøj <kasperYYYY@typo3.com>
Size: 2071 lines (73 kb)
Included or required: 0 times
Referenced: 0 times
Requires: 0 files

Defines 1 class

tx_indexedsearch_indexer:: (59 methods):
  hook_indexContent()
  backend_initIndexer()
  backend_setFreeIndexUid()
  backend_indexAsTYPO3Page()
  init()
  initializeExternalParsers()
  indexTypo3PageContent()
  splitHTMLContent()
  getHTMLcharset()
  convertHTMLToUtf8()
  embracingTags()
  typoSearchTags()
  extractLinks()
  extractHyperLinks()
  indexExternalUrl()
  getUrlHeaders()
  indexRegularDocument()
  readFileContent()
  fileContentParts()
  splitRegularContent()
  charsetEntity2utf8()
  processWordsInArrays()
  procesWordsInArrays()
  bodyDescription()
  indexAnalyze()
  analyzeHeaderinfo()
  analyzeBody()
  metaphone()
  submitPage()
  submit_grlist()
  submit_section()
  removeOldIndexedPages()
  submitFilePage()
  submitFile_grlist()
  submitFile_section()
  removeOldIndexedFiles()
  checkMtimeTstamp()
  checkContentHash()
  checkExternalDocContentHash()
  is_grlist_set()
  update_grlist()
  updateTstamp()
  updateSetId()
  updateParsetime()
  updateRootline()
  getRootLineFields()
  removeLoginpagesWithContentHash()
  includeCrawlerClass()
  checkWordList()
  submitWords()
  freqMap()
  setT3Hashes()
  setExtHashes()
  md5inthash()
  makeCHash()
  log_push()
  log_pull()
  log_setTSlogMessage()
  fe_headerNoCache()


Class: tx_indexedsearch_indexer  - X-Ref

Indexing class for TYPO3 frontend

hook_indexContent(&$pObj)   X-Ref
Parent Object (TSFE) Initialization

param: object        Parent Object (frontend TSFE object), passed by reference
return: void

backend_initIndexer($id, $type, $sys_language_uid, $MP, $uidRL, $cHash_array=array()   X-Ref
Initializing the "combined ID" of the page (phash) being indexed (or for which external media is attached)

param: integer        The page uid, &id=
param: integer        The page type, &type=
param: integer        sys_language uid, typically &L=
param: string        The MP variable (Mount Points), &MP=
param: array        Rootline array of only UIDs.
param: array        Array of GET variables to register with this indexing
param: boolean        If set, calculates a cHash value from the $cHash_array. Probably you will not do that since such cases are indexed through the frontend and the idea of this interface is to index non-cachable pages from the backend!
return: void

backend_setFreeIndexUid($freeIndexUid, $freeIndexSetId=0)   X-Ref
Sets the free-index uid. Can be called right after backend_initIndexer()

param: integer        Free index UID
param: integer        Set id - an integer identifying the "set" of indexing operations.
return: void

backend_indexAsTYPO3Page($title, $keywords, $description, $content, $charset, $mtime, $crdate=0, $recordUid=0)   X-Ref
Indexing records as the content of a TYPO3 page.

param: string        Title equivalent
param: string        Keywords equivalent
param: string        Description equivalent
param: string        The main content to index
param: string        The charset of the title, keyword, description and body-content. MUST BE VALID, otherwise nothing is indexed!
param: integer        Last modification time, in seconds
param: integer        The creation date of the content, in seconds
param: integer        The record UID that the content comes from (for registration with the indexed rows)
return: void
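
The three backend_* methods documented above read as if they are meant to be called in sequence. A minimal sketch of that order follows, assuming an already constructed indexer object that needs no further setup. The helper name indexRecordFromBackend() and the keys of $record are hypothetical, and the page type, language and MP values are placeholders.

```php
<?php
/**
 * Hypothetical helper: feeds one record to the indexer from the backend.
 * $indexer is assumed to be a tx_indexedsearch_indexer instance (or any
 * object exposing the same three backend_* methods).
 */
function indexRecordFromBackend($indexer, array $record, $pageUid, array $rootlineUids)
{
    // 1) Set up the "combined ID" (phash) for the page the record belongs to.
    //    Page type 0, language 0 and an empty MP value are placeholder choices.
    $indexer->backend_initIndexer($pageUid, 0, 0, '', $rootlineUids, array());

    // 2) Optional: register the free-index uid right after initialization.
    if (isset($record['freeIndexUid'])) {
        $indexer->backend_setFreeIndexUid($record['freeIndexUid']);
    }

    // 3) Index the record as if it were the content of a TYPO3 page.
    //    The charset must be valid, otherwise nothing is indexed.
    $indexer->backend_indexAsTYPO3Page(
        $record['title'],
        $record['keywords'],
        $record['description'],
        $record['content'],
        'utf-8',
        $record['mtime'],
        $record['crdate'],
        $record['uid']
    );
}
```

Passing the indexer in as an argument keeps the sketch independent of how the object is created, which this page does not describe.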

init()   X-Ref
Initializes the object. $this->conf MUST be set with proper values prior to this call!!!

return: void

initializeExternalParsers()   X-Ref
Initialize external parsers

return: void

indexTypo3PageContent()   X-Ref
Start indexing of the TYPO3 page

return: void

splitHTMLContent($content)   X-Ref
Splits HTML content and returns an associative array, with title, a list of metatags, and a list of words in the body.

param: string        HTML content to index. To some degree expected to be made by TYPO3 (i.e. splitting the header by ":")
return: array        Array of content, having keys "title", "body", "keywords" and "description" set.

getHTMLcharset($content)   X-Ref
Extract the charset value from HTML meta tag.

param: string        HTML content
return: string        The charset value if found.
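
As an illustration of what extracting the charset from a meta tag can look like, here is a simplified sketch. It is not the method's actual code: the function name extractMetaCharset() and the regular expression are assumptions, and returning an empty string when nothing is found is also an assumed convention.

```php
<?php
/**
 * Hypothetical stand-in for getHTMLcharset(): returns the charset named in a
 * <meta> tag (lower-cased), or '' if none is found.
 */
function extractMetaCharset($content)
{
    // Matches e.g. <meta http-equiv="Content-Type" content="text/html; charset=XYZ">
    $pattern = '/<meta[^>]+charset\s*=\s*["\']?([a-z0-9_\-]+)/i';
    if (preg_match($pattern, $content, $matches)) {
        return strtolower($matches[1]);
    }
    return '';
}
```

For the input `<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">` this sketch returns "iso-8859-1".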

convertHTMLToUtf8($content,$charset='')   X-Ref
Converts a HTML document to utf-8

param: string        HTML content, any charset
param: string        Optional charset (otherwise extracted from HTML)
return: string        Converted HTML

embracingTags($string,$tagName,&$tagContent,&$stringAfter,&$paramList)   X-Ref
Finds the first occurrence of embracing tags and returns the embraced content and the original string with
the tag removed in the two passed variables. Returns false if no match is found. Useful e.g. for finding
the <title> of a document or removing <script> sections.

param: string        String to search in
param: string        Tag name, eg. "script"
param: string        Passed by reference: Content inside found tag
param: string        Passed by reference: Content after found tag
param: string        Passed by reference: Attributes of the found tag.
return: boolean        Returns false if tag was not found, otherwise true.
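
The by-reference output parameters are easier to follow with an example. The following is a hypothetical re-implementation of the described behaviour, not the method's own code. The name findEmbracingTag() is an assumption, and unlike a real HTML parser it does not distinguish "<title" from a longer tag name that merely starts with it.

```php
<?php
/**
 * Hypothetical sketch of the behaviour described for embracingTags().
 * Results are written to the three by-reference parameters.
 */
function findEmbracingTag($string, $tagName, &$tagContent, &$stringAfter, &$paramList)
{
    $startTag = '<' . $tagName;
    $endTag = '</' . $tagName . '>';

    // Case-insensitive search for the opening tag, its closing ">" and the end tag.
    $start = stripos($string, $startTag);
    if ($start === false) {
        return false;
    }
    $openEnd = strpos($string, '>', $start);
    if ($openEnd === false) {
        return false;
    }
    $close = stripos($string, $endTag, $openEnd);
    if ($close === false) {
        return false;
    }

    // Everything between "<tag" and ">" is the attribute list.
    $attrStart = $start + strlen($startTag);
    $paramList = trim(substr($string, $attrStart, $openEnd - $attrStart));
    // Everything between the opening and the closing tag is the embraced content.
    $tagContent = substr($string, $openEnd + 1, $close - $openEnd - 1);
    // The original string with the whole tag section cut out.
    $stringAfter = substr($string, 0, $start) . substr($string, $close + strlen($endTag));
    return true;
}
```

Called with `'<head><title lang="en">Hello</title></head>'` and the tag name "title", the sketch returns true and sets $tagContent to "Hello", $paramList to `lang="en"` and $stringAfter to "<head></head>".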

typoSearchTags(&$body)   X-Ref
Removes content that shouldn't be indexed according to TYPO3SEARCH-tags.

param: string        HTML Content, passed by reference
return: boolean        Returns true if a TYPO3SEARCH_ tag was found, otherwise false.

extractLinks($content)   X-Ref
Extract links (hrefs) from HTML content and if indexable media is found, it is indexed.

param: string        HTML content
return: void

extractHyperLinks($string)   X-Ref
Extracts all links to external documents from content string.

param: string        Content to analyse
return: array        Array of hyperlinks

indexExternalUrl($externalUrl)   X-Ref
Indexes the HTML content of an external URL

param: string        URL, eg. "http://typo3.org/"
return: void

getUrlHeaders($url)   X-Ref
Gets the HTTP headers returned for a URL

param: string        The URL
param: integer        Timeout (seconds?)
return: mixed        If no answer, returns false. Otherwise an array where HTTP headers are keys

indexRegularDocument($file, $force=FALSE, $contentTmpFile='', $altExtension='')   X-Ref
Indexing a regular document given as $file (relative to PATH_site, local file)

param: string        Relative Filename, relative to PATH_site. It can also be an absolute path as long as it is inside the lockRootPath (validated with t3lib_div::isAbsPath()). Finally, if $contentTmpFile is set, this value can be anything, most likely a URL
param: boolean        If set, indexing is forced (despite content hashes, mtime etc).
param: string        Temporary file with the content to read it from (instead of $file). Used when the $file is a URL.
param: string        File extension for temporary file.
return: void

readFileContent($ext,$absFile,$cPKey)   X-Ref
Reads the content of an external file being indexed.
The content from the external parser MUST be returned in utf-8!

param: string        File extension, eg. "pdf", "doc" etc.
param: string        Absolute filename of file (must exist and be validated OK before calling function)
param: string        Pointer to section (zero for all other than PDF which will have an indication of pages into which the document should be split.)
return: array        Standard content array (title, description, keywords, body keys)

fileContentParts($ext,$absFile)   X-Ref
Creates an array with pointers to divisions of document.

param: string        File extension
param: string        Absolute filename (must exist and be validated OK before calling function)
return: array        Array of pointers to sections that the document should be divided into

splitRegularContent($content)   X-Ref
Splits non-HTML content (from external files for instance)

param: string        Input content (non-HTML) to index.
return: array        Array of content, having the key "body" set (plus "title", "description" and "keywords", but empty)

charsetEntity2utf8(&$contentArr, $charset)   X-Ref
Convert character set and HTML entities in the value of input content array keys

param: array        Standard content array
param: string        Charset of the input content (converted to utf-8)
return: void

processWordsInArrays($contentArr)   X-Ref
Processing words in the array from split*Content -functions

param: array        Array of content to index, see splitHTMLContent() and splitRegularContent()
return: array        Content input array modified so each key is now a unique array of words

procesWordsInArrays($contentArr)   X-Ref
Processing words in the array from split*Content -functions
This function is only a wrapper, kept because the function has been renamed (see processWordsInArrays() above).

param: array        Array of content to index, see splitHTMLContent() and splitRegularContent()
return: array        Content input array modified so each key is now a unique array of words

bodyDescription($contentArr)   X-Ref
Extracts the sample description text from the content array.

param: array        Content array
return: string        Description string

indexAnalyze($content)   X-Ref
Analyzes content to use for indexing.

param: array        Standard content array: an array with the keys title,keywords,description and body, which all contain an array of words.
return: array        Index Array (whatever that is...)

analyzeHeaderinfo(&$retArr,$content,$key,$offset)   X-Ref
Calculates relevant information for headercontent

param: array        Index array, passed by reference
param: array        Standard content array
param: string        Key from standard content array
param: integer        Bit-wise priority to type
return: void

analyzeBody(&$retArr,$content)   X-Ref
Calculates relevant information for bodycontent

param: array        Index array, passed by reference
param: array        Standard content array
return: void

metaphone($word,$retRaw=FALSE)   X-Ref
Creating metaphone based hash from input word

param: string        Word to convert
param: boolean        If set, returns the raw metaphone value (not hashed)
return: mixed        Metaphone hash integer (or raw value, string)

submitPage()   X-Ref
Updates db with information about the page (TYPO3 page, not external media)

return: void

submit_grlist($hash,$phash_x)   X-Ref
Stores gr_list in the database.

param: integer        Search result record phash
param: integer        Actual phash of current content
return: void

submit_section($hash,$hash_t3)   X-Ref
Stores section
$hash and $hash_t3 are the same for TYPO3 pages, but different when it is external files.

param: integer        phash of TYPO3 parent search result record
param: integer        phash of the file indexation search record
return: void

removeOldIndexedPages($phash)   X-Ref
Removes records for the indexed page, $phash

param: integer        phash value to flush
return: void

submitFilePage($hash,$file,$subinfo,$ext,$mtime,$ctime,$size,$content_md5h,$contentParts)   X-Ref
Updates db with information about the file

param: array        Array with phash and phash_grouping keys for file
param: string        File name
param: array        Array of "cHashParams" for files: This is for instance the page index for a PDF file (other document types it will be a zero)
param: string        File extension determining the type of media.
param: integer        Modification time of file.
param: integer        Creation time of file.
param: integer        Size of file in bytes
param: integer        Content HASH value.
param: array        Standard content array (using only title and body for a file)
return: void

submitFile_grlist($hash)   X-Ref
Stores file gr_list for a file IF it does not exist already

param: integer        phash value of file
return: void

submitFile_section($hash)   X-Ref
Stores file section for a file IF it does not exist

param: integer        phash value of file
return: void

removeOldIndexedFiles($phash)   X-Ref
Removes records for the indexed file, $phash

param: integer        phash value to flush
return: void

checkMtimeTstamp($mtime,$phash)   X-Ref
Check the mtime / tstamp of the currently indexed page/file (based on phash)
Return positive integer if the page needs to be indexed

param: integer        mtime value to test against limits and indexed page (usually this is the mtime of the cached document)
param: integer        "phash" used to select any already indexed page to see what its mtime is.
return: integer        Result integer. Generally <0 = no indexing, >0 = do indexing (see $this->reasons):
        -2) Min age was NOT exceeded, so indexing cannot occur.
        -1) mtime matched, so there is no need to reindex the page.
         0) N/A
         1) Max age exceeded, page must be indexed again.
         2) mtime of indexed page doesn't match mtime given for current content, so the page must be indexed.
         3) No mtime was set, so the page will be indexed.
         4) No indexed page found, so the page will of course be indexed.
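
The result codes can be read as a small decision procedure. The sketch below is a database-free illustration of it, not the method's code. The function name decideReindex(), the $row keys 'item_mtime' and 'tstamp', the $minAge/$maxAge parameters (in seconds, 0 meaning "no limit") and the order in which the conditions are tested are all assumptions.

```php
<?php
/**
 * Hypothetical sketch of the decision described for checkMtimeTstamp().
 * $row stands in for the already indexed record, or null if none exists.
 */
function decideReindex($mtime, $row, $minAge, $maxAge, $now)
{
    if ($row === null) {
        return 4;   // No indexed page found: index.
    }
    if ($maxAge && $row['tstamp'] + $maxAge < $now) {
        return 1;   // Max age exceeded: index again.
    }
    if ($minAge && $row['tstamp'] + $minAge >= $now) {
        return -2;  // Min age not exceeded: indexing cannot occur.
    }
    if (!$mtime) {
        return 3;   // No mtime given: index.
    }
    if ((int)$row['item_mtime'] !== (int)$mtime) {
        return 2;   // mtime differs from the indexed one: index.
    }
    return -1;      // mtime matched: no need to reindex.
}
```

A caller would then index only when the result is greater than zero, which matches the "<0 = no indexing, >0 = do indexing" convention stated for the return value.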

checkContentHash()   X-Ref
Check content hash in phash table

return: mixed        Returns true if the page needs to be indexed (that is, there was no result), otherwise the phash value (in an array) of the phash record to which the grlist_record should be related!

checkExternalDocContentHash($hashGr,$content_md5h)   X-Ref
Check content hash for external documents
Returns true if the document needs to be indexed (that is, there was no result)

param: integer        phash value to check (phash_grouping)
param: integer        Content hash to check
return: boolean        Returns true if the document needs to be indexed (that is, there was no result)

is_grlist_set($phash_x)   X-Ref
Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page)

param: integer        Phash integer to test.
return: void

update_grlist($phash,$phash_x)   X-Ref
Checks if a grlist entry for this hash exists and, if not, writes one.

param: integer        phash of the search result that should be found
param: integer        The real phash of the current content. The two values are different when a page with userlogin turns out to contain the exact same content as another already indexed version of the page; This is the whole reason for the grlist table in fact...
return: void

updateTstamp($phash,$mtime=0)   X-Ref
Update tstamp for a phash row.

param: integer        phash value
param: integer        If set, update the mtime field to this value.
return: void

updateSetId($phash)   X-Ref
Update SetID of the index_phash record.

param: integer        phash value
return: void

updateParsetime($phash,$parsetime)   X-Ref
Update parsetime for phash row.

param: integer        phash value.
param: integer        Parsetime value to set.
return: void

updateRootline()   X-Ref
Update section rootline for the page

return: void

getRootLineFields(&$fieldArr)   X-Ref
Adding values for root-line fields.
rl0, rl1 and rl2 are standard. A hook might add more.

param: array        Field array, passed by reference
return: void

removeLoginpagesWithContentHash()   X-Ref
Removes any indexed pages with userlogins which have the same contentHash
NOT USED anywhere inside this class!

return: void

includeCrawlerClass()   X-Ref
Includes the crawler class

return: void

checkWordList($wl)   X-Ref
Adds new words to db

param: array        Word List array (where each word has information about position etc).
return: void

submitWords($wl,$phash)   X-Ref
Submits RELATIONS between words and phash

param: array        Word list array
param: integer        phash value
return: void

freqMap($freq)   X-Ref
Maps a frequency from a real number in [0;1] to an integer in [0;$this->freqRange], with anything above $this->freqMax treated as 1, and back.

param: double        Frequency
return: integer        Frequency in range.
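
One possible reading of the forward direction of this mapping is sketched below. It is an interpretation of the description, not the method's code: the function name mapFrequency() is hypothetical, the reverse mapping is left out, and the two limits are passed as arguments here although the description refers to them as properties of the object.

```php
<?php
/**
 * Hypothetical forward mapping only: relative frequency in [0;1] to an
 * integer in [0;$freqRange]. $freqRange and $freqMax correspond to what the
 * description calls $this->freqRange and $this->freqMax.
 */
function mapFrequency($freq, $freqRange, $freqMax)
{
    // Anything at or above $freqMax counts as the maximum (i.e. as 1).
    $relative = min($freq / $freqMax, 1.0);
    return (int)round($relative * $freqRange);
}
```

With sample limits of 32000 for the range and 0.1 for the maximum, a frequency of 0.05 lands in the middle of the range and every frequency of 0.1 or more maps to the top of it.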

setT3Hashes()   X-Ref
Get search hash, T3 pages

return: void

setExtHashes($file,$subinfo=array())   X-Ref
Get search hash, external files

param: string        File name / path which identifies it on the server
param: array        Additional content identifying the (subpart of) content. For instance; PDF files are divided into groups of pages for indexing.
return: array        Array with "phash_grouping" and "phash" inside.

md5inthash($str)   X-Ref
md5 integer hash
Uses 7 hex digits instead of 8 because that keeps the integers below 32 bit (28 bit), so they do not interfere with UNSIGNED integers or with PHP versions whose hexdec() output varies.

param: string        String to hash
return: integer        Integer interpretation of the md5 hash of input string.
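
The "7 instead of 8" remark can be illustrated with a short sketch, assuming the 7 refers to the leading hex digits of the md5 string. The function name md5IntHashSketch() is hypothetical.

```php
<?php
/**
 * Sketch of an md5-based integer hash as described for md5inthash():
 * the first 7 hex digits of the md5 hash, read as an integer.
 */
function md5IntHashSketch($str)
{
    // 7 hex digits = 28 bits, so the result is at most 0xFFFFFFF (268435455)
    // and always fits in a signed 32-bit integer.
    return hexdec(substr(md5($str), 0, 7));
}
```

Taking 8 hex digits instead would produce values up to 0xFFFFFFFF, which no longer fit in a signed 32-bit integer.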

makeCHash($paramArray)   X-Ref
Calculates the cHash value of input GET array (for constructing cHash values if needed)

param: array        Array of GET parameters to encode
return: void

log_push($msg,$key)   X-Ref
Push function wrapper for TT logging

param: string        Title to set
param: string        Key (?)
return: void

log_pull()   X-Ref
Pull function wrapper for TT logging

return: void

log_setTSlogMessage($msg, $errorNum=0)   X-Ref
Set log message function wrapper for TT logging

param: string        Message to set
param: integer        Error number
return: void

fe_headerNoCache(&$params, $ref)   X-Ref
Frontend hook: If the page is not being re-generated this is our chance to force it to be (because re-generation of the page is required in order to have the indexer called!)

param: array        Parameters from frontend
param: object        TSFE object (reference under PHP5)
return: void



Generated: Sun Nov 25 17:13:16 2007 by Balluche using PHPXref 0.7