Source code of TYPO3 4.1.3 |
This class is a search indexer for TYPO3
Author: | Kasper Skårhøj <kasperYYYY@typo3.com> |
Size: | 2071 lines (73 kb) |
Included or required: | 0 times |
Referenced: | 0 times |
Requires: | 0 files |
tx_indexedsearch_indexer:: (59 methods):
hook_indexContent()
backend_initIndexer()
backend_setFreeIndexUid()
backend_indexAsTYPO3Page()
init()
initializeExternalParsers()
indexTypo3PageContent()
splitHTMLContent()
getHTMLcharset()
convertHTMLToUtf8()
embracingTags()
typoSearchTags()
extractLinks()
extractHyperLinks()
indexExternalUrl()
getUrlHeaders()
indexRegularDocument()
readFileContent()
fileContentParts()
splitRegularContent()
charsetEntity2utf8()
processWordsInArrays()
procesWordsInArrays()
bodyDescription()
indexAnalyze()
analyzeHeaderinfo()
analyzeBody()
metaphone()
submitPage()
submit_grlist()
submit_section()
removeOldIndexedPages()
submitFilePage()
submitFile_grlist()
submitFile_section()
removeOldIndexedFiles()
checkMtimeTstamp()
checkContentHash()
checkExternalDocContentHash()
is_grlist_set()
update_grlist()
updateTstamp()
updateSetId()
updateParsetime()
updateRootline()
getRootLineFields()
removeLoginpagesWithContentHash()
includeCrawlerClass()
checkWordList()
submitWords()
freqMap()
setT3Hashes()
setExtHashes()
md5inthash()
makeCHash()
log_push()
log_pull()
log_setTSlogMessage()
fe_headerNoCache()
Class: tx_indexedsearch_indexer - X-Ref
Indexing class for TYPO3 frontend
hook_indexContent(&$pObj) X-Ref |
Parent Object (TSFE) Initialization param: object Parent Object (frontend TSFE object), passed by reference return: void |
backend_initIndexer($id, $type, $sys_language_uid, $MP, $uidRL, $cHash_array=array() X-Ref |
Initializing the "combined ID" of the page (phash) being indexed (or for which external media is attached) param: integer The page uid, &id= param: integer The page type, &type= param: integer sys_language uid, typically &L= param: string The MP variable (Mount Points), &MP= param: array Rootline array of only UIDs. param: array Array of GET variables to register with this indexing param: boolean If set, calculates a cHash value from the $cHash_array. You will probably not need this, since such cases are indexed through the frontend and the idea of this interface is to index non-cacheable pages from the backend. return: void |
backend_setFreeIndexUid($freeIndexUid, $freeIndexSetId=0) X-Ref |
Sets the free-index uid. Can be called right after backend_initIndexer() param: integer Free index UID param: integer Set id - an integer identifying the "set" of indexing operations. return: void |
backend_indexAsTYPO3Page($title, $keywords, $description, $content, $charset, $mtime, $crdate=0, $recordUid=0) X-Ref |
Indexing records as the content of a TYPO3 page. param: string Title equivalent param: string Keywords equivalent param: string Description equivalent param: string The main content to index param: string The charset of the title, keyword, description and body-content. MUST BE VALID, otherwise nothing is indexed! param: integer Last modification time, in seconds param: integer The creation date of the content, in seconds param: integer The record UID that the content comes from (for registration with the indexed rows) return: void |
init() X-Ref |
Initializes the object. $this->conf MUST be set with proper values prior to this call!!! return: void |
initializeExternalParsers() X-Ref |
Initialize external parsers return: void |
indexTypo3PageContent() X-Ref |
Start indexing of the TYPO3 page return: void |
splitHTMLContent($content) X-Ref |
Splits HTML content and returns an associative array, with title, a list of metatags, and a list of words in the body. param: string HTML content to index. To some degree expected to be made by TYPO3 (i.e. splitting the header by ":") return: array Array of content, having keys "title", "body", "keywords" and "description" set. |
getHTMLcharset($content) X-Ref |
Extract the charset value from HTML meta tag. param: string HTML content return: string The charset value if found. |
convertHTMLToUtf8($content,$charset='') X-Ref |
Converts a HTML document to utf-8 param: string HTML content, any charset param: string Optional charset (otherwise extracted from HTML) return: string Converted HTML |
embracingTags($string,$tagName,&$tagContent,&$stringAfter,&$paramList) X-Ref |
Finds the first occurrence of embracing tags and returns the embraced content and the original string with the tag removed in the two passed variables. Returns false if no match is found. Useful, for example, for finding the <title> of a document or removing <script> sections. param: string String to search in param: string Tag name, e.g. "script" param: string Passed by reference: Content inside found tag param: string Passed by reference: Content after found tag param: string Passed by reference: Attributes of the found tag. return: boolean Returns false if tag was not found, otherwise true. |
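The behaviour described for embracingTags() can be sketched roughly as follows. This is a minimal illustration written from the description above, not the class's actual code; the function name `embracing_tags_sketch` is hypothetical, and the handling of a missing end tag (empty content, everything after the start tag returned) is an assumption.

```php
<?php
// Hypothetical sketch of the behaviour described for embracingTags().
// Limitation: the tag name is matched as a prefix, so "<title" would
// also match "<titlebar>".
function embracing_tags_sketch(string $string, string $tagName, &$tagContent, &$stringAfter, &$paramList): bool
{
    $startTag = '<' . $tagName;
    $endTag = '</' . $tagName . '>';

    // Case-insensitive search for the opening tag.
    $start = stripos($string, $startTag);
    if ($start === false) {
        return false;
    }

    // Everything between the tag name and the next ">" is the attribute list.
    $openEnd = strpos($string, '>', $start);
    if ($openEnd === false) {
        return false;
    }
    $attrStart = $start + strlen($startTag);
    $paramList = trim(substr($string, $attrStart, $openEnd - $attrStart));

    $close = stripos($string, $endTag, $openEnd + 1);
    if ($close === false) {
        // Assumption: without an end tag the content is empty and the
        // remainder after the start tag is returned.
        $tagContent = '';
        $stringAfter = (string) substr($string, $openEnd + 1);
        return true;
    }

    $tagContent = substr($string, $openEnd + 1, $close - $openEnd - 1);
    // Original string with the whole tag (and its content) cut out.
    $stringAfter = substr($string, 0, $start) . substr($string, $close + strlen($endTag));
    return true;
}
```

Called with `'<html><head><title>Hello</title></head></html>'` and `'title'`, the sketch fills `$tagContent` with the title text and `$stringAfter` with the document minus the title element.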
typoSearchTags(&$body) X-Ref |
Removes content that shouldn't be indexed according to TYPO3SEARCH tags. param: string HTML Content, passed by reference return: boolean Returns true if a TYPO3SEARCH_ tag was found, otherwise false. |
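As an illustration of the filtering described for typoSearchTags(), the sketch below keeps only the text enclosed by `<!--TYPO3SEARCH_begin-->` and `<!--TYPO3SEARCH_end-->` comment markers. The function name `typo_search_tags_sketch` is hypothetical, and dropping everything outside the markers is an assumption drawn from the description rather than from the class source.

```php
<?php
// Hypothetical sketch: keep only content between TYPO3SEARCH_begin and
// TYPO3SEARCH_end markers. Returns true if any marker was found.
function typo_search_tags_sketch(string &$body): bool
{
    $parts = explode('<!--TYPO3SEARCH_', $body);
    if (count($parts) < 2) {
        // No marker at all: leave the body untouched.
        return false;
    }

    $kept = '';
    foreach ($parts as $i => $part) {
        if ($i === 0) {
            // Text before the first marker is not indexed.
            continue;
        }
        // Each part starts with "begin-->" or "end-->" followed by content.
        $pieces = explode('-->', $part, 2);
        $marker = trim($pieces[0]);
        $rest = $pieces[1] ?? '';
        if ($marker === 'begin') {
            $kept .= $rest;
        }
        // Content following an "end" (or unknown) marker is dropped.
    }

    $body = $kept;
    return true;
}
```

For a body such as `nav<!--TYPO3SEARCH_begin-->main<!--TYPO3SEARCH_end-->footer`, only `main` remains after the call.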
extractLinks($content) X-Ref |
Extract links (hrefs) from HTML content and if indexable media is found, it is indexed. param: string HTML content return: void |
extractHyperLinks($string) X-Ref |
Extracts all links to external documents from content string. param: string Content to analyse return: array Array of hyperlinks |
indexExternalUrl($externalUrl) X-Ref |
Indexes the HTML content of an external URL. param: string URL, e.g. "http://typo3.org/" return: void |
getUrlHeaders($url) X-Ref |
Getting HTTP request headers of URL param: string The URL param: integer Timeout (seconds?) return: mixed If no answer, returns false. Otherwise an array where HTTP headers are keys |
indexRegularDocument($file, $force=FALSE, $contentTmpFile='', $altExtension='') X-Ref |
Indexing a regular document given as $file (relative to PATH_site, local file) param: string Relative Filename, relative to PATH_site. It can also be an absolute path as long as it is inside the lockRootPath (validated with t3lib_div::isAbsPath()). Finally, if $contentTmpFile is set, this value can be anything, most likely a URL param: boolean If set, indexing is forced (despite content hashes, mtime etc). param: string Temporary file with the content to read it from (instead of $file). Used when the $file is a URL. param: string File extension for temporary file. return: void |
readFileContent($ext,$absFile,$cPKey) X-Ref |
Reads the content of an external file being indexed. The content from the external parser MUST be returned in utf-8! param: string File extension, e.g. "pdf", "doc" etc. param: string Absolute filename of file (must exist and be validated OK before calling function) param: string Pointer to section (zero for all other than PDF which will have an indication of pages into which the document should be split.) return: array Standard content array (title, description, keywords, body keys) |
fileContentParts($ext,$absFile) X-Ref |
Creates an array with pointers to divisions of document. param: string File extension param: string Absolute filename (must exist and be validated OK before calling function) return: array Array of pointers to sections that the document should be divided into |
splitRegularContent($content) X-Ref |
Splits non-HTML content (from external files for instance) param: string Input content (non-HTML) to index. return: array Array of content, having the key "body" set (plus "title", "description" and "keywords", but empty) |
charsetEntity2utf8(&$contentArr, $charset) X-Ref |
Convert character set and HTML entities in the value of input content array keys param: array Standard content array param: string Charset of the input content (converted to utf-8) return: void |
processWordsInArrays($contentArr) X-Ref |
Processing words in the array from split*Content -functions param: array Array of content to index, see splitHTMLContent() and splitRegularContent() return: array Content input array modified so each key is now a unique array of words |
procesWordsInArrays($contentArr) X-Ref |
Processing words in the array from split*Content -functions. This function is only a wrapper because the function has been renamed (see processWordsInArrays() above). param: array Array of content to index, see splitHTMLContent() and splitRegularContent() return: array Content input array modified so each key is now a unique array of words |
bodyDescription($contentArr) X-Ref |
Extracts the sample description text from the content array. param: array Content array return: string Description string |
indexAnalyze($content) X-Ref |
Analyzes content to use for indexing. param: array Standard content array: an array with the keys title, keywords, description and body, which all contain an array of words. return: array Index Array (whatever that is...) |
analyzeHeaderinfo(&$retArr,$content,$key,$offset) X-Ref |
Calculates relevant information for headercontent param: array Index array, passed by reference param: array Standard content array param: string Key from standard content array param: integer Bit-wise priority to type return: void |
analyzeBody(&$retArr,$content) X-Ref |
Calculates relevant information for bodycontent param: array Index array, passed by reference param: array Standard content array return: void |
metaphone($word,$retRaw=FALSE) X-Ref |
Creating metaphone based hash from input word param: string Word to convert param: boolean If set, returns the raw metaphone value (not hashed) return: mixed Metaphone hash integer (or raw value, string) |
submitPage() X-Ref |
Updates db with information about the page (TYPO3 page, not external media) return: void |
submit_grlist($hash,$phash_x) X-Ref |
Stores gr_list in the database. param: integer Search result record phash param: integer Actual phash of current content return: void |
submit_section($hash,$hash_t3) X-Ref |
Stores section $hash and $hash_t3 are the same for TYPO3 pages, but different when it is external files. param: integer phash of TYPO3 parent search result record param: integer phash of the file indexation search record return: void |
removeOldIndexedPages($phash) X-Ref |
Removes records for the indexed page, $phash param: integer phash value to flush return: void |
submitFilePage($hash,$file,$subinfo,$ext,$mtime,$ctime,$size,$content_md5h,$contentParts) X-Ref |
Updates db with information about the file param: array Array with phash and phash_grouping keys for file param: string File name param: array Array of "cHashParams" for files: This is for instance the page index for a PDF file (other document types it will be a zero) param: string File extension determining the type of media. param: integer Modification time of file. param: integer Creation time of file. param: integer Size of file in bytes param: integer Content HASH value. param: array Standard content array (using only title and body for a file) return: void |
submitFile_grlist($hash) X-Ref |
Stores file gr_list for a file IF it does not exist already param: integer phash value of file return: void |
submitFile_section($hash) X-Ref |
Stores file section for a file IF it does not exist param: integer phash value of file return: void |
removeOldIndexedFiles($phash) X-Ref |
Removes records for the indexed file, $phash param: integer phash value to flush return: void |
checkMtimeTstamp($mtime,$phash) X-Ref |
Checks the mtime / tstamp of the currently indexed page/file (based on phash). Returns a positive integer if the page needs to be indexed. param: integer mtime value to test against limits and indexed page (usually this is the mtime of the cached document) param: integer "phash" used to select any already indexed page to see what its mtime is. return: integer Result integer. Generally: <0 = No indexing, >0 = Do indexing (see $this->reasons):
-2) Min age was NOT exceeded and so indexing cannot occur.
-1) mtime matched so no need to reindex page.
0) N/A
1) Max age exceeded, page must be indexed again.
2) mtime of indexed page doesn't match mtime given for current content and we must index page.
3) No mtime was set, so we will index...
4) No indexed page found, so of course we will index. |
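The result codes listed for checkMtimeTstamp() can be read as a small decision function. The sketch below is an illustration only: the function name, the `$indexedRow` array with `item_mtime` and `tstamp` keys, the `$minAge` / `$maxAge` / `$now` parameters, and the order in which the conditions are tested are all assumptions; the real method takes a phash and looks the indexed row up itself.

```php
<?php
// Hypothetical sketch of the decision behind the result codes above.
// $indexedRow is null when no indexed page exists for the phash,
// otherwise an array with 'item_mtime' and 'tstamp' (assumed field names).
function check_mtime_tstamp_sketch(int $mtime, ?array $indexedRow, int $minAge, int $maxAge, int $now): int
{
    if ($indexedRow === null) {
        return 4;   // No indexed page found: index.
    }
    if ($maxAge > 0 && $indexedRow['tstamp'] + $maxAge < $now) {
        return 1;   // Max age exceeded: index again.
    }
    if ($minAge > 0 && $indexedRow['tstamp'] + $minAge >= $now) {
        return -2;  // Min age not exceeded: indexing cannot occur.
    }
    if ($mtime === 0) {
        return 3;   // No mtime given: index.
    }
    if ((int) $indexedRow['item_mtime'] !== $mtime) {
        return 2;   // mtime differs from the indexed one: index.
    }
    return -1;      // mtime matched: no need to reindex.
}
```

A caller would then index only when the returned value is greater than zero.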
checkContentHash() X-Ref |
Check content hash in phash table return: mixed Returns true if the page needs to be indexed (that is, there was no result), otherwise the phash value (in an array) of the phash record to which the grlist_record should be related! |
checkExternalDocContentHash($hashGr,$content_md5h) X-Ref |
Check content hash for external documents Returns true if the document needs to be indexed (that is, there was no result) param: integer phash value to check (phash_grouping) param: integer Content hash to check return: boolean Returns true if the document needs to be indexed (that is, there was no result) |
is_grlist_set($phash_x) X-Ref |
Checks if a grlist record has been set for the phash value input (looking at the "real" phash of the current content, not the linked-to phash of the common search result page) param: integer Phash integer to test. return: void |
update_grlist($phash,$phash_x) X-Ref |
Checks if a grlist entry for this hash exists and, if not, writes one. param: integer phash of the search result that should be found param: integer The real phash of the current content. The two values are different when a page with userlogin turns out to contain the exact same content as another already indexed version of the page; this is in fact the whole reason for the grlist table. return: void |
updateTstamp($phash,$mtime=0) X-Ref |
Update tstamp for a phash row. param: integer phash value param: integer If set, update the mtime field to this value. return: void |
updateSetId($phash) X-Ref |
Update SetID of the index_phash record. param: integer phash value return: void |
updateParsetime($phash,$parsetime) X-Ref |
Update parsetime for phash row. param: integer phash value. param: integer Parsetime value to set. return: void |
updateRootline() X-Ref |
Update section rootline for the page return: void |
getRootLineFields(&$fieldArr) X-Ref |
Adding values for root-line fields. rl0, rl1 and rl2 are standard. A hook might add more. param: array Field array, passed by reference return: void |
removeLoginpagesWithContentHash() X-Ref |
Removes any indexed pages with userlogins which have the same contentHash. NOT USED anywhere inside this class! return: void |
includeCrawlerClass() X-Ref |
Includes the crawler class return: void |
checkWordList($wl) X-Ref |
Adds new words to db param: array Word List array (where each word has information about position etc). return: void |
submitWords($wl,$phash) X-Ref |
Submits RELATIONS between words and phash param: array Word list array param: integer phash value return: void |
freqMap($freq) X-Ref |
Maps a frequency from a real number in [0;1] to an integer in [0;$this->freqRange], with anything above $this->freqMax as 1 and back. param: double Frequency return: integer Frequency in range. |
setT3Hashes() X-Ref |
Get search hash, T3 pages return: void |
setExtHashes($file,$subinfo=array() X-Ref |
Get search hash, external files param: string File name / path which identifies it on the server param: array Additional content identifying the (subpart of) content. For instance; PDF files are divided into groups of pages for indexing. return: array Array with "phash_grouping" and "phash" inside. |
md5inthash($str) X-Ref |
md5 integer hash. Using 7 instead of 8 just because that makes the integers lower than 32 bit (28 bit) and so they do not interfere with UNSIGNED integers or PHP versions which have varying output from the hexdec function. param: string String to hash return: integer Integer interpretation of the md5 hash of input string. |
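One way to read the md5inthash() description is that the first 7 hexadecimal digits of the md5 digest are converted to an integer, which keeps the value below 2^28. The sketch below shows that reading; the function name `md5_int_hash_sketch` is hypothetical and the exact expression is an assumption based on the "7 instead of 8" remark.

```php
<?php
// Hypothetical sketch: 7 hex digits carry 4 bits each, so the result
// always fits in 28 bits and stays clear of signed 32-bit limits.
function md5_int_hash_sketch(string $str): int
{
    // md5() yields 32 lowercase hex characters; keep the first 7.
    return (int) hexdec(substr(md5($str), 0, 7));
}
```

Because at most 28 bits are used, the value is safe to store in a signed or unsigned 32-bit integer column, at the cost of a higher collision probability than the full digest.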
makeCHash($paramArray) X-Ref |
Calculates the cHash value of input GET array (for constructing cHash values if needed) param: array Array of GET parameters to encode return: void |
log_push($msg,$key) X-Ref |
Push function wrapper for TT logging param: string Title to set param: string Key (?) return: void |
log_pull() X-Ref |
Pull function wrapper for TT logging return: void |
log_setTSlogMessage($msg, $errorNum=0) X-Ref |
Set log message function wrapper for TT logging param: string Message to set param: integer Error number return: void |
fe_headerNoCache(&$params, $ref) X-Ref |
Frontend hook: If the page is not being re-generated this is our chance to force it to be (because re-generation of the page is required in order to have the indexer called!) param: array Parameters from frontend param: object TSFE object (reference under PHP5) return: void |
Generated: Sun Nov 25 17:13:16 2007 | by Balluche using PHPXref 0.7 |